35th International Conference
on Massive Storage Systems
and Technology (MSST 2019)
May 20 — 24, 2019

Technically Co-
Sponsored by

Hosted at
Santa Clara University
Santa Clara, CA

2019 Conference

MSST 2019, as is our tradition, focused on distributed storage system technologies, including persistent memory, new memory technologies, long-term data retention (tape, optical disks...), solid state storage (flash, MRAM, RRAM...), software-defined storage, OS- and file-system technologies, cloud storage, big data, and data centers (private and public). The conference focused on current challenges and future trends in storage technologies.

MSST 2019 included a day of tutorials, two days of invited papers, and two days of peer-reviewed research papers. The conference was held, once again, on the beautiful campus of Santa Clara University, in the heart of Silicon Valley.

Many Thanks to Our Sponsors!

7:30 — 9:00 Registration / Breakfast

9:00 — 9:05 Introduction

Sean Roberts, Tutorial Chair

9:05 — 12:30 IME Storage System (slides)

Dr. Jean-Yves Vet, DDN (bio)

Paul Nowoczynski, DDN

DDN’s IME (aka "the Infinite Memory Engine") is an all NAND-flash storage system which acts as a high-performance storage tier in a user’s overall storage environment. IME has been built from the ground up as a highly-available, clustered storage technology which provides millions of IOPS to applications and best-case media endurance properties. IME’s top-rated capabilities report the highest overall performance for the most demanding data workloads as recorded by the independent IO500 organization.

The tutorial will focus on IME’s tiering ability along with performance demonstrations in difficult workload scenarios. Configuration, usage, and monitoring of IME, which will be done on a live cluster, will all be covered and attendees can expect to obtain a reasonable sense of an IME environment’s look and feel.

12:30 — 1:30 Lunch

1:30 — 5:00 Expanding the World of Heterogeneous Memory Hierarchies: The Evolving
Non-Volatile Memory Story (slides)

Bill Gervasi, Nantero (bio)

Emerging technologies and storage options are challenging the traditional system architecture hierarchies and giving designers new variables to consider. Existing options include module-level solutions such as 3D XPoint and NVDIMMs which bring data persistence onto the memory channel, each with a variety of tradeoffs in terms of cost, performance, and mechanical considerations. Emerging options include new non-volatile memory technologies capable of addressing the limitations of the current solutions with lower latency and predictable time to data persistence, a critical factor for high reliability data processing applications. Meanwhile, an increasing number of systems are moving towards distributed fabric-based backbones with heterogeneous computing elements, serving not only artificial intelligence and deep learning but also in-memory computing and non-von Neumann processing.

This tutorial is targeted at system architects who can appreciate the complexity of a confusing number of options, and would like some insights about managing the complexity to solve real world problems. Some of the standards in process are new, such as NVDIMM-P, DDR5 NVRAM, or Gen-Z specifications, so this is an opportunity to learn about future developments as well. The tutorial will allocate time for attendees to share their system integration stories as well, making it a joint learning experience for all.

7:30 — 8:30 Registration / Breakfast

8:30 — 9:30 Keynote

Session Chair: Meghan McClelland

A Perspective on the Past and Future of Magnetic Hard Drives

Dr. Mark Kryder, Carnegie Mellon University (bio)

The market for data storage technology is expanding at a rapid pace. IDC predicts that the total worldwide data will increase from 33 Zbytes in 2018 to 175 Zbytes in 2025. Moreover, the capacity of hard drives has increased from 5 Mbytes in 1956 to 15 Tbytes today, which is an increase of 3 million fold, while the weight has fallen from over a ton to 1.5 lbs. Historically, HDDs have been as critical to advances in computing as semiconductors and continue to be. The 5.25 inch disk drive, first introduced by Seagate in 1978, enabled the IBM PC, and the 2.5 and 1.8 inch disk drives enabled laptop computers. Although the growth rate of hard drive areal density has slowed significantly, the industry is still well over an order of magnitude away from the theoretical limits to the density that may be achieved, and new technologies such as two-dimensional magnetic recording (TDMR), heat assisted magnetic recording (HAMR), microwave assisted magnetic recording (MAMR) and bit patterned media recording (BPMR) promise to renew the areal density growth in the future. Although the capacity of data stored on an HDD has increased at a pace equal to that of Moore’s law for semiconductors, the improvement in performance of HDDs has lagged. This has contributed to making solid state drives (SSDs) based on flash technology attractive in both mobile devices and high performance applications, where either a limited amount of storage is needed or where fast read access time is critical. However, flash is more expensive than storage on hard drives, and this has limited its use in applications requiring large volumes of data, such as the cloud, which today is being enabled by HDDs. Recently, to improve performance, the industry has announced that it will introduce multiple actuators on its drives. This presentation will describe the evolution of high-density data storage technology and project what might be expected in the future.
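The growth figures quoted in this abstract can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1 Mbyte = 10^6 bytes, 1 Tbyte = 10^12 bytes) and derives the compound annual growth rate implied by the IDC forecast; the variable names are illustrative only.

```python
# Figures quoted in the abstract; decimal units are an assumption.
first_drive_bytes = 5e6       # 5 Mbytes (1956)
current_drive_bytes = 15e12   # 15 Tbytes (2019)

capacity_fold = current_drive_bytes / first_drive_bytes

data_2018_zb = 33.0
data_2025_zb = 175.0
years = 2025 - 2018

# Compound annual growth rate implied by the 2018 -> 2025 forecast.
cagr = (data_2025_zb / data_2018_zb) ** (1 / years) - 1

print(f"hard drive capacity increase: {capacity_fold:,.0f}x")
print(f"implied worldwide data growth: {cagr:.1%} per year")
```

The capacity ratio comes out at 3 million, matching the abstract, and the forecast corresponds to worldwide data growing by roughly a quarter each year.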

9:30 — 10:00 Break

10:00 — 12:00 Storage in the Age of AI

Session Chair: Dr. Glenn K. Lockwood

Tiering and Life Cycle Management with AI/ML Workloads (slides)

Jacob Farmer, Cambridge Computer, Starfish Storage (bio)

This talk takes a quick look at the pressures that machine learning workloads put on traditional HPC storage systems and proposes that organizations who embrace machine learning will want to up their games when it comes to data life cycle management. The talk then explores common approaches to namespace management, data movement, and data life cycle policy enforcement.

Storage, Security, and Privacy in the age of ML (slides)

Dr. Aleatha Parker-Wood, Humu

Storage and Data Challenges for Production Machine Learning (slides)

Dr. Nisha Talagala, Pyxeda AI (bio)

Machine Learning and Advanced Analytics are some of the most exciting and promising uses of the masses of data accumulated and stored over the last decade. However, as industries race to monetize the insights hidden in their data stores, new challenges emerge for storage and data management. The performance needs of AI workloads have been known for some time and Flash, for example, has been successfully applied to mitigate some of these challenges. As AI usage grows and becomes more dynamic and distributed (such as on edge), these performance requirements continue to expand and be coupled with other needs such as power efficiency. Secondly, as AI moves to production, other concerns are emerging, such as regulatory requirements and businesses’ needs to demonstrate AI trustworthiness while managing risk. These requirements generate new data challenges from security to provenance and governance. This talk will describe recent trends and focus areas in AI (such as productization, trust and distributed execution) and how they create challenges and opportunities for storage and data management systems. The talk will also cover how storage systems are used in production AI workflows and how innovations in storage and data management can impact and improve the production AI lifecycle.

I/O for Deep Learning at Scale (slides)

Quincey Koziol, National Energy Research Scientific Computing Center (NERSC)

Deep Learning is revolutionizing the fields of computer vision, speech recognition and control systems. In recent years, a number of scientific domains (climate, high-energy physics, nuclear physics, astronomy, cosmology, etc.) have explored applications of Deep Learning to tackle a range of data analytics problems. As one attempts to scale Deep Learning to analyze massive scientific datasets on HPC systems, data management becomes a key bottleneck. This talk will explore leading scientific use cases of Deep Learning in climate, cosmology, and high-energy physics on NERSC and OLCF platforms, enumerate I/O challenges, and speculate about potential solutions.

12:00 — 1:15 Lunch

1:15 — 2:45 Computational Memory and Storage for AI

Session Chair: Dr. Michał Simon

Storage in the New Age of AI/ML (slides)

Young Paik, Samsung (bio)

One of the hottest topics today is Artificial Intelligence/Machine Learning. Most of the attention has been on the enormous increases in computational power now possible with GPU/ASIC servers. Much less time has been spent on what is arguably just as important: the storage of the data that feeds these hungry beasts. There are many technologies that may be used (e.g. PCIe Gen4, erasure coding, smart storage, SmartNICs). However, in designing new storage architectures it is important to realize where these may not work well together. Young will describe some of the characteristics of machine learning systems, the methods to process the data to feed them, and what considerations should go into designing the storage for them.

How NVM Express and Computational Storage can make your AI Applications Shine! (slides)

Dr. Stephen Bates, Eideticom

Artificial Intelligence and Machine Learning are becoming dominant workloads in both data-centers and on the edge. However AI/ML require large amounts of input data for both training and inference. In this talk we will discuss how NVM Express and Computational Storage can vastly improve the performance and efficiency of AI/ML systems. We will give an introduction to both technologies and show how they can be deployed and the benefits to expect from them.

Changing Storage Architecture will require new Standards (slides)

Mark Carlson, Toshiba Memory Corporation

2:45 — 3:00 Break

3:00 — 4:30 User Requirements of Storage at Scale

Session Chair: Kristy Kallback-Rose

Storage systems requirements for massive throughput detectors at light sources (slides)

Dr. Amedeo Perazzo, SLAC National Accelerator Laboratory (bio)

This presentation describes the storage systems requirements for the upgrade of the Linac Coherent Light Source, LCLS-II, which will start operations at SLAC in 2021. These systems face formidable challenges due to the extremely high data throughput generated by the detectors and to the intensive computational demand for data processing and scientific interpretation.

NWSC Storage: A look at what users need (slides)

Chris Hoffman, National Center for Atmospheric Research (bio)

The NCAR HPC Storage Team provides large scale storage systems at the NCAR-Wyoming Supercomputing Center (NWSC). The requirements of the NWSC storage environment vary across areas such as atmospheric research, university research, and other domain specific work. As a service provider, architecting a large-scale storage environment that meets the diverse user needs is a challenging task. Addressing one requirement has effects on the rest of the environment. Thus, the requirements must be closely examined and considered. This talk will cover the user requirements and the impact on the storage environment.

Understanding Storage System Challenges for Parallel Scientific Simulations (slides)

Dr. Bradley Settlemyer, Los Alamos National Laboratory (bio)

Computer-based simulation is critical to the study of physical phenomena that are difficult or impossible to physically observe. Examples include asteroid collisions, chaotic interactions in climate models, and massless particle interactions. Long-running simulations, such as those running on Los Alamos National Laboratory's Trinity supercomputer, generate many thousands of snapshots of the simulation state that are written to stable storage for fault tolerance and visualization/analysis. For extreme scale simulation codes, such as the Vector Particle-in-Cell code (VPIC), improving the efficiency of storage system access is critical to accelerating scientific insight and discovery. In this talk we will discuss the structure of the VPIC software architecture, several storage system use cases associated with the VPIC simulation code, and the challenges associated with parallel access to the underlying storage system. We will not present solutions but instead focus on the underlying requirements of the scientific use cases including fault tolerance and emerging data analysis workloads that directly accelerate scientific discovery.

The Art of Storage

Eric Bermender, Pixar Animation Studios (bio)

Storage systems are designed to adapt to new looks and new plot lines in service to the stories our studio tells. "The Art of Storage" is a story about how creative decision making influences our storage architecture decisions. This overview will cover many of the storage technologies and considerations currently used within our production pipelines and what future technologies interest us in helping to push the boundaries of animation.

4:30 — 4:40 Break

4:40 — 5:30 Lightning Talks

Session Chair: Sean Roberts

All attendees are welcome to sign up to give an 8-10 minute presentation about their work in massive storage systems and technology. Sign-up board will be available all day, but sign up early to ensure you get a spot before the session is full.

Writing your own file system is easier than you think (slides)

Andrzej Jackowski, 9LivesData

Exascale Failure Modeling with CoFaCTOR (slides)

Dave Bonnie, Los Alamos National Laboratory

Tape’s Not Dead (slides)

Nick Balthaser, Lawrence Berkeley Laboratory—NERSC

Un-scratching Lustre (slides)

Cameron Harr, Lawrence Livermore National Laboratory

5:30 — 7:00 Cocktail Reception Sponsored by Aeon Computing

7:30 — 8:30 Registration / Breakfast

8:30 — 9:30 Keynote

Session Chair: Meghan McClelland

More than Storage (slides)

Dr. Margo Seltzer, University of British Columbia (bio)

The incredible growth and success that our field has experienced over the past half a century has had the side effect of transforming systems into a constellation of siloed fields; storage is one of them. I'm going to make the case that we should return to a broad interpretation of systems, undertake bolder, higher risk projects, and be intentional about how we interact with other fields. I'll support the case with examples of several research projects that embody this approach.

9:30 — 10:00 Break

10:00 — 12:00 Resilience at Scale

Session Chair: Dr. Glenn K. Lockwood

Rethinking End-to-end Reliability in Distributed Cloud Storage System (slides)

Dr. Asaf Cidon, Barracuda Networks

Due to its high performance and decreasing cost per bit, flash storage is the main storage medium in datacenters for hot data. However, flash endurance is a perpetual problem, and due to technology trends, subsequent generations of flash devices exhibit progressively shorter lifetimes before they experience uncorrectable bit errors. We propose addressing the flash lifetime problem by allowing devices to expose higher bit error rates. We present DIRECT, a set of techniques that harnesses distributed-level redundancy to enable the adoption of new generations of denser and less reliable flash storage technologies. DIRECT does so by using an end-to-end approach to increase the reliability of distributed storage systems. DIRECT allows distributed storage systems to tolerate a 10,000—100,000x higher bit error rate without experiencing application-visible errors. By significantly increasing the availability and durability of distributed storage systems in the face of bit errors, DIRECT helps extend flash lifetimes.
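The general idea of masking local bit errors with distributed redundancy can be illustrated with a toy replicated store. This is a minimal sketch, not the DIRECT implementation: the replicas are plain dictionaries, CRC32 stands in for whatever error detection a real system would use, and the function names are hypothetical.

```python
import zlib

def store(replicas, key, data):
    """Write the block and its CRC32 to every replica (plain dicts here)."""
    record = (data, zlib.crc32(data))
    for replica in replicas:
        replica[key] = record

def read_with_fallback(replicas, key):
    """Return the first copy whose checksum verifies, trying replicas in order."""
    for replica in replicas:
        data, checksum = replica[key]
        if zlib.crc32(data) == checksum:
            return data
    raise IOError(f"all replicas of {key!r} are corrupted")

replicas = [{}, {}, {}]
store(replicas, "block-0", b"hello storage")

# Simulate a single bit error on the first replica's copy.
data, crc = replicas[0]["block-0"]
replicas[0]["block-0"] = (bytes([data[0] ^ 0x01]) + data[1:], crc)

recovered = read_with_fallback(replicas, "block-0")
print(recovered)
```

The application never sees the corrupted copy, because the read path falls through to a healthy replica.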

A Storage Architecture for Resilient Assured Data (slides)

Paul D. Manno, Georgia Tech

Extreme-scale Data Resilience Trade-offs at Experimental Facilities (slides)

Dr. Sadaf R. Alam, Swiss National Supercomputing Centre (bio)

Large scale experimental facilities such as the Swiss Light Source and the free-electron X-ray laser SwissFEL at the Paul Scherrer Institute (PSI), and the particle accelerators and detectors at CERN are experiencing unprecedented data generation growth rates. Consequently, management, processing and storage requirements of data are increasing rapidly. The Swiss National Supercomputing Centre, CSCS, provides computing and storage capabilities, specifically related to a dedicated archiving system for scientific data, for PSI. This talk overviews performance and cost efficiency trade-offs for managing data at rest as well as data in motion for PSI workflows. This co-design approach is needed to address resiliency challenges at extreme scales, in particular, considering unique data generation capabilities at experimental facilities.

Practical erasure codes tradeoffs for scalable distributed storage systems (slides)

Cyril Guyot, Western Digital

Distributed storage systems use erasure codes to reliably and efficiently store data. We will first discuss the various code constructions—LRC, MSR, etc...—that have been used in practice, and examine the tradeoffs in system-level performance metrics that they create. We will then explore how novel storage device interfaces will be modifying some of those tradeoffs.
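For readers new to erasure codes, the simplest possible construction is a single XOR parity block, sketched below. It is far simpler than the LRC and MSR codes named in the abstract and tolerates only one lost block per stripe; the function names are illustrative.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def encode(data_blocks):
    """Return the stripe: the data blocks followed by one parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]

def recover(stripe, lost_index):
    """Rebuild the block at lost_index from all surviving blocks."""
    survivors = [block for i, block in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)

stripe = encode([b"AAAA", b"BBBB", b"CCCC"])
rebuilt = recover(stripe, 1)   # pretend the second data block was lost
print(rebuilt)
```

Note that rebuilding one block requires reading every surviving block in the stripe; repair traffic of this kind is one of the system-level costs that different code constructions trade against storage overhead.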

Rightscaling: Varying data safety techniques with scale (slides)

Lance Evans, Cray, Inc.

12:00 — 1:15 Lunch

1:15 — 3:15 Next Generation Storage Software

Session Chair: Meghan McClelland

How are new algorithms and storage technologies addressing the new requirements of AI and Big Science? How are virtual file systems bridging the gap between big repositories and usability?

CERN's Virtual File System for Global-Scale Software Delivery (slides)

Dr. Jakob Blomer, CERN (bio)

Delivering complex software across a worldwide distributed system is a major challenge in high-throughput scientific computing. Copying the entire software stack everywhere it’s needed isn’t practical—it can be very large, new versions of the software stack are produced on a regular basis, and any given job only needs a small fraction of the total software. To address application delivery in high-energy physics, the global scale virtual file system CernVM-FS distributes software to hundreds of thousands of machines around the world. It uses HTTP for data transport and it provides POSIX access to more than a billion files of application software stacks and operating system containers to end user devices, university clusters, clouds, and supercomputers. This presentation discusses key design choices and trade-offs in the file system architecture as well as practical experience of operating the infrastructure.

Disaggregated, Shared-Everything Infrastructure to Break Long-Standing Storage Tradeoffs (slides)

Renen Hallak, VAST Data (bio)

Storage architects have traditionally had to trade the various virtues of a storage system off against one another, sacrificing performance for capacity, scale for simplicity, or resilience for cost, among others. VAST’s Disaggregated Shared Everything Architecture (DASE) leverages the latest storage technologies including 3D XPoint, NVMe over Fabrics and QLC flash to break these tradeoffs.

This session will describe the DASE architecture and how it empowers VAST’s Universal Storage system to deliver all-flash performance at petabyte to exabyte scale and at a cost low enough for archival use cases. Customers using Universal Storage can therefore eliminate the islands of storage common in today’s datacenter and expand their data mining to all their data.

ScoutFS: POSIX Archiving at Extreme Scale (slides)

Zach Brown, Versity Software

ScoutFS is an open source clustered POSIX file system built to support archiving of extremely large file sets. This talk will summarize the challenges faced by sites that are managing large archives and delve into the solution Versity is developing. We'll explore the technical details of how POSIX can scale and how we index file system metadata concurrently across a cluster while operating at high bandwidth.

Grand Unified File Index: A Development, Deployment, and Performance Update (slides)

Dominic Manno, Los Alamos National Laboratory (bio)

Compute clusters are growing, and along with them the amount of data being generated is increasing. It is becoming more important for end-users and storage administrators to manage the data, especially when moving between tiers. The Grand Unified File Indexing (GUFI) system is a hybrid indexing capability designed to assist storage admins and users in managing their data. GUFI utilizes trees and embedded databases to securely provide very fast access to an indexed version of their metadata. In this talk we will provide an update on GUFI development, early performance results, and deployment strategies.

3:15 — 3:30 Break

3:30 — 5:00 Future Storage Systems

Session Chair: Dr. Bradley Settlemyer

Moore's Law coming to an end has parallels in the storage industry. What comes next? What lies beyond 10 years with respect to new nonvolatile media? What software approaches can help stem the tide in achieving peak performance and density?

Ultra-dense data storage and extreme parallelism with electronic-molecular systems

Dr. Karin Strauss, Microsoft Research (bio)

In this talk, I will explain how molecules, specifically synthetic DNA, can store digital data and perform certain types of special-purpose computation by leveraging tools already developed by the biotechnology industry.
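As a rough illustration of how digital data can map onto DNA, the sketch below packs two bits into each nucleotide. This naive mapping is an assumption made for illustration only; practical encodings add error correction and avoid sequences that are hard to synthesize or read.

```python
BASES = "ACGT"  # 00 -> A, 01 -> C, 10 -> G, 11 -> T (an arbitrary choice)

def bytes_to_dna(data: bytes) -> str:
    """Encode each byte as four nucleotides, most significant bits first."""
    out = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            out.append(BASES[(byte >> shift) & 0b11])
    return "".join(out)

def dna_to_bytes(strand: str) -> bytes:
    """Decode a strand produced by bytes_to_dna back into bytes."""
    if len(strand) % 4:
        raise ValueError("strand length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(strand), 4):
        byte = 0
        for base in strand[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)

strand = bytes_to_dna(b"MSST")
print(strand, dna_to_bytes(strand))
```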

The Future of Storage Systems – a Dangerous Opportunity (slides)

Rob Peglar, Advanced Computation and Storage, LLC (bio)

We are at a critical point concerning storage systems, in particular, how these systems are integrated into the larger whole of compute and network elements which comprise HPC infrastructure. The good news is, we have a plethora of technologies from which to choose – recording media, device design, subsystem construction, transports, filesystems, access methods, etc. The bad news is, we have lots of choices. This talk will explore the past, present and future of storage systems (emphasis on "systems", not just storage) and the dangerous opportunity we have to significantly improve the state of the art. Fair warning: this may involve the throwing down of gauntlets and the abandoning of long-held beliefs. Remember, as Einstein said in 1946, we cannot solve problems by using the same thinking that created them originally.

NRAM Defines a New Category of "Memory Class Storage" (slides)

Bill Gervasi, Nantero, Inc. (bio)

The value proposition for persistent memory is quite clear: making compute systems immune to data loss on power failure greatly simplifies system design. The emergence of many new persistent memory types including magnetic, phase change, and resistive has enabled these design changes, but differences between them have also confused the industry. This talk details Nantero NRAM, the first of a new class of persistent memories introduced in a new JEDEC specification in progress called “DDR5 NVRAM”. These emerging memories have the performance of a DRAM with data persistence, giving rise to a new term, "Memory Class Storage".

5:00 — 5:10 Break

5:00 — 6:00 Lightning Talks

Session Chair: Sean Roberts

STILTS—Or why LANL doesn’t use HSM (slides)

Dr. Bradley Settlemyer, Los Alamos National Laboratory

Architecting a 30PB all-flash file system (slides)

Dr. Glenn K. Lockwood, Lawrence Berkeley Laboratory

(* Indicates Presenter)

7:30 — 8:30 Registration / Breakfast

Fighting with Unknowns: Estimating the Performance of Scalable Distributed Storage
Systems with Minimal Measurement Data (paper, slides)

Moo-Ryong Ra, Hee Won Lee*
AT&T Labs Research

Constructing an accurate performance model for distributed storage systems has been identified as a very difficult problem. Researchers in this area either come up with an involved mathematical model specifically tailored to a target storage system or treat each storage system as a black box and apply machine learning techniques to predict the performance. Both approaches involve a significant amount of effort and data collection, which often take a prohibitive amount of time to be applied to real world scenarios. In this paper, we propose a simple, yet accurate, performance estimation technique for scalable distributed storage systems. We claim that the total processing capability per IO size is conserved across different mixes of read/write ratios and IO sizes. Based on this hypothesis, we construct a performance model which can be used to estimate the performance of an arbitrarily mixed IO workload. The proposed technique requires only a couple of measurement points per IO size in order to provide accurate performance estimation. Our preliminary results are very promising. Based on two widely-used distributed storage systems (i.e., Ceph and Swift) under different cluster configurations, we show that the total processing capability per IO size indeed remains constant. As a result, our technique was able to provide accurate prediction results.
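One way to read the conservation hypothesis is sketched below. This is an assumed interpretation, not the model from the paper: for a fixed IO size, each read is taken to consume 1/R of the system's capability and each write 1/W, where R and W are the measured read-only and write-only throughputs. The function name and the sample numbers are hypothetical.

```python
def estimate_mixed_iops(read_only_iops, write_only_iops, read_fraction):
    """Estimate throughput of a read/write mix from two measurement points.

    Assumes a fixed capability budget per IO size: a read costs
    1/read_only_iops of it and a write costs 1/write_only_iops.
    """
    if not 0.0 <= read_fraction <= 1.0:
        raise ValueError("read_fraction must be between 0 and 1")
    cost_per_io = (read_fraction / read_only_iops
                   + (1.0 - read_fraction) / write_only_iops)
    return 1.0 / cost_per_io

# Hypothetical measurements for one IO size.
for fraction in (1.0, 0.7, 0.5, 0.0):
    iops = estimate_mixed_iops(1000.0, 500.0, fraction)
    print(f"{fraction:.0%} reads: about {iops:.0f} IOPS")
```

Under this reading, only the two endpoint measurements per IO size are needed to estimate any read/write mix, which is in the spirit of the "couple of measurement points" mentioned in the abstract.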

A Performance Study of Lustre File System Checker: Bottlenecks and Potentials (paper, slides)

Dong Dai*, Om Rameshwar Gatla, Mai Zheng
UNC Charlotte, Iowa State University

Lustre, as one of the most popular parallel file systems in high-performance computing (HPC), provides a POSIX interface and maintains a large set of POSIX-related metadata, which could be corrupted due to hardware failures, software bugs, configuration errors, etc. The Lustre file system checker (LFSCK) is the remedy tool to detect metadata inconsistencies and to restore a corrupted Lustre to a valid state, and hence is critical for reliable HPC.

Unfortunately, in practice, LFSCK runs slowly in large deployments, making system administrators reluctant to use it as a routine maintenance tool. Consequently, cascading errors may lead to unrecoverable failures, resulting in significant downtime or even data loss. Given the fact that HPC is rapidly marching to Exascale and much larger Lustre file systems are being deployed, it is critical to understand the performance of LFSCK.

In this paper, we study the performance of LFSCK to identify its bottlenecks and analyze its performance potentials. Specifically, we design an aging method based on real-world HPC workloads to age Lustre to representative states, and then systematically evaluate and analyze how LFSCK runs on such an aged Lustre via monitoring the utilization of various resources. From our experiments, we find that the design and implementation of LFSCK is sub-optimal. It suffers from a scalability bottleneck on the metadata server (MDS), a relatively high fan-out ratio in network utilization, and unnecessary blocking among internal components. Based on these observations, we discuss potential optimizations and present some preliminary results.

Scalable QoS for Distributed Storage Clusters using Dynamic Token Allocation (paper, slides)

Yuhan Peng*, Qingyue Liu, Peter Varman
Rice University

The paper addresses the problem of providing performance QoS guarantees in a clustered storage system. Multiple related storage objects are grouped into logical containers called buckets, which are distributed over the servers based on the placement policies of the storage system. QoS is provided at the level of buckets. The service credited to a bucket is the aggregate of the IOs received by its objects at all the servers. The service depends on individual time-varying demands and congestion at the servers.

We present a token-based, coarse-grained approach to providing IO reservations and limits to buckets. We propose pShift, a novel token allocation algorithm that works in conjunction with token-sensitive scheduling at each server to control the aggregate IOs received by each bucket on multiple servers. pShift determines the optimal token distribution based on the estimated bucket demands and server IOPS capacities. Compared to existing approaches, pShift has far smaller overhead, and can be accelerated using parallelization and approximation. Our experimental results show that pShift provides accurate QoS among the buckets with different access patterns, and handles runtime demand changes well.
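The sketch below shows the basic bookkeeping behind token-based reservations: a bucket's tokens for one allocation period are split across servers in proportion to its estimated demand on each. It is a simplified stand-in and not pShift itself, which also accounts for server IOPS capacities when computing the distribution; the function and parameter names are hypothetical.

```python
def split_tokens(reservation, demand_by_server):
    """Split a bucket's token reservation across servers in proportion to demand.

    reservation: whole number of tokens for this allocation period.
    demand_by_server: mapping of server name -> estimated demand (IOs).
    """
    total = sum(demand_by_server.values())
    if total == 0:
        return {server: 0 for server in demand_by_server}
    shares = {s: reservation * d / total for s, d in demand_by_server.items()}
    tokens = {s: int(share) for s, share in shares.items()}
    leftover = reservation - sum(tokens.values())
    # Give the remaining tokens to the servers with the largest fractional share.
    by_fraction = sorted(shares, key=lambda s: shares[s] - tokens[s], reverse=True)
    for server in by_fraction[:leftover]:
        tokens[server] += 1
    return tokens

allocation = split_tokens(100, {"server-1": 30, "server-2": 10})
print(allocation)
```

Each server's scheduler would then serve a bucket's requests ahead of others while that bucket still holds tokens there, which is the role the abstract assigns to token-sensitive scheduling.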

FastBuild: Accelerating Docker Image Building for Efficient Development and Deployment of Containers
(paper, slides)

Zhuo Huang*, Song Wu, Song Jiang, Hai Jin
Huazhong University of Science and Technology, The University of Texas at Arlington

Docker containers have been increasingly adopted on various computing platforms to provide a lightweight virtualized execution environment. Compared to virtual machines, this technology can often reduce the launch time from a few minutes to less than 10 seconds, assuming the Docker image is locally available. However, Docker images are highly customizable, and are mostly built at run time from a remote base image by running instructions in a script (the Dockerfile). During the instruction execution, a large number of input files may have to be retrieved via the Internet. The image building may be an iterative process as one may need to repeatedly modify the Dockerfile until a desired image composition is achieved. In the process, each file required by an instruction has to be remotely retrieved, even if it has been recently downloaded. This can make the process of building and launching a container unexpectedly slow.

To address the issue, we propose a technique, named FastBuild, that maintains a local file cache to minimize the expensive file downloading. By non-intrusively intercepting remote file requests and supplying files locally, FastBuild enables file caching in a manner transparent to image building. To further accelerate the image building, FastBuild overlaps the execution of instructions with the writing of intermediate image layers to the disk. We have implemented FastBuild. Experiments with images and Dockerfiles obtained from Docker Hub show that our system can improve building speed by up to 10 times and reduce downloaded data by 72%.
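The caching idea can be illustrated with a small wrapper that serves repeated requests from a local store. This is a minimal sketch under simplifying assumptions, not FastBuild's mechanism: the class name is hypothetical, the "remote" fetch is an injected function rather than an intercepted network request, and cache freshness is ignored.

```python
class CachingFetcher:
    """Serve repeated requests for the same URL from a local cache."""

    def __init__(self, fetch_remote):
        self._fetch_remote = fetch_remote   # callable: url -> bytes
        self._cache = {}
        self.remote_fetches = 0

    def get(self, url):
        if url not in self._cache:
            self._cache[url] = self._fetch_remote(url)
            self.remote_fetches += 1
        return self._cache[url]

def fake_remote(url):
    """Stand-in for a slow download; returns deterministic content."""
    return f"contents of {url}".encode()

fetcher = CachingFetcher(fake_remote)
# Two simulated image builds that need the same input file.
first = fetcher.get("https://example.com/pkg.tar.gz")
second = fetcher.get("https://example.com/pkg.tar.gz")
print(fetcher.remote_fetches)
```

The second build is served entirely from the cache, which is the effect the abstract attributes to rebuilding an image after a small Dockerfile change.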

10:30 — 10:45 Break

BFO: Batch-File Operations on Massive Files for Consistent Performance Improvement (paper, slides)

Yang Yang*, Qiang Cao, Hong Jiang
Huazhong University of Science and Technology, University of Texas at Arlington

Existing local file systems, designed to support a typical single-file access pattern only, can lead to poor performance when accessing a batch of files, especially small files. This single-file pattern essentially serializes accesses to batched files one by one, resulting in a large number of non-sequential, random, and often dependent I/Os between file data and metadata at the storage ends. We first experimentally analyze the root cause of such inefficiency in batch-file accesses. Then, we propose a novel batch-file access approach, referred to as BFO for its set of optimized Batch-File Operations, by developing novel BFOr and BFOw operations for fundamental read and write processes respectively, using a two-phase access for metadata and data jointly. The BFO offers dedicated interfaces for batch-file accesses and additional processes integrated into existing file systems without modifying their structures and procedures. We implement a BFO prototype on ext4, one of the most popular file systems. Our evaluation results show that the batch-file read and write performances of BFO are consistently higher than those of the traditional approaches regardless of access patterns, data layouts, and storage media, with synthetic and real-world file sets. BFO improves the read performance by up to 22.4x and 1.8x with HDD and SSD respectively; and boosts the write performance by up to 111.4x and 2.9x with HDD and SSD respectively. BFO also demonstrates consistent performance advantages when applied to four representative applications, Linux cp, Tar, GridFTP, and Hadoop.
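The two-phase pattern can be approximated at user level, as sketched below: metadata for the whole batch is gathered first, and data is then read in an order derived from that metadata. This is only an analogy to BFO, which works inside the file system; sorting by inode number is an assumed stand-in for on-disk layout, and `batch_read` is a hypothetical helper.

```python
import os
import tempfile

def batch_read(paths):
    """Read a batch of files in two phases instead of file by file."""
    # Phase 1: gather metadata for every file in the batch.
    metadata = [(os.stat(path), path) for path in paths]
    # Phase 2: read the data in inode order (a rough proxy for disk layout).
    contents = {}
    for _, path in sorted(metadata, key=lambda item: item[0].st_ino):
        with open(path, "rb") as handle:
            contents[path] = handle.read()
    return contents

with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i in range(5):
        path = os.path.join(tmp, f"small-{i}.txt")
        with open(path, "wb") as handle:
            handle.write(f"file {i}".encode())
        paths.append(path)
    contents = batch_read(paths)
    print(len(contents), "files read")
```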

vPFS+: Managing I/O Performance for Diverse HPC Applications (paper, slides)

Ming Zhao*, Yiqi Xu
Arizona State University, VMware

High-performance computing (HPC) systems are increasingly shared by a variety of data- and metadata-intensive parallel applications. However, existing parallel file systems employed for HPC storage management are unable to differentiate the I/O requests from concurrent applications and meet their different performance requirements. Previous work, vPFS, provided a solution to this problem by virtualizing a parallel file system and enabling proportional-share bandwidth allocation to the applications. But it cannot handle the increasingly diverse applications in today's HPC environments, including those that have different sizes of I/Os and those that are metadata-intensive. This paper presents vPFS+ which builds upon the virtualization framework provided by vPFS but addresses its limitations in supporting diverse HPC applications. First, a new proportional-share I/O scheduler, SFQ(D)+, is created to allow applications with various I/O sizes and issue rates to share the storage with good application-level fairness and system-level utilization. Second, vPFS+ extends the scheduling to also include metadata I/Os and provides performance isolation to metadata-intensive applications. vPFS+ is prototyped on PVFS2, a widely used open-source parallel file system, and evaluated using a comprehensive set of representative HPC benchmarks and applications (IOR, NPB BTIO, WRF, and multi-md-test). The results confirm that the new SFQ(D)+ scheduler can provide significantly better performance isolation to applications with small, bursty I/Os than the traditional SFQ(D) scheduler (3.35 times better) and the native PVFS2 (8.25 times better) while still making efficient use of the storage. The results also show that vPFS+ can deliver near-perfect proportional sharing (>95% of the target sharing ratio) to metadata-intensive applications.

Accelerating Relative-error Bounded Lossy Compression for HPC datasets with
Precomputation-Based Mechanisms (paper, slides)

Xiangyu Zou, Tao Lu, Wen Xia, Xuan Wang, Weizhe Zhang, Sheng Di*, Dingwen Tao, Franck Cappello
Harbin Institute of Technology, Marvell Technology Group, Argonne National Laboratory, University of Alabama

Scientific simulations in high-performance computing (HPC) environments are producing vast volumes of data, which may cause a severe I/O bottleneck at runtime and a huge burden on storage space for post-analysis. In this work, we develop efficient precomputation-based mechanisms in the SZ lossy compression framework for HPC datasets. Our mechanisms can avoid costly logarithmic transformation and identify quantization factor values via a fast table lookup, greatly accelerating the relative-error bounded compression with excellent compression ratios. In addition, our mechanisms also help reduce traversing operations for Huffman decoding, and thus significantly accelerate the decompression process in SZ. Experiments with four well-known real-world scientific simulation datasets show that our solution can improve the compression rate by about 30% and decompression rate by about 70% in most cases, making our designed lossy compression strategy the best choice in class in most cases.

12:15 — 1:15 Lunch

Efficient Encoding and Reconstruction of HPC Datasets for Checkpoint/Restart (paper, slides)

Jialing Zhang, Xiaoyan Zhuo, Aekyeung Moon, Hang Liu, Seung Woo Son*
University of Massachusetts Lowell

As the amount of data produced by HPC applications reaches the exabyte range, compression techniques are often adopted to reduce the checkpoint time and volume. Since lossless techniques are limited in their ability to achieve appreciable data reduction, lossy compression becomes a preferable option. In this work, a lossy compression technique with highly efficient encoding, purpose-built error control, and high compression ratios is proposed. Specifically, we apply a discrete cosine transform with a novel block decomposition strategy directly to double-precision floating point datasets instead of prevailing prediction-based techniques. Further, we design an adaptive quantization with two specific task-oriented quantizers: guaranteed error bounds and higher compression ratios. Using real-world HPC datasets, our approach achieves 3x–38x compression ratios while guaranteeing specified error bounds, showing comparable performance with state-of-the-art lossy compression methods, SZ and ZFP. Moreover, our method provides viable reconstructed data for various checkpoint/restart scenarios in the FLASH application, thus is considered to be a promising approach for lossy data compression in HPC I/O software stacks.

Tiered-ReRAM: A Low Latency and Energy Efficient TLC Crossbar ReRAM Architecture (paper, slides)

Yang Zhang*, Dan Feng, Wei Tong, Jingning Liu, Chengning Wang, Jie Xu
Huazhong University of Science and Technology

Resistive Memory (ReRAM) is promising to be used as high density storage-class memory by employing Triple-Level Cell (TLC) and crossbar structures. However, TLC crossbar ReRAM suffers from high write latency and energy due to the IR drop issue and the iterative program-and-verify procedure. In this paper, we propose Tiered-ReRAM architecture to overcome the challenges of TLC crossbar ReRAM. The proposed Tiered-ReRAM consists of three components, namely Tiered-crossbar design, Compression-based Incomplete Data Mapping (CIDM), and Compression-based Flip Scheme (CFS). Specifically, based on the observation that the magnitude of IR drops is primarily determined by the long length of bitlines in Double-Sided Ground Biasing (DSGB) crossbar arrays, Tiered-crossbar design splits each long bitline into the near and far segments by an isolation transistor, allowing the near segment to be accessed with decreased latency and energy. Moreover, in the near segments, CIDM dynamically selects the most appropriate IDM for each cache line according to the saved space by compression, which further reduces the write latency and energy with insignificant space overhead. In addition, in the far segments, CFS dynamically selects the most appropriate flip scheme for each cache line, which ensures more high resistance cells written into crossbar arrays and effectively reduces the leakage energy. For each compressed cache line, the selected IDM or flip scheme is applied on the condition that the total encoded data size will never exceed the original cache line size. The experimental results show that, on average, Tiered-ReRAM can improve the system performance by 30.5%, reduce the write latency by 35.2%, decrease the read latency by 26.1%, and reduce the energy consumption by 35.6%, compared to an aggressive baseline.

vNVML: An Efficient Shared Library for Virtualizing and Sharing Non-volatile Memories (paper, slides)

Chih Chieh Chou*, Jaemin Jung, Narasimha Reddy, Paul Gratz, Doug Voigt
Texas A&M University, Hewlett Packard Enterprise

The emerging non-volatile memory (NVM) has attractive characteristics such as DRAM-like low latency together with the non-volatility of storage devices. Recently, byte-addressable, memory bus-attached NVM has become available. This paper addresses the problem of combining a smaller, faster byte-addressable NVM with a larger, slower storage device, like SSD, to create the impression of a larger and faster byte-addressable NVM which can be shared across many applications.

In this paper, we propose vNVML, a user space library for virtualizing and sharing NVM. vNVML provides applications with transaction-like memory semantics that ensure write ordering and persistency guarantees across system failures. vNVML exploits DRAM for read caching, to enable improvements in performance and potentially to reduce the number of writes to NVM, extending the NVM lifetime. vNVML is implemented and evaluated with realistic workloads to show that our library allows applications to share NVM, both in a single OS and when Docker-like containers are employed. The results from the evaluation show that vNVML incurs less than 10% overhead while providing the benefits of an expanded virtualized NVM space to the applications, allowing applications to safely share the virtual NVM.

Towards Virtual Machine Image Management for Persistent Memory (paper, slides)

Jiachen Zhang*, Lixiao Cui, Peng Li, Xiaoguang Liu, Gang Wang
Nankai University

Persistent memory's (PM) byte-addressability and high capacity make it attractive for virtualized environments. Modern virtual machine monitors virtualize PM using either I/O virtualization or memory virtualization. However, I/O virtualization sacrifices PM's byte-addressability, and memory virtualization offers no opportunity for PM image management. In this paper, we enhance QEMU's memory virtualization mechanism. The enhanced system can achieve both PM's byte-addressability inside virtual machines and PM image management outside the virtual machines. We also design pcow, a virtual machine image format for PM, which is compatible with our enhanced memory virtualization and supports storage virtualization features including thin provisioning, base images, and snapshots. Address translation is performed with the help of Extended Page Table (EPT), and is thus much faster than in image formats implemented in I/O virtualization. We also optimize pcow considering PM's characteristics. The evaluation demonstrates that our scheme boosts the overall performance by up to 50x compared with qcow2, an image format implemented in I/O virtualization, and brings almost no performance overhead compared with the native memory virtualization.

3:15 — 3:30 Break

Pattern-based Write Scheduling and Read Balance-oriented Wear-leveling for Solid State Drivers
(paper, slides)

Jun Li*, Xiaofei Xu, Xiaoning Peng, Jianwei Liao
Southwestern University

This paper proposes a pattern-based I/O scheduling mechanism, which identifies frequently written data with patterns and dispatches them to the same SSD blocks having a small erase count. The data on the same block are most likely to be invalidated together, so that the overhead of garbage collection can be greatly reduced. Moreover, a read balance-oriented wear-leveling scheme is introduced to extend the lifetime of SSDs. Specifically, it distributes the hot read data in the blocks with a small erase count, to heavily erased blocks in different chips of the same SSD channel, while carrying out wear-leveling. As a result, internal parallelism at the chip level of SSD can be fully exploited for achieving better read data throughput. We conduct a series of simulation tests with a number of disk traces of real-world applications under the SSDsim platform. The experimental results show that the newly proposed mechanism can reduce garbage collection overhead by 11.3%, and the read response time by 12.8% on average, compared to existing approaches of scheduling and wear-leveling for SSDs.

When NVMe over Fabrics Meets Arm: Performance and Implications (paper, slides)

Yichen Jia*, Eric Anger, Feng Chen
Louisiana State University, ARM Inc.

A growing technology trend in the industry is to deploy highly capable and power-efficient storage servers based on the Arm architecture. An important driving force behind this is storage disaggregation, which separates compute and storage to different servers, enabling independent resource allocation and optimized hardware utilization. The recently released remote storage protocol specification, NVMe-over-Fabrics (NVMeoF), makes flash disaggregation possible by reducing the remote access overhead to the minimum. It is highly appealing to integrate the two promising technologies together to build an efficient Arm based storage server with NVMeoF.

In this work, we have conducted a set of comprehensive experiments to understand the performance behaviors of NVMeoF on Arm-based Data Center SoC and to gain insight into the implications of their design and deployment in data centers. Our experiments show that NVMeoF delivers the promised ultra-low latency. With appropriate optimizations on both hardware and software, NVMeoF can achieve even better performance than direct attached storage. Specifically, with appropriate NIC optimizations, we have observed a throughput increase by up to 42.5% and a decrease of the 95th percentile tail latency by up to 14.6%. Based on our measurement results, we also discuss several system implications for integrating NVMeoF on Arm based platforms. Our studies show that this system solution can well balance the computation, network, and storage resources for data-center storage services. Our findings have also been reported to Arm and Broadcom for future optimizations.

XORInc: Optimizing Data Repair and Update for Erasure-Coded Systems with XOR-Based
In-Network Computation (paper, slides)

Yingjie Tang*, Fang Wang, Yanwen Xie, Xuehai Tang
Huazhong University of Science and Technology, Institute of Information Engineering, Chinese Academy of Sciences

Erasure coding is widely used in distributed storage systems due to its significant storage efficiency compared with replication at the same fault tolerance level. However, erasure coding introduces high cross-rack traffic since (1) repairing a single failed data block needs to read other available blocks from multiple nodes and (2) updating a data block triggers parity updates for all parity blocks. In order to alleviate the impact of this traffic on the performance of erasure coding, many works concentrate on designing new transmission schemes to increase bandwidth utilization among multiple storage nodes, but they don’t actually reduce network traffic.

With the emergence of programmable network devices, the concept of in-network computation has been proposed. The key idea is to offload compute operations onto intermediate network devices. Inspired by this idea, we propose XORInc, a framework that utilizes programmable network devices to XOR data flows from multiple storage nodes so that XORInc can effectively reduce network traffic (especially the cross-rack traffic) and eliminate network bottlenecks. Under XORInc, we design two new transmission schemes, NetRepair and NetUpdate, to optimize the repair and update operations, respectively. We implement XORInc based on HDFS-RAID and SDN to simulate an in-network computation framework. Experiments on a local testbed show that NetRepair reduces the repair time to almost the same as the normal read time and reduces the network traffic by up to 41%, while NetUpdate reduces the update time and traffic by up to 74% and 30%, respectively.
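A minimal sketch of why XORing flows inside the network can cut repair traffic. It assumes a single XOR parity block per stripe, which is a simplification rather than the code used in the paper; the function name xor_blocks, the block contents, and the traffic counters are hypothetical, and the "switch" is simulated by an ordinary function call.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the bytewise XOR of equally sized blocks."""
    if not blocks:
        raise ValueError("need at least one block")
    size = len(blocks[0])
    if any(len(b) != size for b in blocks):
        raise ValueError("blocks must have equal length")
    # zip(*blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


# Hypothetical stripe: three data blocks plus one XOR parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Block 1 is lost; the survivors are the other data blocks and the parity.
survivors = [data[0], data[2], parity]

# Conventional repair: every surviving block travels to the repairing node.
blocks_sent_conventional = len(survivors)

# In-network repair: a simulated switch XORs the incoming flows and
# forwards one combined block, which already equals the lost block.
repaired = xor_blocks(survivors)
blocks_sent_in_network = 1
```

Under these assumptions the repairing node receives one block instead of three, which is the kind of reduction in-network aggregation aims for.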

(* Indicates Presenter)

7:30 — 8:30 Registration / Breakfast

Wear-aware Memory Management Scheme for Balancing Lifetime and Performance of Multiple NVM Slots
(paper, slides)

Chunhua Xiao, Linfeng Cheng, Lei Zhang, Duo Liu, Weichen Liu, (Yujuan Tan*)
Chongqing University, Nanyang Technological University

Emerging Non-Volatile Memory (NVM) has many advantages, such as near-DRAM speed, byte-addressability, and persistence. Modern computer systems contain many memory slots, which are exposed as a unified storage interface by shared address space. Since NVM has limited write endurance, many wear-leveling techniques are implemented in hardware. However, existing hardware techniques are only effective within a single NVM slot and cannot ensure wear-leveling among multiple NVM slots.

This paper explores how to optimize a storage system with multiple NVM slots in terms of performance and lifetime. We show that simple integration of multiple NVMs in traditional memory policies results in poor reliability. We also reveal that existing hardware wear-leveling technologies are ineffective for a system with multiple NVM slots.

In this paper, we propose a common wear-aware memory management scheme for in-memory file systems. The proposed scheme enables wear-aware control of NVM slot use, which minimizes the cost in performance and lifetime. We implemented the proposed memory management scheme and evaluated its effectiveness. The experiments show that the proposed wear-aware memory management scheme can improve the wear-leveling effect by more than 2600x, prolong the lifetime of NVM by 2.5x, and improve the write performance by up to 15%.

CeSR: A Cell State Remapping Strategy to Reduce Raw Bit Error Rate of MLC NAND Flash (paper, slides)

Yutong Zhao*, Wei Tong, Jingning Liu, Dan Feng, Hongwei Qin
Huazhong University of Science and Technology

Retention errors and program interference errors have been recognized as the two main types of NAND flash errors. Since NAND flash cells in the erased state which hold the lowest threshold voltage are least likely to cause program interference and retention errors, existing schemes preprocess the raw data to increase the ratio of cells in the erased state. However, such schemes do not effectively decrease the ratio of cells with the highest threshold voltage which are most likely to cause program interference and retention errors. In addition, we note that the dominant error type of flash varies with data hotness. Retention errors are not too much of a concern for frequently updated hot data while cold data that is rarely updated need to worry about the growing retention errors as P/E cycles increase. Furthermore, the effects of these two types of errors on the same cell partially counteract each other. Given the observation that retention errors and program interference errors are both cell-state-dependent, this paper presents a cell state remapping (CeSR) strategy based on the error tendencies of data with different hotness. For different types of data segments, CeSR adopts different flipping schemes to remap the cell states in order to achieve the least error-prone data pattern for written data with different hotness. Evaluation shows that the proposed CeSR strategy can reduce the raw bit error rates of hot and cold data by up to 20.30% and 67.24%, respectively, compared with the state-of-the-art NRC strategy.

Parallel all the time: Plane Level Parallelism Exploration for High Performance SSD (paper, slides)

Congming Gao*, Liang Shi, Jason Chun Xue, Cheng Ji, Jun Yang, Youtao Zhang
Chongqing University, East China Normal University, City University of Hong Kong, University of Pittsburgh

Solid state drives (SSDs) are constructed with a multi-level parallel organization, including channels, chips, dies and planes. Among these parallel levels, plane level parallelism, which is the last level of parallelism in SSDs, has the strictest restrictions. Only operations of the same type that access the same address in different planes can be processed in parallel. In order to maximize the access performance, several previous works have been proposed to exploit the plane level parallelism for host accesses and internal operations of SSDs. However, our preliminary studies show that the plane level parallelism is far from well utilized and should be further improved. The reason is that the strict restrictions of plane level parallelism are hard to satisfy. In this work, a from-plane-to-die parallel optimization framework is proposed to exploit the plane level parallelism by smartly satisfying the strict restrictions all the time. In order to achieve the objective, there are at least two challenges. First, because host access patterns are always complex, receiving multiple same-type requests to different planes at the same time is uncommon. Second, there are many internal activities, such as garbage collection (GC), which may violate the restrictions. To solve the above challenges, two schemes are proposed in the SSD controller: First, a die level write construction scheme is designed to make sure there are always N pages of data written by each write operation. Second, in a further step, a die level GC scheme is proposed to activate GC in the unit of all planes in the same die. Combining the die level write and die level GC, write accesses from both host write operations and GC-induced valid page movements can be processed in parallel at all times. As a result, the GC cost and average write latency can be significantly reduced. Experimental results show that the proposed framework is able to significantly improve the write performance without read performance impact.

Economics of Information Storage: The Value in Storing the Long Tail (paper, slides)

James Hughes*
University of California, Santa Cruz (bio)

We have witnessed a 50 million-fold increase in hard disk drive density without a similar increase in performance. How can this unbalanced growth be possible? Can it continue? Can similar unbalanced growth happen in other media? To answer these questions we contrast the value of information storage services with the value of physical storage services. We describe a methodology that separates the costs of capturing, storing and accessing information, and we will show that these aspects of storage systems are independent of each other. We provide arguments for what can happen if the cost of storage continues to decrease. The conclusions are three-fold. First, as capacity of any storage media grows, there is no inherent requirement that performance increase at the same rate. Second, the value of increased capacity devices can be quantified. Third, as the cost of storing information approaches zero, the quantity of information stored will grow without limit.

10:30 — 10:45 Break

DFPE: Explaining Predictive Models for Disk Failure Prediction (paper, slides)

Yanwen Xie*, Dan Feng, Fang Wang, Xuehai Tang, Jizhong Han, Xinyan Zhang
Huazhong University of Science and Technology, Chinese Academy of Sciences

Recent research works on disk failure prediction achieve a high detection rate and a low false alarm rate with complex models at the cost of explainability. The lack of explainability is likely to hide bias or overfitting in the models, resulting in bad performance in real-world applications. To address the problem, we propose a new explanation method DFPE designed for disk failure prediction to explain failure predictions made by a model and infer prediction rules learned by a model. DFPE explains failure predictions by performing a series of replacement tests to find out the failure causes while it explains models by aggregating explanations for the failure predictions. A presented use case on a real-world dataset shows that compared to current explanation methods, DFPE can explain more about failure predictions and models with more accuracy. Thus it helps to target and handle the hidden bias and overfitting, measures feature importances from a new perspective and enables intelligent failure handling.

Mitigate HDD Fail-Slow by Pro-actively Utilizing System-level Data Redundancy
with Enhanced HDD Controllability and Observability (paper, slides)

Jingpeng Hao*, Yin Li, Xubin Chen, Tong Zhang
Rensselaer Polytechnic Institute

This paper presents a design framework aiming to mitigate occasional HDD fail-slow. Due to their mechanical nature, HDDs may occasionally suffer from spikes of abnormally high internal read retry rates, leading to temporarily significant degradation of speed (especially the read latency). Intuitively, one could expect that existing system-level data redundancy (e.g., RAID or distributed erasure coding) may be opportunistically utilized to mitigate HDD fail-slow. Nevertheless, current practice tends to use system-level redundancy merely as a safety net, i.e., reconstruct data sectors via system-level redundancy only after the costly intra-HDD read retry fails. This paper shows that one could much more effectively mitigate occasional HDD fail-slow by more pro-actively utilizing existing system-level data redundancy, in complement to (or even replacement of) intra-HDD read retry. To enable this, HDDs should support a higher degree of controllability and observability in terms of their internal read retry operations. Assuming a very simple form of enhanced HDD controllability and observability, this paper presents design solutions and a mathematical formulation framework to facilitate the practical implementation of such a pro-active strategy for mitigating occasional HDD fail-slow. Using RAID as a test vehicle, our experimental results show that the proposed design solutions can effectively mitigate the RAID read latency degradation even when HDDs suffer from read retry rates as high as 1% or 2%.

Adjustable flat layouts for Two-Failure Tolerant Storage Systems (paper, slides)

Thomas Schwarz*
Marquette University

Systems suffer component failure at sometimes unpredictable rates. Storage systems are no exception; they add redundancy in order to deal with various types of failures. The additional storage constitutes an important capital and operational cost and needs to be dimensioned appropriately. Unfortunately, storage device failure rates are difficult to predict and change over the lifetime of the system.

Large disk-based storage centers provide protection against failure at the level of objects. However, this abstraction makes it difficult to adjust to a batch of devices that fail at a higher than anticipated rate. We propose here a solution that uses large pods of storage devices of the same kind, but that can re-organize in response to an increased number of failures of components seen elsewhere in the system or to an anticipated higher failure rate such as infant mortality or end-of-life fragility.

Here, I present ways of organizing user data and parity data that allow us to move from three-failure tolerance to two-failure tolerance and back. A storage system using disk drives that might be suffering from infant mortality can switch from an initially three-failure-tolerant layout to a two-failure-tolerant one when disks have been burnt in. It gains capacity by shedding failure tolerance that has become unnecessary. A storage system using Flash can sacrifice capacity for reliability as its components have undergone many write-erase cycles and thereby become less reliable.

Adjustable reliability is easy to achieve using a standard layout based on RAID Level 6 stripes where it is easy to convert components containing user data to ones containing parity data. Here, we present layouts that, unlike the RAID layout, use only exclusive-or operations and do not depend on sophisticated but power-hungry processors. Their main advantage is a noticeable increase in reliability over RAID Level 6.

12:15 — 1:15 Lunch

AZ-Code: An Efficient Availability Zone Level Erasure Code to Provide High Fault
Tolerance in Cloud Storage Systems (paper, slides)

Xin Xie, Chentao Wu*, Junqing Gu, Han Qiu, Jie Li, Minyi Guo, Xubin He, Yuanyuan Dong, Yafei Zhao
Shanghai Jiao Tong University, Temple University, Alibaba Group

As data in modern cloud storage systems grows dramatically, it is common to partition data and store it in different Availability Zones (AZs). Multiple AZs not only provide high fault tolerance (e.g., rack level tolerance or disaster tolerance), but also reduce the network latency. Replication and Erasure Codes (EC) are typical data redundancy methods to provide high reliability for storage systems. Compared with the replication approach, erasure codes can achieve much lower monetary cost with the same fault-tolerance capability. However, the recovery cost of EC is extremely high in a multiple-AZ environment, especially because of its high bandwidth consumption in data centers. LRC is a widely used EC to reduce the recovery cost, but the storage efficiency is sacrificed. MSR code is designed to decrease the recovery cost with high storage efficiency, but its computation is too complex.

To address this problem, in this paper, we propose an erasure code for multiple availability zones (called AZ-Code), which is a hybrid code that takes advantage of both MSR and LRC codes. AZ-Code utilizes a specific MSR code as the local parity layout, and a typical RS code is used to generate the global parities. In this way, AZ-Code can keep low recovery cost with high reliability. To demonstrate the effectiveness of AZ-Code, we evaluate various erasure codes via mathematical analysis and experiments in Hadoop systems. The results show that, compared to the traditional erasure coding methods, AZ-Code saves the recovery bandwidth by up to 78.24%.

Long-Term JPEG Data Protection and Recovery for NAND Flash-Based Solid-State Storage (paper, slides)

Yu-Chun Kuo*, Ruei-Fong Chiu, Ren-Shuo Liu
Department of Electrical Engineering, National Tsing Hua University

NAND flash memory is widely used in solid-state storage including SD cards and eMMC chips, in which JPEG pictures are one of the most valuable data. In this work, we study NAND flash memory-aware, long-term JPEG data protection and recovery. Our goal is to increase the robustness of JPEG files stored in flash-based storage and rescue JPEG files that are corrupted due to long-term retention. We conduct real-system experiments by storing JPEG files on 16 nm, 3-bit-per-cell flash chips and letting the JPEG files undergo a retention process equivalent to ten years at 25 degrees Celsius. Experimental results show that the proposed techniques can rescue corrupted JPEG files to achieve a significant PSNR improvement.

Parity-Only Caching for Robust Straggler Tolerance (paper, slides)

Mi Zhang*, Qiuping Wang, Zhirong Shen, Patrick P. C. Lee
The Chinese University of Hong Kong

Stragglers (i.e., nodes with slow performance) are prevalent and incur performance instability in large-scale storage systems, yet it is challenging to detect stragglers in practice. We make a case by showing how erasure-coded caching provides robust straggler tolerance without relying on timely and accurate straggler detection, while incurring limited redundancy overhead in caching. We first analytically motivate that caching only parity blocks can achieve effective straggler tolerance. To this end, we present POCache, a parity-only caching design that provides robust straggler tolerance. To limit the erasure coding overhead, POCache slices blocks into smaller subblocks and parallelizes the coding operations at the subblock level. Also, it leverages a straggler-aware cache algorithm that takes into account both file access popularity and straggler estimation to decide which parity blocks should be cached. We implement a POCache prototype atop Hadoop 3.1 HDFS, while preserving the performance and functionalities of normal HDFS operations. Our extensive experiments on both local and Amazon EC2 clusters show that in the presence of stragglers, POCache can reduce the read latency by up to 87.9% compared to vanilla HDFS.
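The intuition that cached parity lets a read skip a straggler can be sketched with a toy latency model. This is not POCache itself: the function read_latency, its parameters, and the latency figures are hypothetical, and the model assumes an erasure code in which any k of the k + p blocks of a stripe suffice to reconstruct the data.

```python
def read_latency(data_latencies, cached_parities=0, cache_latency=0.0):
    """Time to obtain enough blocks to serve a read of one stripe.

    data_latencies: per-node fetch time for each of the k data blocks.
    cached_parities: parity blocks of this stripe held in the cache.
    cache_latency: time to read a parity block from the cache.
    """
    k = len(data_latencies)
    if k == 0:
        raise ValueError("stripe must contain at least one data block")
    if not 0 <= cached_parities <= k:
        raise ValueError("cached_parities must be between 0 and k")

    # Without cached parity, the read waits for the slowest data block.
    wait_for_all = max(data_latencies)
    if cached_parities == 0:
        return wait_for_all

    # With p cached parities, only the fastest k - p data blocks are needed;
    # the remaining blocks are decoded from the cached parities.
    needed = k - cached_parities
    fastest = sorted(data_latencies)[:needed]
    with_parity = max(fastest + [cache_latency])
    return min(wait_for_all, with_parity)


# Hypothetical stripe of four data blocks; the last node is a straggler.
latencies_ms = [10.0, 12.0, 11.0, 200.0]
plain = read_latency(latencies_ms)
with_cache = read_latency(latencies_ms, cached_parities=1, cache_latency=1.0)
```

The model ignores decoding cost, which the abstract says POCache limits by slicing blocks into subblocks and parallelizing the coding operations.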

Metadedup: Deduplicating Metadata in Encrypted Deduplication via Indirection (paper, slides)

Jingwei Li, Patrick P. C. Lee, Yanjing Ren*, Xiaosong Zhang
University of Electronic Science and Technology of China, The Chinese University of Hong Kong

Encrypted deduplication combines encryption and deduplication in a seamless way to provide confidentiality guarantees for the physical data in deduplication storage, yet it incurs substantial metadata storage overhead due to the additional storage of keys. We present a new encrypted deduplication storage system called Metadedup, which suppresses metadata storage by also applying deduplication to metadata. Its idea builds on indirection, which adds another level of metadata chunks that record metadata information. We find that metadata chunks are highly redundant in real-world workloads and hence can be effectively deduplicated. In addition, metadata chunks can be protected under the same encrypted deduplication framework, thereby providing confidentiality guarantees for metadata as well. We evaluate Metadedup through microbenchmarks, prototype experiments, and trace-driven simulation. Metadedup has limited computational overhead in metadata processing, and only adds 6.19% of performance overhead on average when storing files in a networked setting. Also, for real-world backup workloads, Metadedup saves the metadata storage by up to 97.46% at the expense of only up to 1.07% of indexing overhead for metadata chunks.
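The indirection idea can be illustrated with a minimal sketch that leaves out encryption and key management entirely. The names store_file, SEGMENT, and meta_store, as well as the chunk contents, are hypothetical: each file's per-chunk metadata entries are grouped into metadata chunks, and those metadata chunks are themselves deduplicated by fingerprint.

```python
import hashlib

SEGMENT = 4  # hypothetical number of metadata entries per metadata chunk


def fingerprint(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def store_file(chunks, meta_store):
    """Store a file's metadata through one level of indirection.

    Returns the file recipe: a list of metadata-chunk fingerprints.
    meta_store maps fingerprint -> metadata chunk and is shared by all files.
    """
    # One metadata entry per data chunk (here just its fingerprint).
    entries = [fingerprint(c) for c in chunks]
    recipe = []
    for i in range(0, len(entries), SEGMENT):
        meta_chunk = "\n".join(entries[i:i + SEGMENT]).encode()
        fp = fingerprint(meta_chunk)
        # Identical metadata chunks are stored only once.
        meta_store.setdefault(fp, meta_chunk)
        recipe.append(fp)
    return recipe


meta_store = {}
# Two hypothetical backups: the second changes only the last four chunks.
backup1 = [f"chunk-{i}".encode() for i in range(8)]
backup2 = backup1[:4] + [f"new-{i}".encode() for i in range(4)]
recipe1 = store_file(backup1, meta_store)
recipe2 = store_file(backup2, meta_store)
```

Here the two backups produce four metadata chunks in total, but only three distinct ones are kept because the first segment is identical in both.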

3:15 — 3:30 Break

CDAC: Content-Driven Deduplication-Aware Storage Cache (paper, slides)

Yujuan Tan*, Jing Xie, Congcong Xu, Zhichao Yan, Hong Jiang, Yajun Zhao, Min Fu, Xianzhang Chen, Duo Liu, Wen Xia
Chongqing University, HP, University of Texas Arlington, Sangfor, Harbin Institute of Technology

Data deduplication, a proven technology for effective data reduction in backup and archive storage systems, also shows promise in increasing the logical capacity of storage caches by removing redundant data. Our in-depth evaluation of existing deduplication-aware caching algorithms reveals that they do improve hit ratios compared to caching algorithms without deduplication, especially when the cache block size is set to 4KB. However, when the block size is larger than 4KB, a clear trend in modern storage systems, their hit ratios drop significantly. In that case, the slight increase in hit ratio due to deduplication may not improve overall storage performance because of the high overhead that deduplication introduces.

To address this problem, in this paper we propose CDAC, a Content-driven Deduplication-Aware Cache, which focuses on exploiting the blocks’ content redundancy and their intensity of content sharing among source addresses in cache management strategies. We have implemented CDAC on top of the LRU and ARC algorithms, called CDAC-LRU and CDAC-ARC respectively. Our extensive experimental results show that CDAC-LRU and CDAC-ARC outperform the state-of-the-art deduplication-aware caching algorithms, D-LRU and D-ARC, by up to 19.49X in read cache hit ratio, with an average of 1.95X under real-world traces when the cache size ranges from 20% to 80% of the working set size and the block size ranges from 4KB to 64KB.
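
To make the notion of a deduplication-aware cache concrete, here is a minimal sketch in which several source addresses can share one cached content block, and eviction considers how many addresses share each block. The eviction rule shown (pick the least-shared block among the few least recently used ones) is an assumption chosen for illustration and is not the CDAC-LRU or CDAC-ARC policy; `DedupAwareCache` and `window` are hypothetical names.

```python
import hashlib
from collections import OrderedDict


class DedupAwareCache:
    def __init__(self, capacity_blocks, window=2):
        self.capacity = capacity_blocks  # number of unique content blocks
        self.window = window             # LRU candidates compared on eviction
        self.addr_to_fp = {}             # source address -> content fingerprint
        self.content = OrderedDict()     # fingerprint -> (data, addresses), LRU order

    def put(self, addr, data):
        fp = hashlib.sha256(data).hexdigest()
        self._unlink(addr)
        if fp not in self.content:
            if len(self.content) >= self.capacity:
                self._evict()
            self.content[fp] = (data, set())
        self.content[fp][1].add(addr)
        self.content.move_to_end(fp)
        self.addr_to_fp[addr] = fp

    def get(self, addr):
        fp = self.addr_to_fp.get(addr)
        if fp is None:
            return None
        self.content.move_to_end(fp)
        return self.content[fp][0]

    def _unlink(self, addr):
        # Drop the old mapping of addr; free the block if nobody shares it.
        old = self.addr_to_fp.pop(addr, None)
        if old is not None:
            addrs = self.content[old][1]
            addrs.discard(addr)
            if not addrs:
                del self.content[old]

    def _evict(self):
        # Among the least recently used blocks, evict the least-shared one.
        candidates = list(self.content)[: self.window]
        victim = min(candidates, key=lambda f: len(self.content[f][1]))
        for addr in self.content[victim][1]:
            del self.addr_to_fp[addr]
        del self.content[victim]
```

Because capacity counts unique content blocks, a cache of two blocks can serve three or more addresses when their contents repeat.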

SES-Dedup: a Case for Low-Cost ECC-based SSD Deduplication (paper, slides)

Zhichao Yan*, Hong Jiang, Song Jiang, Yujuan Tan, Hao Luo
The University of Texas-Arlington, Hewlett Packard Enterprise (Nimble Storage), Chongqing University, Twitter

Integrating the data deduplication function into Solid State Drives (SSDs) helps avoid writing duplicate contents to NAND flash chips, which will not only effectively reduce the number of Program/Erase (P/E) operations to extend the device's lifespan but also proportionally enlarge the logical capacity of SSD to improve the performance of its behind-the-scenes maintenance jobs such as wear-leveling (WL) and garbage-collection (GC). However, these benefits of deduplication come at a non-trivial computational cost incurred by the embedded SSD controller to compute cryptographic hashes. To address this overhead problem, some researchers have suggested replacing cryptographic hashes with error correction codes (ECCs) already embedded in the SSD chips to detect the duplicate contents. However, all existing attempts have ignored the impact of the data randomization (scrambler) module that is widely used in modern SSDs, thus making it impractical to directly integrate ECC-based deduplication into commercial SSDs. In this work, we revisit SSD's internal structure and propose the first deduplicatable SSD that can bypass the data scrambler module to enable the low-cost ECC-based data deduplication. Specifically, we propose two design solutions, one on the host side and the other on the device side, to enable ECC-based deduplication. Based on our approach, we can effectively exploit SSD's built-in ECC module to calculate the hash values of stored data for data deduplication. We have evaluated our SES-Dedup approach by feeding data traces to the SSD simulator and found that it can remove up to 30.8% redundant data with up to 17.0% write performance improvement over the baseline SSD.
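
The scrambler problem can be illustrated with a toy model: if the scrambler mixes each page with a pseudo-random sequence that depends on where the page is written, identical contents no longer look identical by the time an error-correcting code is computed over them. The sketch below is only a model under loud assumptions: the address-seeded XOR scrambler is a simplification, CRC32 stands in for the ECC bits, and `scramble` and `ecc_stand_in` are hypothetical names unrelated to any real SSD controller.

```python
import random
import zlib


def scramble(data: bytes, page_addr: int) -> bytes:
    """Toy scrambler: XOR the data with an address-seeded keystream."""
    rng = random.Random(page_addr)
    return bytes(b ^ rng.randrange(256) for b in data)


def ecc_stand_in(data: bytes) -> int:
    """Stand-in for the ECC bits a controller would compute (CRC32 here)."""
    return zlib.crc32(data)


page = b"same content" * 8

# Before scrambling, equal contents yield equal codes, so the code can
# serve as a cheap duplicate indicator.
code_before = ecc_stand_in(page)

# After scrambling, the same content written to two addresses differs,
# so codes computed downstream of the scrambler cannot be compared directly.
at_100 = scramble(page, 100)
at_200 = scramble(page, 200)
```

In this model, computing the code before the scrambler (or bypassing the scrambler) restores the property that duplicates share a code, which is the direction the abstract describes.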

LIPA: A Learning-based Indexing and Prefetching Approach for data deduplication (paper, slides)

Guangping Xu*, Chi Wan Sung, Quan Yu, Hongli Lu, Bo Tang
Tianjin University of Technology, City University of Hong Kong, WHUT, TJUT

In this paper, we present a learning-based data deduplication algorithm, called LIPA, which uses the reinforcement learning framework to build an adaptive indexing structure. It differs considerably from previous inline chunk-based deduplication methods in how it addresses the chunk-lookup disk bottleneck problem for large-scale backup. In previous methods, a full chunk index or a sampled chunk index is often required to identify duplicate chunks, which is a critical stage for data deduplication. The full chunk index is hard to fit in RAM, and the sampled chunk index directly affects the deduplication ratio, depending on the sampling ratio. Our learning-based method requires only a small memory overhead to store the index but achieves the same or even better deduplication ratio than previous methods.

In our method, after the data stream is broken into relatively large segments, one or more representative chunk fingerprints are chosen as the feature of a segment. An incoming segment may share the same feature with previous segments. Thus we use a key-value structure to record the relationship between features and segments: a feature maps to a fixed number of segments. Using reinforcement learning, we learn the similarity of each of these segments to a feature, represented as a score. For an incoming segment, our method adaptively prefetches a segment and its successors into the cache using a multi-armed bandit model. Our experimental results show that our method significantly reduces memory overhead and achieves effective deduplication.
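
The feature-to-segments table with learned scores can be sketched as below. Each feature keeps a fixed number of candidate segments, and a bandit-style rule picks which one to prefetch. The abstract does not say which bandit strategy is used, so the epsilon-greedy choice, the incremental score update, and the names `FeatureIndex`, `slots`, `epsilon`, and `lr` are all assumptions made for illustration.

```python
import random


class FeatureIndex:
    def __init__(self, slots=2, epsilon=0.1, lr=0.5, seed=0):
        self.slots = slots        # fixed number of segments per feature
        self.epsilon = epsilon    # exploration probability
        self.lr = lr              # step size of the score update
        self.table = {}           # feature -> list of [segment_id, score]
        self.rng = random.Random(seed)

    def choose(self, feature):
        """Pick the segment to prefetch for an incoming segment's feature."""
        arms = self.table.get(feature)
        if not arms:
            return None
        if self.rng.random() < self.epsilon:
            return self.rng.choice(arms)[0]        # explore
        return max(arms, key=lambda a: a[1])[0]    # exploit best score

    def update(self, feature, segment_id, reward):
        """Move a segment's score toward the observed reward.

        The reward could be, for example, the fraction of incoming chunks
        found in the prefetched segment (an assumption).
        """
        arms = self.table.setdefault(feature, [])
        for arm in arms:
            if arm[0] == segment_id:
                arm[1] += self.lr * (reward - arm[1])
                return
        if len(arms) >= self.slots:
            arms.remove(min(arms, key=lambda a: a[1]))  # drop the weakest
        arms.append([segment_id, reward])
```

Only features and a few segment identifiers per feature are held in memory, which matches the small index footprint the abstract describes.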

2019 Organizers
Conference Co-Chairs	Dr. Ahmed Amer, Dr. Sam Coleman
Tutorial Chair	Sean Roberts
Invited Track Program Co-Chairs	Dr. Glenn K. Lockwood, Dr. Michal Simon
Research Track Program Co-Chairs	James Hughes, Thomas Schwarz
Research Track Program Committee
Communications Chair	Meghan Wingate McClelland
Local Arrangements Chair	Prof. Yuhong Liu
Registration Chair	Prof. Behnam Dezfouli

Page Updated January 12, 2024

Sponsored by Santa Clara University, School of Engineering
