Data Reliability Techniques for Specialized Storage Environments

Rosie Wacha

Technical Report UCSC-SSRC-09-02
March 17, 2009

Storage Systems Research Center
Baskin School of Engineering
University of California, Santa Cruz
Santa Cruz, CA 95064
UNIVERSITY OF CALIFORNIA

DATA RELIABILITY TECHNIQUES FOR SPECIALIZED STORAGE ENVIRONMENTS

A project submitted in partial satisfaction of the requirements for the degree of

MASTER OF SCIENCE

in

COMPUTER SCIENCE

by

Rosie Wacha

December 2008

The project of Rosie Wacha is approved:

Professor Darrell D. E. Long, Chair
Professor Ethan L. Miller

Acknowledgments

I would like to thank the following people for their help and support: Darrell Long, Ethan Miller, Thomas Schwarz, Scott Brandt, Gary Grider, James Nunez, John Bent, Ralph Becker-Szendy, Neerja Bhatnagar, Kevin Greenan, Bo Hong, Bo Adler, Alisa Neeman, Esteban Molina-Estolano, Valerie Aurora, Julya Wacha, Diane Wacha, and Noah Wacha.
I also want to thank the following organizations for funding my research: UC Regents, Graduate Assistance in Areas of National Need (GAANN), Los Alamos National Laboratory (LANL), and the Institute for Scalable Scientiﬁc Data Management (ISSDM).
Contents

Acknowledgments
List of Figures
List of Tables
Abstract

1 Introduction
2 Synthetic Parallel Applications
2.1 Introduction
2.2 Related Work
2.3 How to Create the SPA

Abstract
Data reliability has been extensively studied and techniques such as RAID and erasure coding are commonly used in storage systems. Real workload data is also important for storage systems research. We developed a tool to streamline the process of releasing workload data by automatically removing all non-I/O activity from software. The tool creates a Synthetic Parallel Application (SPA) that has the same I/O behavior as the original program when it is run. Next, we address reliability in the context of two speciﬁc storage environments, namely sensor networks and tape archives.
Sensor networks are made up of individual nodes that are highly constrained in power.
As storage costs fall, nodes increasingly store data locally, and transmission to a base station is reduced both to conserve power and to camouflage the network in hostile environments. We investigated the tradeoff between power and reliability for storage-based sensor networks using Reed-Solomon coding, XOR-based codes, and mirroring. Results show that our Reed-Solomon implementation provides higher reliability and more flexibility, but at a higher energy cost. Also, the XOR2 reliability scheme we designed provides reliability close to that of 4-way mirroring at half the storage overhead.
Commercial tape drives have high reliability ratings. However, an entire archive is made up of many individual drives. In order to achieve good write performance, data is often written in a striped pattern so that several tape drives are used to store a single file. Reliability is therefore a significant concern, and additional reliability techniques are often used. We investigated the performance overhead of row-diagonal parity (RDP) in the context of a large tape archive. Results show that our parallel implementation scales well for small numbers of nodes, with the initial write bandwidth doubling when the stripe size (and number of nodes) is doubled. Future work will compare the performance of RDP with Reed-Solomon and evaluate scalability with larger numbers of nodes.
Reliability can be achieved in many ways. The SPA project can help improve storage reliability by allowing software that normally could only be tested in a single environment to be run on different hardware setups. Sensor nodes often have very limited power available because of the locations in which they are deployed. The reliability of data measured by one node is not always essential, particularly if another nearby node measured the same data. The choice of reliability technique for a sensor network must be made in the context of these constraints.
The data stored in tape archives is often never read, but if it is needed it must be there. We can afford to dedicate some extra hardware to redundancy as long as performance is not significantly reduced. This project investigates these three areas of reliability.
Chapter 1 Introduction

One of the central requirements for most file systems research is good workload data. Most of the time this data is contained in a log of I/O requests, known as a trace. Collecting and releasing traces is not glamorous; file systems researchers typically do it only out of necessity, because it is a time-consuming process and because privacy concerns must be addressed before the data can be released.
The ﬁrst part of this project is a tool that simpliﬁes the process of collecting traces of real parallel applications and releasing them to the public. The basic input to the tool is a parallel application that can be run on a cluster. The tool runs the application and collects traces at each node. Then these traces are automatically analyzed to detect all I/O behavior and a new program, called a Synthetic Parallel Application (SPA), is written that will perform the same I/O activities at the same times. All non-I/O behaviors in the trace are ignored and not present in the SPA. Our results show that I/O traces collected from running the SPA closely match the original traces.
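To make the idea concrete, the following is a minimal sketch (in Python) of how a generated SPA might reissue recorded I/O calls at the same relative times. The trace record format and function name here are assumptions for illustration, not the actual format used by the tool; the real tool emits a standalone program rather than a replay loop.

```python
import time

def replay_trace(records):
    """records: list of (offset_seconds, op, path, size) tuples, sorted by time.
    Hypothetical trace format: each record is one I/O call from the original run."""
    files = {}
    start = time.time()
    for t, op, path, size in records:
        # Wait until the same relative time offset as in the original run.
        delay = t - (time.time() - start)
        if delay > 0:
            time.sleep(delay)
        if op == "open":
            files[path] = open(path, "w+b")
        elif op == "write":
            files[path].write(b"\0" * size)  # synthetic payload; original data is not reproduced
        elif op == "read":
            files[path].read(size)
        elif op == "close":
            files[path].close()
```

The essential point is that only the I/O calls and their timing survive; the computation that produced the data in the original application is discarded.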
The second part of this project is an investigation of reliability for two storage environments: sensor networks and tape archives. Good data reliability can be achieved by simply mirroring data on several disks. More copies of data provide more reliability. However, the hardware cost quickly grows unmanageable. Particularly in environments where traditional disks are not used or are only part of the storage system, more sophisticated reliability strategies are helpful.
Sensor nodes that store their data locally are increasingly being deployed in hostile and remote environments such as active volcanos and battleﬁelds. Observations gathered in these environments are often irreplaceable, and must be protected from loss due to node failures. Nodes may fail individually due to power depletion or hardware/software problems, or they may suffer correlated failures from localized destructive events such as ﬁre or rock fall.
While many ﬁle systems can guard against these events, they do not consider energy usage in their approach to redundancy. We examine tradeoffs between energy and reliability in three contexts: choice of redundancy technique, choice of redundancy nodes, and frequency of verifying correctness of remotely-stored data. By matching the choice of reliability techniques to the failure characteristics of sensor networks in hostile and inaccessible environments, we can build systems that use less energy while providing higher system reliability.
Tape drives were invented by IBM in the 1950s. Tape archives are still used for data that is written once and then rarely read or updated. Fast write performance can be achieved by writing data in a striped pattern: a very large file is broken up into several chunks and each chunk is written to a separate tape device. For example, a 128 GB file might be broken up into 128 chunks, each written to its own tape, so the time to write that file would be the time to write 1 GB. Striping like this is actually done on a much larger scale. The problem is that striping alone significantly degrades the reliability of that large file: if any one of those 128 tapes is damaged, the file cannot be reconstructed. Reliability in the context of such high performance requirements is quite challenging. For example, suppose a 1 GB tape cartridge is expected to last 30 years, which is a mean time between failures a little above 10^5 hours. If the entire archive contains 4000 cartridges, we expect to see a failure every day.
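To see where that estimate comes from, treating cartridge failures as independent and using the rounded 10^5-hour figure: the archive's aggregate failure rate is roughly 4000 / 10^5 = 0.04 failures per hour, or about 0.04 x 24 ≈ 1 failure per day.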
In high performance computing the stripe width can be very large, meaning that a single file may be broken into thousands of pieces, each stored on a separate device. A single parity provides some protection, but with thousands of devices it is not sufficient. We implemented a software RAID that performs mirroring, RAID 4, and Row-Diagonal Parity (RDP). We measured the performance of RDP to determine how much processing time is required to compute two parities.
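As a rough illustration of what computing those two parities involves, here is a minimal sketch of RDP encoding for a single stripe. This is a simplified stand-in, not the report's implementation; the function names and in-memory block layout are illustrative. RDP uses a prime p, with p-1 data disks, one row-parity disk, one diagonal-parity disk, and p-1 blocks per disk in each stripe.

```python
def xor_blocks(a, b):
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

def rdp_encode(data, p, block_size):
    """data[d][r] is the block in row r of data disk d (0 <= d < p-1).
    Returns (row_parity, diag_parity), each a list of p-1 blocks."""
    zero = bytes(block_size)
    rows = p - 1
    # Row parity: XOR across the data disks in each row.
    row_parity = [zero] * rows
    for r in range(rows):
        for d in range(p - 1):
            row_parity[r] = xor_blocks(row_parity[r], data[d][r])
    # Diagonal parity: block (r, disk) belongs to diagonal (r + disk) mod p.
    # The row-parity disk (index p-1) participates; diagonal p-1 is not stored.
    diag_parity = [zero] * rows
    disks = [data[d] for d in range(p - 1)] + [row_parity]
    for disk_idx, disk in enumerate(disks):
        for r in range(rows):
            diag = (r + disk_idx) % p
            if diag < rows:
                diag_parity[diag] = xor_blocks(diag_parity[diag], disk[r])
    return row_parity, diag_parity
```

Every parity block is computed with XOR only, which keeps the per-byte processing cost low; the measurements in this work quantify how much that cost matters in practice.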
In summary, we investigated reliability from several points of view in some very specific contexts. The first is that of the I/O workload and how it can affect the choice of reliability method for a storage system. The SPA provides a method for running the I/O subset of proprietary or private code on untrusted hardware. This allows more applications to be used as benchmarks for new algorithms and can help improve data reliability. Sensor networks typically have very specific constraints: limited power, cheap hardware that is more likely to fail, and deployment in hostile environments, each of which further increases the likelihood of node failure. The choice of reliability technique must address these constraints and provide reasonable reliability in creative ways. For example, storing a mirror copy of one node's data on another node far away in the network can protect the data better than storing that copy at a nearby neighbor. Lastly, tape archives have unusual access patterns and requirements.
Individual hardware components are relatively reliable. In larger systems, however, components are often used in parallel to improve performance, which results in much lower overall system reliability. A large file can be written to tape quickly, but reconstructing that one file then requires all of those tape drives to be functional. The performance impact of adding erasure coding techniques is important and must be addressed to ensure that performance is not degraded back toward what it was without striping.
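A simple illustration of that reliability drop, with hypothetical numbers rather than measurements from this work: if each of N devices holding a stripe is available independently with probability r, the striped file is readable only with probability r^N; for r = 0.99 and N = 128, that is 0.99^128 ≈ 0.28.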
Chapter 2 Synthetic Parallel Applications
2.1 Introduction

Workload data is useful for file systems researchers, particularly for simulations of new algorithms and designs. This data is available in a variety of forms, such as traces and benchmarks. Traces can log the behavior of an entire file system or of a single application. Traces can be quite large, especially if the application is long-running or performs many actions. For this reason, it is difficult to create traces regularly for changing workloads and applications. Also, both because of their large size and because of the private information they contain, traces are difficult to share with researchers outside a particular organization.
Benchmarks are used to evaluate a system under a specific load. Because designing an appropriate new benchmark is costly, existing benchmarks are often reused to evaluate many different systems. Benchmarks are often designed as synthetic programs that do not necessarily perform a useful computation, but instead stress the system in a specific way to measure certain characteristics, such as peak I/O bandwidth.
While this type of benchmark is useful for comparing systems under specific requirements, it does not capture system metrics under "typical" conditions of user applications on a system.
Ideally, real user applications could be released as benchmarks and then used to compare systems; conceptually, that is the goal of the first part of this project.
We created a tool that creates an I/O skeleton program, called a Synthetic Parallel Application, from a real scientiﬁc program. This work was completed while working at Los Alamos National Laboratory.