eScholarship
Open Access Publications from the University of California

UC Santa Cruz

UC Santa Cruz Electronic Theses and Dissertations

Understanding Long-Term Storage Access Patterns

Abstract

The past two decades have seen an explosion in both the size and the roles of long-term digital archival storage. While the traditional role of tertiary storage as an archive has persisted, many new use cases have emerged, such as public historical document archives and climate sensor data. Yet, despite this expansion, our understanding of long-term storage is out of date. We have no insight into how these new archival use cases behave, and even our understanding of tertiary storage behavior is decades old. Without up-to-date information on their behavior, we cannot validate the effectiveness of either current or future archival architectures.

To address this issue, my thesis explores a variety of new and old archival use cases, ranging from public historical data archives to private HPC tertiary storage systems. Our investigations produced three primary results that held true across a variety of archives. First, the oft-quoted "Write-once, Read-maybe" assumption was questionable in light of unpredictable user- and system-generated requests, calling into question the effectiveness of architectures that assume data is cold and immutable. Second, in contrast to enterprise storage, there was no clear subset of files responsible for most activity, making caching ineffective from the perspective of the archive. Third, aggregate accesses were largely unpredictable, but individual users showed strong locality of access, which can be leveraged to reduce the number of media accesses and improve overall system efficiency.

The latter portion of my thesis is informed by the difficulties we encountered in analyzing the various archival datasets we obtained. We found that most of these difficulties stemmed from a lack of knowledge about a dataset's coverage, that is, which actions were and were not captured. We approached this problem by developing a method we call expectation difference, or ExDiff. ExDiff uses a combination of metadata snapshots and access logs to derive an expected system state that can be compared to actual metadata. Differences between the expected state and reality provide clues as to what is and is not being captured in any given log. This coverage data can be used to improve a variety of storage system tasks ranging from trace analysis to debugging and intrusion detection.
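
The expectation-difference idea can be illustrated with a minimal sketch. The abstract does not describe how ExDiff is implemented, so everything here is an assumption made for illustration: the `LogEvent` record, the `expected_state` and `exdiff` function names, the three operation types, and the choice of (size, mtime) as the tracked metadata are all hypothetical. The sketch replays logged events on top of an earlier metadata snapshot, then compares the result with a later snapshot.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEvent:
    """One access-log record (hypothetical schema, not from the thesis)."""
    time: float
    path: str
    op: str        # assumed operations: "create", "write", "delete"
    size: int = 0  # resulting file size for create/write


def expected_state(snapshot, events):
    """Replay logged events over an earlier snapshot.

    snapshot maps path -> (size, mtime). The result is the metadata we
    would expect to see later if the log had captured every change.
    """
    state = dict(snapshot)
    for ev in sorted(events, key=lambda e: e.time):
        if ev.op in ("create", "write"):
            state[ev.path] = (ev.size, ev.time)
        elif ev.op == "delete":
            state.pop(ev.path, None)
    return state


def exdiff(expected, actual):
    """Return path -> (expected, actual) for every path that disagrees.

    A value of None means the path is absent on that side.
    """
    diffs = {}
    for path in expected.keys() | actual.keys():
        exp, act = expected.get(path), actual.get(path)
        if exp != act:
            diffs[path] = (exp, act)
    return diffs


# Toy data: "/b" disappears between snapshots, but no delete was logged.
before = {"/a": (100, 1.0), "/b": (50, 2.0)}
log = [
    LogEvent(5.0, "/a", "write", 120),
    LogEvent(6.0, "/c", "create", 10),
]
after = {"/a": (120, 5.0), "/c": (10, 6.0)}

diffs = exdiff(expected_state(before, log), after)
print(diffs)
```

In this toy run the only disagreement is `/b`, which the replay still expects but the later snapshot no longer contains. Under the assumptions above, that would suggest deletions are not being captured by the log, which is the kind of coverage clue the abstract describes.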
