ThemisMR: An I/O-Efficient MapReduce
eScholarship
Open Access Publications from the University of California

Abstract

"Big Data" computing increasingly relies on the MapReduce programming model for scalable processing of large data collections. Many MapReduce jobs are I/O-bound, so minimizing the number of I/O operations is critical to improving their performance. In this work, we present ThemisMR, a MapReduce implementation that reads and writes each data record to disk exactly twice, which is the minimum possible for data sets that cannot fit in memory. To minimize I/O, ThemisMR makes fundamentally different design decisions from previous MapReduce implementations. ThemisMR performs a wide variety of MapReduce jobs, including click log analysis, DNA read sequence alignment, and PageRank, at nearly the speed of TritonSort’s record-setting sort performance.
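
To make the I/O argument concrete, the following back-of-the-envelope sketch compares the disk traffic of a design that touches each record twice with designs that materialize the data additional times. It is an illustration only and is not taken from ThemisMR: the function names, the 1 TB data set, the 1 GB/s aggregate disk bandwidth, and the simplification that intermediate data is the same size as the input are all assumptions.

```python
def total_io_bytes(dataset_bytes: int, passes: int) -> int:
    """Bytes moved to and from disk when every record is read once
    and written once per pass (assumes intermediate data is the
    same size as the input, a simplification)."""
    return 2 * passes * dataset_bytes


def io_bound_seconds(dataset_bytes: int, passes: int,
                     disk_bytes_per_second: float) -> float:
    """Lower bound on runtime for a purely I/O-bound job."""
    return total_io_bytes(dataset_bytes, passes) / disk_bytes_per_second


if __name__ == "__main__":
    one_tb = 10**12      # hypothetical data set size in bytes
    bandwidth = 10**9    # hypothetical aggregate disk bandwidth, bytes/s
    for passes in (2, 3, 4):
        seconds = io_bound_seconds(one_tb, passes, bandwidth)
        print(f"{passes} passes: {total_io_bytes(one_tb, passes)} bytes, "
              f"at least {seconds:.0f} s")
```

Under these assumptions, each extra materialization of the data adds a full read and a full write of the data set, so an I/O-bound job's minimum runtime grows linearly with the number of passes.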

Pre-2018 CSE ID: CS2012-0983
