Skip to main content
eScholarship
Open Access Publications from the University of California

UC Berkeley

UC Berkeley Electronic Theses and Dissertations bannerUC Berkeley

Replay Debugging for the Datacenter

Abstract

Debugging large-scale, data-intensive, distributed applications running in a datacenter ("datacenter applications") is complex and time-consuming. The key obstacle is non-deterministic failures--hard-to-reproduce program misbehaviors that are immune to traditional cyclic debugging techniques. Datacenter applications are rife with such failures because they operate in highly non-deterministic environments: a typical setup employs thousands of nodes, spread across multiple datacenters, to process terabytes of data per day. In these environments, existing methods for debugging non-deterministic failures are of limited use. They either incur excessive production overheads or don't scale to multi-node, terabyte-scale processing.

To help remedy the situation, we have built a new deterministic replay tool. Our tool, called DCR, enables the reproduction and debugging of non-deterministic failures in production datacenter runs. The key observation behind DCR is that debugging does not always require a precise replica of the original datacenter run. Instead, it often suffices to produce some run that exhibits the original behavior of the control-plane--the most error-prone component of datacenter applications. DCR leverages this observation to relax the determinism guarantees offered by the system, and consequently, to address key requirements of production datacenter applications: lightweight recording of long-running programs, causally consistent replay of large-scale clusters, and out-of-the box operation with existing, real-world applications running on commodity multiprocessors.

Main Content
For improved accessibility of PDF content, download the file to your device.
Current View