eScholarship
Open Access Publications from the University of California

UC Santa Cruz

UC Santa Cruz Electronic Theses and Dissertations

Seeing the forest and the trees: Tackling Distributed Systems Problems by Querying Observations of Executions

Creative Commons 'BY-SA' version 4.0 license
Abstract

Distributed systems are ubiquitous but continue to be challenging to understand, build, and troubleshoot. Fundamentally, reasoning about distributed system behaviors is hard due to the effects of partial failures and nondeterminism in system executions. For example, we expect systems to remain available even if some number of replicas fail. These problems are exacerbated by the dynamic nature and scale of production systems today. Tooling support has lagged behind the pace at which systems are being deployed, so more research in this space is urgently needed.

Our overarching claim is that many common distributed systems problems, such as improving fault tolerance or debugging failures, can be addressed by querying observations of executions. Since our system view consists of observations of system executions rather than the system itself, the executions must exercise varied paths for us to have a comprehensive understanding of the system. A second requirement follows from the fact that events in distributed executions may be separated by space and time: observations must capture both events and how they relate to each other within individual executions.

Our key insight is that we need to aggregate information from many executions while preserving the causal relationships within individual executions to answer the posed questions. We do so by building models of domain knowledge and deriving insights about system operation from the observations. We use provenance graphs (a growing area of research) and distributed traces, which have seen increased adoption in industry, as observations of system executions, since they capture the causality of event interactions within executions and can be normalized to aggregate information across many executions.
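As a rough illustration of this idea, the following minimal sketch reduces each trace to its set of caller-to-callee edges, so that causality within an execution is kept, and then counts those edges across all traces. The span fields (`id`, `parent`, `service`), the function names, and the sample data are hypothetical and chosen only for illustration; they are not the data model or the technique used in the thesis.

```python
from collections import Counter


def causal_edges(trace):
    """Return the set of (caller, callee) service edges in one trace.

    Each span is a dict with hypothetical keys: "id", "parent", "service".
    A span whose "parent" is None is the root of the execution.
    """
    service_of = {span["id"]: span["service"] for span in trace}
    edges = set()
    for span in trace:
        parent = span["parent"]
        if parent is not None:
            edges.add((service_of[parent], span["service"]))
    return edges


def aggregate(traces):
    """Count, per causal edge, how many traces contain that edge."""
    counts = Counter()
    for trace in traces:
        counts.update(causal_edges(trace))
    return counts


# Two made-up executions that exercise different paths.
traces = [
    [
        {"id": 1, "parent": None, "service": "frontend"},
        {"id": 2, "parent": 1, "service": "auth"},
        {"id": 3, "parent": 1, "service": "cart"},
    ],
    [
        {"id": 1, "parent": None, "service": "frontend"},
        {"id": 2, "parent": 1, "service": "cart"},
        {"id": 3, "parent": 2, "service": "db"},
    ],
]

edge_counts = aggregate(traces)
for (caller, callee), n in sorted(edge_counts.items()):
    print(f"{caller} -> {callee}: seen in {n} trace(s)")
```

Because each trace is first reduced to causal edges rather than to a flat list of events, the aggregate can answer questions such as "how often does `frontend` call `cart`?" without having to infer which events caused which.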

Prior work uses observability infrastructure either to aggregate information from many executions or to compare pairs of executions while preserving causal relationships within executions, but not both. By aggregating metrics and logs, methodologies to address problems such as fault detection, localization, and anomaly detection have been investigated. Other work compares pairs of executions for interactive debugging, performance diagnostics, and workload and capacity modeling. The former approach either disregards the causality of event interactions within executions or attempts to infer it, producing sub-par results. The latter falls short because it considers only a single pair of executions, even though many varied execution paths are exercised.

In our work, we have developed and evaluated techniques for understanding and improving fault tolerance behavior, troubleshooting systems, and identifying instances of common design patterns that have applications in building domain knowledge, feature development and debugging performance issues. We explore how the problems that can be solved are constrained differently or change entirely depending on factors such as the granularity and format of system observations, timeline of expected response, how interactive (or not) techniques are expected to be, and the level of detail in the result produced.
