Skip to main content
eScholarship
Open Access Publications from the University of California

UC Berkeley

UC Berkeley Electronic Theses and Dissertations bannerUC Berkeley

Architecting for Performance Clarity in Data Analytics Frameworks

Abstract

There has been much research devoted to improving the performance of data analytics frameworks, but comparatively little effort has been spent systematically identifying the performance bottlenecks of these systems. Without an understanding of what factors are most important to performance, users do not know how to choose a software and hardware configuration to optimize runtime, and developers do not know which optimizations are most important to implement.

This thesis explores how to architect systems for performance clarity: the ability to understand where bottlenecks lie and the performance implications of various system changes. First, we focus on incrementally adding performance clarity to current data analytics frameworks. We develop blocked time analysis, a methodology for quantifying performance bottlenecks in parallelized systems, and use it to analyze the Spark framework’s performance on two SQL benchmarks and one production workload. Contrary to commonly-held beliefs about performance, we find that (i) CPU (and not I/O) is often the bottleneck, (ii) improving network performance can improve job completion time by at most 2%, and (iii) the causes of most stragglers can be identified.

Blocked time analysis helped to understand performance bottlenecks in today’s frameworks, but fell short of enabling users to reason about the impact of potential hardware and software configuration changes. Given the challenges to providing performance clarity in current architectures, the second part of this thesis focuses on a new system architecture built from the ground up for performance clarity. Rather than breaking jobs into tasks that pipeline many resources, as in today’s frameworks, we propose breaking jobs into units of work that each use a single resource, called monotasks. We demonstrate that explicitly separating the use of different resources simplifies reasoning about performance without sacrificing fast runtimes. Our implementation of monotasks provides job completion times within 9% of Apache Spark, and leads to a model for job completion time that predicts runtime under different hardware and software configurations with at most 28% error for most predictions.

Main Content
For improved accessibility of PDF content, download the file to your device.
Current View