Automated Performance and Correctness Debugging for Big Data Analytics

Abstract

The constantly increasing volume of data collected in every aspect of our daily lives has necessitated the development of more powerful and efficient analysis tools. In particular, data-intensive scalable computing (DISC) systems such as Google's MapReduce [36], Apache Hadoop [4], and Apache Spark [5] have become valuable tools for consuming and analyzing large volumes of data. At the same time, these systems provide valuable programming abstractions and libraries that enable adoption by users from a wide variety of backgrounds, such as business analytics and data science. However, the widespread adoption of DISC systems and their underlying complexity have also highlighted a gap between developers' ability to write applications and their ability to understand the behavior of those applications. Our hypothesis is that, by merging distributed systems debugging techniques with software engineering ideas, we can design accurate yet scalable approaches for debugging and testing the performance and correctness of big data analytics.

To design such approaches, we first investigate how to combine data provenance with latency propagation techniques to debug computation skew, that is, abnormally high computation costs incurred by a small subset of input data, by identifying expensive input records. Next, we investigate how to extend taint analysis techniques with influence-based provenance for many-to-one dependencies to enhance root cause analysis and improve the precision of identifying fault-inducing input records. Finally, to replicate performance problems from described symptoms, we investigate how to redesign fuzz testing by targeting individual program components such as user-defined functions for focused, modular fuzzing, defining new guidance metrics for performance symptoms, and adding skew-inspired input mutations and mutation operation selector strategies.

For the first hypothesis, we introduce PERFDEBUG, a post-mortem performance debugging tool for computation skew. PERFDEBUG automatically finds the input records responsible for such abnormalities in big data applications by reasoning about deviations in performance metrics such as job execution time, garbage collection time, and serialization time. The key to PERFDEBUG's success is a data provenance-based technique that computes and propagates record-level computation latency to track abnormally expensive records throughout the application pipeline. The input records with the largest latency contributions are then presented to the user for bug fixing. Our evaluation of PERFDEBUG through in-depth case studies demonstrates that remediations such as removing the single most expensive record or applying simple code rewrites can achieve up to a 16X performance improvement.
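
As a concrete illustration of the latency-propagation idea, the sketch below is a hand-written approximation in plain Spark (Scala), not PERFDEBUG itself: each input record is tagged with an offset-based provenance ID, the map-side user-defined function is wrapped with a timer so intermediate records carry their origin and cost, and one simple propagation policy (keep the most expensive contributor per key) ranks input records by latency contribution. The job shape, function names, and propagation policy are illustrative assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LatencyProvenanceSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("latency-provenance-sketch"))

    // Hypothetical word-count-style job whose map-side UDF is expensive for some records.
    def expensiveParse(line: String): Seq[String] = line.split("\\s+").toSeq

    // Tag every input record with a provenance ID (its position) and time the UDF call,
    // so each intermediate record carries (value, (provenanceId, latencyNs)).
    val input = sc.textFile(args(0)).zipWithIndex()

    val tagged = input.flatMap { case (line, id) =>
      val start = System.nanoTime()
      val words = expensiveParse(line)
      val latency = System.nanoTime() - start
      words.map(w => (w, (id, latency)))
    }

    // One possible propagation policy: per output key, keep the single most
    // expensive contributing input record.
    val mostExpensivePerKey = tagged.reduceByKey { (a, b) => if (a._2 >= b._2) a else b }

    // Rank input records by their latency contribution and report the top offenders.
    mostExpensivePerKey
      .map { case (_, (id, latency)) => (id, latency) }
      .reduceByKey((a, b) => math.max(a, b))
      .sortBy(_._2, ascending = false)
      .take(10)
      .foreach { case (id, ns) => println(s"input record $id contributed ${ns / 1e6} ms") }
  }
}
```

PERFDEBUG performs this bookkeeping automatically rather than requiring applications to be rewritten in this hand-instrumented style.
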
Second, we present FLOWDEBUG, a fault isolation technique that identifies a highly precise subset of fault-inducing input records. FLOWDEBUG is designed around two key insights: it uses precise control and data flow within user-defined functions, and it introduces a novel notion of influence-based provenance that ranks the importance of individual inputs to aggregation functions. By design, FLOWDEBUG does not require any modification to the framework's runtime and thus can be applied to existing applications easily. We demonstrate that FLOWDEBUG improves the precision of debugging results by up to five orders of magnitude and avoids the repetitive re-runs required for post-mortem analysis, reducing them by a factor of 33 compared to existing state-of-the-art systems.

Finally, we discuss PERFGEN, a performance debugging aid that replicates performance symptoms via automated workload generation. PERFGEN effectively generates symptom-producing test inputs by using a phased fuzzing approach that extends traditional fuzz testing to target specific user-defined functions, avoiding the additional fuzzing complexity of program executions that are unlikely to be related to the target symptom. To support PERFGEN, we define a suite of guidance metrics and performance skew symptom patterns, which are then used to derive skew-oriented mutations for phased fuzzing. We evaluate PERFGEN using four case studies, which demonstrate an average speedup of at least 43X compared to traditional fuzzing approaches while requiring less than 0.004% of the fuzzing iterations.
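
To make the phased-fuzzing idea concrete, the following self-contained Scala sketch shows one plausible shape for a skew-inspired mutation and a performance guidance metric applied to a single user-defined function in isolation. It is not PERFGEN's implementation: the target UDF, the mutation operator (redirecting records to one hot key), the greedy acceptance loop, and all names are illustrative assumptions, and PERFGEN additionally reasons about symptom patterns and mutation selector strategies.

```scala
import scala.util.Random

object SkewFuzzSketch {
  type Input = Seq[(String, Int)]

  // Hypothetical target UDF, fuzzed in isolation: its cost grows quadratically with the
  // size of the largest key group, so key skew is exactly what reproduces a slow-task symptom.
  def targetUdf(records: Input): Map[String, Long] =
    records.groupBy(_._1).map { case (k, vs) =>
      k -> vs.foldLeft(0L) { (acc, a) => acc + vs.count(_._2 > a._2) }
    }

  // Skew-inspired mutation: redirect a random fraction of records to one "hot" key,
  // concentrating the key distribution without changing the input size.
  def skewKeys(seed: Input, rnd: Random, fraction: Double = 0.2): Input = {
    val hotKey = seed(rnd.nextInt(seed.size))._1
    seed.map { case (k, v) => if (rnd.nextDouble() < fraction) (hotKey, v) else (k, v) }
  }

  // Guidance metric: wall-clock time of the target UDF alone, a stand-in for
  // performance-symptom metrics such as task latency, GC time, or memory usage.
  def runtimeNs(records: Input): Long = {
    val start = System.nanoTime()
    targetUdf(records)
    System.nanoTime() - start
  }

  def main(args: Array[String]): Unit = {
    val rnd = new Random(7)
    var best: Input = (1 to 2000).map(i => (s"key${i % 40}", rnd.nextInt(1000)))
    var bestScore = runtimeNs(best)

    // Greedy fuzzing loop: keep a mutation only if it increases the guidance metric.
    for (_ <- 1 to 100) {
      val candidate = skewKeys(best, rnd)
      val score = runtimeNs(candidate)
      if (score > bestScore) { best = candidate; bestScore = score }
    }
    val hottest = best.groupBy(_._1).values.map(_.size).max
    println(s"largest key group: $hottest of ${best.size}; UDF time: ${bestScore / 1e6} ms")
  }
}
```

Fuzzing the UDF directly, as sketched here, is what lets the search avoid executing the rest of the pipeline on inputs that cannot exhibit the target symptom.
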
