eScholarship, UC Riverside Electronic Theses and Dissertations

Declarative Profiling for Parallel Systems

Abstract

The popularity of parallel systems for building high-performance software continues to rise. Programming these systems has always been a challenging task, and ensuring that they perform optimally is more challenging still. To assist programmers in this space, a wealth of research has been conducted on building profilers for these systems. Unsurprisingly, balancing the requirements of utility, accuracy, and overhead makes this a challenging task as well. While existing profilers do an admirable job of accomplishing their stated goals, they all suffer from a lack of flexibility. The toolbox of the parallel programmer is filled to the brim with finely crafted specialized tools, but hardly any general ones. Some require the use of a specific programming language or threading library. Others are closely coupled with the underlying hardware and assume the presence of specific monitoring support therein. Many are restricted to a single type of parallel system, such as shared-memory multicore machines. To make matters worse, because these tools are all independent, they have distinct interfaces, output formats, and usage requirements. This makes performance analysis and debugging of parallel programs a needlessly frustrating task.

In this thesis, we propose and develop a new system for profiling parallel systems, called Context Sensitive Parallel Execution Profiles (CSPs), that is far more flexible than existing options. CSPs adopt a declarative approach in which the developer uses our annotation language to specify code regions of interest and our query language to specify quantities to measure in terms of those regions. CSPs do not require the use of a specific language or threading library, and they rely only on widely available hardware features, making them largely platform agnostic.

We first implement our system for shared-memory multicore machines and show that it has low overhead and high accuracy, and that it can be used to diagnose and repair performance problems in real parallel programs. In tests using the PARSEC benchmark suite, time overheads were typically below 5%, and peak memory overheads were below 46%. Measurements made with CSPs allowed us to speed up two of the programs by 36% and 17%, respectively. We then adapt our implementation to the distributed setting, enabling the profiling of clusters of multicore machines. A fundamental problem in distributed profiling is timestamp synchronization: meaningfully comparing timestamps taken on different machines. We develop a new timestamp synchronization algorithm that is up to 53.3% more accurate than existing algorithms. We further demonstrate the flexibility of our system by extending it to compute a variant of causal profiles (a popular type of profile recently developed for shared-memory systems) for distributed systems.
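
The abstract describes the CSP workflow only at a high level: annotate regions of interest, then query measurements over them. As a concrete illustration of that declarative pattern (not the thesis's actual annotation or query languages, whose syntax is not given here), the following hypothetical Python sketch marks named regions with a context manager, records per-thread intervals, and answers a simple query over them; the names `region`, `total_time`, and `_intervals` are invented for this example.

```python
import threading
import time
from contextlib import contextmanager

# Shared log of recorded intervals: (region name, thread id, start, end).
_intervals = []
_lock = threading.Lock()

@contextmanager
def region(name):
    # "Annotation": wrap a code region of interest and record one
    # timestamped interval per dynamic execution, per thread.
    start = time.perf_counter()
    try:
        yield
    finally:
        end = time.perf_counter()
        with _lock:
            _intervals.append((name, threading.get_ident(), start, end))

def total_time(name):
    # "Query": total time spent inside a region, summed over all threads
    # and all dynamic instances of that region.
    return sum(end - start for (n, _tid, start, end) in _intervals if n == name)

# Usage: annotate, run in parallel, then query.
def worker():
    with region("compute"):
        time.sleep(0.01)  # stand-in for real work

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"total time in 'compute': {total_time('compute'):.3f}s")
```

A real system in this style would also record calling context (the "context sensitive" part) and support far richer queries, but the division of labor is the same: annotations describe where to measure, queries describe what to compute over those regions.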
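The abstract does not detail the new timestamp synchronization algorithm itself, so as background only, here is a minimal sketch of the classic round-trip offset estimation (Cristian/NTP style) that distributed profilers commonly build on and that more accurate algorithms aim to improve; `read_remote_clock` is a hypothetical RPC returning the remote machine's current timestamp.

```python
import time

def estimate_offset(read_remote_clock, probes=8):
    # Classic round-trip estimate: assume the remote timestamp was taken
    # at the midpoint of the round trip, and keep the probe with the
    # smallest round-trip time, since it bounds the error most tightly.
    best_rtt, best_offset = None, None
    for _ in range(probes):
        t0 = time.perf_counter()
        t_remote = read_remote_clock()  # hypothetical RPC to the remote machine
        t1 = time.perf_counter()
        rtt = t1 - t0
        offset = t_remote - (t0 + t1) / 2.0
        if best_rtt is None or rtt < best_rtt:
            best_rtt, best_offset = rtt, offset
    # Estimated offset of the remote clock relative to the local one;
    # subtract it from remote timestamps to compare them with local ones.
    return best_offset
```

Under the symmetric-delay assumption this estimate's error is bounded by half the round-trip time and degrades when network delays are asymmetric; reducing such error is the goal of more accurate algorithms like the one this thesis develops.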
