Van Biesbrouck, Michael

Sampled simulation for multithreaded processors

2007

Abstract

Microarchitectural simulation of multithreaded architectures with shared resources, such as simultaneous multithreading (SMT) cores and multi-core processors with shared caches, is time-consuming, and the results of simulation may be difficult to interpret. It is time-consuming because modern benchmarks run for hundreds of billions (or even trillions) of instructions, and accurate multi-core and SMT simulation requires higher-detail models than single-threaded simulation. The statistics collected when two programs execute together can be difficult to interpret because the programs both exhibit independent phase behavior and affect each other's execution. Starting one program slightly later than during the original execution will change the phases that execute together and thus change the effects that the programs have on each other. Accurate sampled simulation requires accurate sample collection. We evaluate techniques to improve sampling accuracy and performance for both single-threaded and multithreaded simulation. These techniques include warming the CPU with detailed execution, storing cache state, and minimizing the size of checkpoints. Previous work showed that single-program performance can be accurately estimated by dividing execution into phases and simulating only representative samples from each phase. We demonstrate that the juxtaposition of phases (a 'co-phase') from a pair of programs behaves similarly to a single-threaded phase. Furthermore, simulating all possible co-phases allows analysis of all distinct SMT behaviors, and this comprehensive knowledge of program interactions can be combined with information about the sequence of phases executed by each program to reconstruct the combined execution of the programs from any given starting point. Because the samples are short, the set of executions from all possible starting offsets can be sampled in minutes, determining the average performance of the programs. This removes the problem of interpreting the results of small numbers of experiments. Finally, we propose three techniques for using the co-phase approach to summarize the behavior of all possible interactions within a suite of benchmarks. We reduce the scale of this problem using Principal Components Analysis, allowing our techniques to scale to large numbers of benchmarks and to concentrate simulation on the most significant behaviors.
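
The reconstruction step described in the abstract can be illustrated with a small sketch. This is not the dissertation's implementation. It assumes that each program's execution is summarized as a sequence of phase IDs over fixed-length instruction intervals, and that a table of per-thread IPC values has already been measured for every co-phase. The names `reconstruct_cycles`, `mean_cycles_over_offsets`, `EXAMPLE_COPHASE_IPC`, and the interval length are hypothetical, and the second program is assumed to restart when it finishes.

```python
from statistics import mean


def reconstruct_cycles(phases_a, phases_b, cophase_ipc, offset_b=0, interval=100):
    """Estimate the cycles program A needs to finish while co-running with B.

    phases_a, phases_b: phase ID for each fixed-length instruction interval.
    cophase_ipc: maps (phase_a, phase_b) -> (ipc_a, ipc_b), assumed to come
        from one short detailed sample per co-phase (hypothetical input).
    offset_b: interval index at which B starts; B wraps around at its end.
    """
    idx_a = 0
    idx_b = offset_b % len(phases_b)
    left_a = left_b = float(interval)  # instructions left in current interval
    cycles = 0.0
    while idx_a < len(phases_a):
        ipc_a, ipc_b = cophase_ipc[(phases_a[idx_a], phases_b[idx_b])]
        time_a = left_a / ipc_a  # cycles until A crosses an interval boundary
        time_b = left_b / ipc_b  # cycles until B crosses an interval boundary
        step = min(time_a, time_b)  # the co-phase is stable for this long
        cycles += step
        if time_a <= time_b:
            idx_a += 1
            left_a = float(interval)
        else:
            left_a = max(left_a - step * ipc_a, 0.0)
        if time_b <= time_a:
            idx_b = (idx_b + 1) % len(phases_b)
            left_b = float(interval)
        else:
            left_b = max(left_b - step * ipc_b, 0.0)
    return cycles


def mean_cycles_over_offsets(phases_a, phases_b, cophase_ipc, interval=100):
    """Average A's estimated cycles over every interval-aligned start of B."""
    return mean([
        reconstruct_cycles(phases_a, phases_b, cophase_ipc, offset, interval)
        for offset in range(len(phases_b))
    ])


# Illustrative, made-up data: two phases per program, four co-phases.
EXAMPLE_PHASES_A = [0, 1]
EXAMPLE_PHASES_B = [0, 1]
EXAMPLE_COPHASE_IPC = {
    (0, 0): (1.0, 1.0),
    (0, 1): (0.5, 1.0),
    (1, 0): (2.0, 0.5),
    (1, 1): (1.0, 2.0),
}

if __name__ == "__main__":
    print(mean_cycles_over_offsets(
        EXAMPLE_PHASES_A, EXAMPLE_PHASES_B, EXAMPLE_COPHASE_IPC))
```

The sketch enumerates only interval-aligned starting offsets, a simplification of sampling the set of all possible offsets. Because each reconstruction is a table walk rather than a detailed simulation, evaluating many offsets is cheap once the co-phase table exists.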

UC San Diego
