Skip to main content
eScholarship
Open Access Publications from the University of California

Roofline Analysis in the Intel® Advisor to Deliver Optimized Performance for applications on Intel® Xeon Phi™ Processor

Abstract

In this session we show, in two case studies, how the roofline feature of Intel Advisor has been utilized to optimize the performance of kernels of the XGC1 and PICSAR codes in preparation for Intel Knights Landing architecture. The impact of the implemented optimizations and the benefits of using the automatic roofline feature of Intel Advisor to study performance of large applications will be presented. This demonstrates an effective optimization strategy that has enabled these science applications to achieve up to 4.6 times speed-up and prepare for future exascale architectures. # Goal/Relevance of Session The roofline model [1,2] is a powerful tool for analyzing the performance of applications with respect to the theoretical peak achievable on a given computer architecture. It allows one to graphically represent the performance of an application in terms of operational intensity, i.e. the ratio of flops performed and bytes moved from memory in order to guide optimization efforts. Given the scale and complexity of modern science applications, it can often be a tedious task for the user to perform the analysis on the level of functions or loops to identify where performance gains can be made. With new Intel tools, it is now possible to automate this task, as well as base the estimates of peak performance on measurements rather than vendor specifications. The goal of this session is to demonstrate how the roofline feature of Intel Advisor can be used to balance memory vs. computation related optimization efforts and effectively identify performance bottlenecks. A series of typical optimization techniques: cache blocking, structure refactoring, data alignment, and vectorization illustrated by the kernel cases will be addressed. # Description of the codes ## XGC1 The XGC1 code [3] is a magnetic fusion Particle-In-Cell code that uses an unstructured mesh for its Poisson solver that allows it to accurately resolve the edge plasma of a magnetic fusion device. After recent optimizations to its collision kernel [4], most of the computing time is spent in the electron push (pushe) kernel, where these optimization efforts have been focused. The kernel code scaled well with MPI+OpenMP but had almost no automatic compiler vectorization, in part due to indirect memory addresses and in part due to low trip counts of low-level loops that would be candidates for vectorization. Particle blocking and sorting have been implemented to increase trip counts of low-level loops and improve memory locality, and OpenMP directives have been added to vectorize compute-intensive loops that were identified by Advisor. The optimizations have improved the performance of the pushe kernel 2x on Haswell processors and 1.7x on KNL. The KNL node-for-node performance has been brought to within 30% of a NERSC Cori phase I Haswell node and we expect to bridge this gap by reducing the memory footprint of compute intensive routines to improve cache reuse. ## PICSAR is a Fortran/Python high-performance Particle-In-Cell library targeting at MIC architectures first designed to be coupled with the PIC code WARP for the simulation of laser-matter interaction and particle accelerators. PICSAR also contains a FORTRAN stand-alone kernel for performance studies and benchmarks. A MPI domain decomposition is used between NUMA domains and a tile decomposition (cache-blocking) handled by OpenMP has been added for shared-memory parallelism and better cache management. The so-called current deposition and field gathering steps that compose the PIC time loop constitute major hotspots that have been rewritten to enable more efficient vectorization. Particle communications between tiles and MPI domain has been merged and parallelized. All considered, these improvements provide speedups of 3.1 for order 1 and 4.6 for order 3 interpolation shape factors on KNL configured in SNC4 quadrant flat mode. Performance is similar between a node of cori phase 1 and KNL at order 1 and better on KNL by a factor 1.6 at order 3 with the considered test case (homogeneous thermal plasma).

Item not freely available? Link broken?
Report a problem accessing this item