<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0">
  <channel>
    <docs>http://www.rssboard.org/rss-specification</docs>
    <atom:link rel="self" type="application/rss+xml" href="https://escholarship.org/uc/coe_ece/rss"/>
    <ttl>720</ttl>
    <title>Recent coe_ece items</title>
    <link>https://escholarship.org/uc/coe_ece/rss</link>
    <description>Recent eScholarship items from Electrical &amp; Computer Engineering</description>
    <pubDate>Fri, 15 May 2026 04:53:17 +0000</pubDate>
    <item>
      <title>Towards Universal Performance Modeling for Machine Learning Training on Multi-GPU Platforms</title>
      <link>https://escholarship.org/uc/item/5mv1s1gg</link>
      <description>Characterizing and predicting the training performance of modern machine learning (ML) workloads on compute systems with compute and communication spread between CPUs, GPUs, and network devices is not only the key to optimization and planning but also a complex goal to achieve. The primary challenges include the complexity of synchronization and load balancing between CPUs and GPUs, the variance in input data distribution, and the use of different communication devices and topologies (e.g., NVLink, PCIe, network cards) that connect multiple compute devices, coupled with the desire for flexible training configurations. Built on top of our prior work for single-GPU platforms, we address these challenges and enable multi-GPU performance modeling by incorporating (1) data-distribution-aware performance models for embedding table lookup, and (2) data movement prediction of communication collectives, into our upgraded performance modeling pipeline equipped with inter- and intra-rank...</description>
      <guid isPermaLink="true">https://escholarship.org/uc/item/5mv1s1gg</guid>
      <pubDate>Tue, 10 Jun 2025 00:00:00 +0000</pubDate>
      <dc:creator>Lin, Zhongyi</dc:creator>
      <dc:creator>Sun, Ning</dc:creator>
      <dc:creator>Bhattacharya, Pallab</dc:creator>
      <dc:creator>Feng, Xizhou</dc:creator>
      <dc:creator>Feng, Louis</dc:creator>
      <dc:creator>Owens, John D.</dc:creator>
    </item>
    <item>
      <title>Optimized GPU Implementation of Grid Refinement in Lattice Boltzmann Method</title>
      <link>https://escholarship.org/uc/item/0x86w4w1</link>
      <description>Optimized GPU Implementation of Grid Refinement in Lattice Boltzmann Method</description>
      <guid isPermaLink="true">https://escholarship.org/uc/item/0x86w4w1</guid>
      <pubDate>Tue, 02 Apr 2024 00:00:00 +0000</pubDate>
      <dc:creator>Mahmoud, Ahmed H.</dc:creator>
      <dc:creator>Salehipour, Hesam</dc:creator>
      <dc:creator>Meneghin, Massimiliano</dc:creator>
    </item>
    <item>
      <title>Dynamic Mesh Processing on the GPU</title>
      <link>https://escholarship.org/uc/item/1sm051d2</link>
      <description>We propose a system for dynamic triangle mesh processing entirely on the GPU. Our system offers an efficient data structure that allows fast updates of the underlying mesh connectivity and attributes. Our data structure partitions the mesh into small patches which allows processing all dynamic updates for each patch within the GPU's fast shared memory. This allows us to rely on &lt;em&gt;speculative processing&lt;/em&gt; for conflict handling, which has low rollback cost while maximizing parallelism and reducing the cost of locking. Our system also introduces a new programming model for dynamic mesh processing. The programming model offers concise semantics for dynamic updates, relieving the user from having to worry about conflicting updates in the context of parallel execution. Our programming model relies on the &lt;em&gt;cavity operator&lt;/em&gt;, which is a general mesh update operator that formulates any dynamic operation as an element reinsertion by removing a set of mesh elements and inserting...</description>
      <guid isPermaLink="true">https://escholarship.org/uc/item/1sm051d2</guid>
      <pubDate>Mon, 29 Jan 2024 00:00:00 +0000</pubDate>
      <dc:creator>Mahmoud, Ahmed H.</dc:creator>
      <dc:creator>Porumbescu, Serban D.</dc:creator>
      <dc:creator>Owens, John D.</dc:creator>
    </item>
    <item>
      <title>Maximum Clique Enumeration on the GPU</title>
      <link>https://escholarship.org/uc/item/7j96s061</link>
      <description>We present an iterative breadth-first approach to maximum clique enumeration on the GPU. The memory required to store all of the intermediate clique candidates poses a significant challenge. To mitigate this issue, we employ a variety of strategies to prune away non-maximum candidates and present a thorough examination of the performance and memory benefits of each of these options. We also explore a windowing strategy as a middle-ground between breadth-first and depth-first approaches, and investigate the resulting tradeoff between parallel efficiency and memory usage. Our results demonstrate that when we are able to manage the memory requirements, our approach achieves high throughput for large graphs, indicating this approach is a good choice for GPU performance. We demonstrate an average speedup of 1.9x over previous parallel work, and obtain our best performance on graphs with low average degree.</description>
      <guid isPermaLink="true">https://escholarship.org/uc/item/7j96s061</guid>
      <pubDate>Fri, 03 Nov 2023 00:00:00 +0000</pubDate>
      <dc:creator>Geil, Afton</dc:creator>
      <dc:creator>Porumbescu, Serban D.</dc:creator>
      <dc:creator>Owens, John D.</dc:creator>
    </item>
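    <!--
      The abstract above describes a level-synchronous, breadth-first expansion of clique
      candidates. The following Python is an illustrative CPU-side sketch of that idea only,
      without the paper's pruning or windowing strategies; the function name and the
      adjacency-dict input format are assumptions, not the authors' GPU implementation.

      def maximum_cliques(adj):
          """adj: dict mapping an int vertex to its set of neighbors; returns all maximum cliques."""
          frontier = [frozenset([v]) for v in adj]   # level 1: all single-vertex cliques
          best = frontier
          while frontier:
              next_level = set()
              for clique in frontier:
                  # candidates adjacent to every member; extend only past the max id
                  # so each (k+1)-clique is generated exactly once
                  cand = set.intersection(*(adj[v] for v in clique))
                  for u in cand:
                      if u > max(clique):
                          next_level.add(clique | {u})
              if next_level:
                  best = list(next_level)
              frontier = list(next_level)
          return best

      # e.g. maximum_cliques({0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}})
      # returns [frozenset({0, 1, 2})]

      Storing every intermediate level is exactly the memory pressure the abstract highlights;
      the paper's pruning options exist to shrink that frontier.
    -->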
    <item>
      <title>Building a Performance Model for Deep Learning Recommendation Model Training on GPUs</title>
      <link>https://escholarship.org/uc/item/6rt535s6</link>
      <description>We devise a performance model for GPU training of Deep Learning Recommendation Models (DLRM), whose GPU utilization is low compared to other well-optimized CV and NLP models. We show that both the device active time (the sum of kernel runtimes) and the device idle time are important components of the overall device time. We therefore tackle them separately by (1) flexibly adopting heuristic-based and ML-based kernel performance models for operators that dominate the device active time, and (2) categorizing operator overheads into five types to determine quantitatively their contribution to the device idle time. Combining these two parts, we propose a critical-path-based algorithm to predict the per-batch training time of DLRM by traversing its execution graph. We achieve less than 10% geometric mean average error (GMAE) in all kernel performance modeling, and 4.61% and 7.96% geomean errors for GPU active time and overall E2E per-batch training time prediction with overheads...</description>
      <guid isPermaLink="true">https://escholarship.org/uc/item/6rt535s6</guid>
      <pubDate>Tue, 29 Nov 2022 00:00:00 +0000</pubDate>
      <dc:creator>Lin, Zhongyi</dc:creator>
      <dc:creator>Feng, Louis</dc:creator>
      <dc:creator>Ardestani, Ehsan K.</dc:creator>
      <dc:creator>Lee, Jaewon</dc:creator>
      <dc:creator>Lundell, John</dc:creator>
      <dc:creator>Kim, Changkyu</dc:creator>
      <dc:creator>Kejariwal, Arun</dc:creator>
      <dc:creator>Owens, John D.</dc:creator>
    </item>
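    <!--
      The item above predicts per-batch time by traversing an execution graph along its
      critical path. Below is a minimal sketch of that final step, assuming per-operator
      costs have already been predicted by the kernel performance models; the operator
      names, cost numbers, and function signature are hypothetical.

      from functools import lru_cache

      def predict_batch_time(cost, succ, sources):
          """Longest (critical) path through a DAG of operators.
          cost: op to predicted runtime; succ: op to list of dependent ops."""
          @lru_cache(maxsize=None)
          def finish(op):
              nexts = succ.get(op, ())
              return cost[op] + (max(finish(n) for n in nexts) if nexts else 0.0)
          return max(finish(s) for s in sources)

      # toy DAG: embedding lookup and bottom MLP proceed concurrently, then interact
      cost = {"emb": 0.40, "mlp": 0.25, "interact": 0.10, "top": 0.30}
      succ = {"emb": ["interact"], "mlp": ["interact"], "interact": ["top"]}
      predict_batch_time(cost, succ, ["emb", "mlp"])   # 0.80, set by the "emb" branch

      This sketch folds idle gaps into the per-operator costs; the paper models active time
      and overhead-driven idle time separately before combining them.
    -->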
    <item>
      <title>Towards Flexible and Compiler-Friendly Layer Fusion for CNNs on Multicore CPUs</title>
      <link>https://escholarship.org/uc/item/9v75738g</link>
      <description>Towards Flexible and Compiler-Friendly Layer Fusion for CNNs on Multicore CPUs</description>
      <guid isPermaLink="true">https://escholarship.org/uc/item/9v75738g</guid>
      <pubDate>Thu, 07 Jul 2022 00:00:00 +0000</pubDate>
      <dc:creator>Lin, Zhongyi</dc:creator>
      <dc:creator>Georganas, Evangelos</dc:creator>
      <dc:creator>Owens, John D.</dc:creator>
    </item>
    <item>
      <title>Neon: A Multi-GPU Programming Model for Grid-based Computations</title>
      <link>https://escholarship.org/uc/item/9fz7k633</link>
      <description>We present Neon, a new programming model for grid-based computation with an intuitive, easy-to-use interface that allows domain experts to take full advantage of single-node multi-GPU systems. Neon decouples data structure from computation and back-end configurations, allowing the same user code to operate on a variety of data structures and devices. Neon relies on a set of hierarchical abstractions that allow the user to write their applications as if they were sequential applications, while the runtime handles distribution across multiple GPUs and performs optimizations such as overlapping computation and communication without user intervention. We evaluate our programming model on several applications: a Lattice Boltzmann fluid solver, a finite-difference Poisson solver, and a finite-element linear elastic solver. We show that these applications can be implemented concisely and scale well with the number of GPUs—achieving more than 99% of ideal efficiency.</description>
      <guid isPermaLink="true">https://escholarship.org/uc/item/9fz7k633</guid>
      <pubDate>Fri, 11 Feb 2022 00:00:00 +0000</pubDate>
      <dc:creator>Meneghin, Massimiliano</dc:creator>
      <dc:creator>Mahmoud, Ahmed H.</dc:creator>
      <dc:creator>Jayaraman, Pradeep Kumar</dc:creator>
      <dc:creator>Morris, Nigel J. W.</dc:creator>
    </item>
    <item>
      <title>RXMesh: A GPU Mesh Data Structure</title>
      <link>https://escholarship.org/uc/item/8r5848vp</link>
      <description>We propose a new static high-performance mesh data structure for triangle surface meshes on the GPU. Our data structure is carefully designed for parallel execution while capturing mesh locality and confining data access, as much as possible, within the GPU's fast shared memory. We achieve this by subdividing the mesh into &lt;em&gt;patches&lt;/em&gt; and representing these patches compactly using a matrix-based representation. Our patching technique is decorated with &lt;em&gt;ribbons&lt;/em&gt;, thin mesh strips around patches that eliminate the need to communicate between different computation thread blocks, resulting in consistent high throughput. We call our data structure RXMesh: Ribbon-matriX Mesh. We hide the complexity of our data structure behind a flexible but powerful programming model that helps deliver high performance by inducing load balance even in highly irregular input meshes. We show the efficacy of our programming model on common geometry processing applications—mesh...</description>
      <guid isPermaLink="true">https://escholarship.org/uc/item/8r5848vp</guid>
      <pubDate>Thu, 13 May 2021 00:00:00 +0000</pubDate>
      <dc:creator>Mahmoud, Ahmed H.</dc:creator>
      <dc:creator>Porumbescu, Serban D.</dc:creator>
      <dc:creator>Owens, John D.</dc:creator>
    </item>
    <item>
      <title>Dynamic Graphs on the GPU</title>
      <link>https://escholarship.org/uc/item/48j4k7np</link>
      <description>We present a fast dynamic graph data structure for the GPU. Our dynamic graph structure uses one hash table per vertex to store adjacency lists and achieves 3.4–14.8x faster insertion rates over the state of the art across a diverse set of large datasets, as well as deletion speedups up to 7.8x. The data structure supports queries and dynamic updates through both edge and vertex insertion and deletion. In addition, we define a comprehensive evaluation strategy based on operations, workloads, and applications that we believe better characterize and evaluate dynamic graph data structures.</description>
      <guid isPermaLink="true">https://escholarship.org/uc/item/48j4k7np</guid>
      <pubDate>Tue, 11 Feb 2020 00:00:00 +0000</pubDate>
      <dc:creator>Awad, Muhammad A.</dc:creator>
      <dc:creator>Ashkiani, Saman</dc:creator>
      <dc:creator>Porumbescu, Serban D.</dc:creator>
      <dc:creator>Owens, John D.</dc:creator>
    </item>
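    <!--
      The data structure above keeps one hash table per vertex for its adjacency list. Here
      is a CPU-side Python sketch of that layout, with a built-in set standing in for each
      per-vertex GPU hash table; the class and method names are illustrative, and none of
      the GPU concurrency machinery is modeled.

      class DynamicGraph:
          def __init__(self):
              self.adj = {}                      # vertex to its "hash table" of neighbors

          def insert_edge(self, u, v):
              self.adj.setdefault(u, set()).add(v)
              self.adj.setdefault(v, set()).add(u)

          def delete_edge(self, u, v):
              self.adj.get(u, set()).discard(v)
              self.adj.get(v, set()).discard(u)

          def delete_vertex(self, v):
              for u in self.adj.pop(v, set()):
                  self.adj[u].discard(v)

          def has_edge(self, u, v):
              return v in self.adj.get(u, set())

      Per-vertex tables localize updates: an edge insertion touches only the two endpoint
      tables, which is what lets many updates proceed in parallel on the GPU.
    -->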
    <item>
      <title>Benchmarking Deep Learning Frameworks and Investigating FPGA Deployment for Traffic Sign Classification and Detection</title>
      <link>https://escholarship.org/uc/item/4sk284kw</link>
      <description>We benchmark several widely used deep learning frameworks and investigate the FPGA deployment for performing traffic sign classification and detection. We evaluate the training speed and inference accuracy of these frameworks on the GPU by training FPGA-deployment-suitable models with various input sizes on GTSRB, a traffic sign classification dataset. Then, selected trained classification models and various object detection models that we train on GTSRB's detection counterpart (i.e., GTSDB) are evaluated with inference speed, accuracy, and FPGA power efficiency by varying different parameters such as floating-point precisions, batch sizes, etc. We discover that Neon and MXNet deliver the best training speed and classification accuracy on the GPU in general for all test cases, while TensorFlow is always among the frameworks with the highest inference accuracies. We observe that with the current OpenVINO release, the performance of lightweight models (e.g., MobileNet-v1-SSD, etc.)...</description>
      <guid isPermaLink="true">https://escholarship.org/uc/item/4sk284kw</guid>
      <pubDate>Mon, 08 Jul 2019 00:00:00 +0000</pubDate>
      <dc:creator>Lin, Zhongyi</dc:creator>
      <dc:creator>Yih, Matthew</dc:creator>
      <dc:creator>Ota, Jeffrey M.</dc:creator>
      <dc:creator>Owens, John D.</dc:creator>
      <dc:creator>Muyan-Ozcelik, Pinar</dc:creator>
    </item>
    <item>
      <title>High-Performance Linear Algebra-based Graph Framework on the GPU</title>
      <link>https://escholarship.org/uc/item/37j8j27d</link>
      <description>High-Performance Linear Algebra-based Graph Framework on the GPU</description>
      <guid isPermaLink="true">https://escholarship.org/uc/item/37j8j27d</guid>
      <pubDate>Mon, 08 Jul 2019 00:00:00 +0000</pubDate>
      <dc:creator>Yang, Carl Y.</dc:creator>
    </item>
    <item>
      <title>Graph Coloring on the GPU</title>
      <link>https://escholarship.org/uc/item/6kp4p18t</link>
      <description>We design and implement parallel graph coloring algorithms on the GPU using two different abstractions—one data-centric (Gunrock), the other linear-algebra-based (GraphBLAS). We analyze the impact of variations of a baseline independent-set algorithm on quality and runtime. We study how optimizations such as hashing, avoiding atomics, and a max-min heuristic affect performance. Our Gunrock graph coloring implementation has a peak 2x speed-up, a geomean speed-up of 1.3x, and produces 1.6x more colors than previous hardwired state-of-the-art implementations on real-world datasets. Our GraphBLAS implementation of Luby’s algorithm produces 1.9x fewer colors than the previous state-of-the-art parallel implementation at the cost of 3x extra runtime, and 1.014x fewer colors than a greedy, sequential algorithm with a geomean speed-up of 2.6x.</description>
      <guid isPermaLink="true">https://escholarship.org/uc/item/6kp4p18t</guid>
      <pubDate>Sat, 01 Jun 2019 00:00:00 +0000</pubDate>
      <dc:creator>Osama, Muhammad</dc:creator>
      <dc:creator>Truong, Minh</dc:creator>
      <dc:creator>Yang, Carl</dc:creator>
      <dc:creator>Buluç, Aydın</dc:creator>
      <dc:creator>Owens, John D.</dc:creator>
    </item>
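    <!--
      A minimal sequential sketch of the baseline independent-set coloring the abstract
      analyzes (Luby/Jones-Plassmann style): each round, vertices whose random priority
      beats all uncolored neighbors form an independent set and take the smallest free
      color. The names and priority scheme are illustrative; the hashing, atomics, and
      max-min variants studied in the paper are not shown.

      import random

      def independent_set_coloring(adj, seed=0):
          """adj: dict vertex to set of neighbors; returns dict vertex to color."""
          rng = random.Random(seed)
          prio = {v: rng.random() for v in adj}
          color = {}
          uncolored = set(adj)
          while uncolored:
              winners = [v for v in uncolored
                         if all(prio[v] > prio[u] for u in adj[v] if u in uncolored)]
              for v in winners:       # winners are mutually non-adjacent
                  used = {color[u] for u in adj[v] if u in color}
                  color[v] = next(c for c in range(len(adj)) if c not in used)
              uncolored.difference_update(winners)
          return color

      Each round colors a whole independent set at once, which is the source of the
      parallelism on the GPU; the color count depends on how priorities are drawn.
    -->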
    <item>
      <title>Benchmarking Deep Learning Frameworks with FPGA-suitable Models on a Traffic Sign Dataset</title>
      <link>https://escholarship.org/uc/item/7dc8d5vb</link>
      <description>We benchmark several widely used deep-learning frameworks for performing deep-learning-related automotive tasks (e.g., traffic sign recognition) that need to achieve real-time and high-accuracy results with limited resources available on embedded platforms such as FPGAs. In our benchmarks, we use various input image sizes on models that are suitable for FPGA deployment, and investigate the training speed and inference accuracy of selected frameworks for these different sizes on a popular traffic sign recognition dataset. We report results by running the frameworks solely on the CPU as well as by turning on GPU acceleration. We also provide optimizations we apply to fine-tune the performance of the frameworks. We discover that Neon and MXNet deliver the best training speed and inference accuracy in general for all our test cases, while TensorFlow is always among the frameworks with the highest inference accuracies. We also observe that on the particular dataset we tested on (i.e.,...</description>
      <guid isPermaLink="true">https://escholarship.org/uc/item/7dc8d5vb</guid>
      <pubDate>Thu, 10 May 2018 00:00:00 +0000</pubDate>
      <dc:creator>Lin, Zhongyi</dc:creator>
      <dc:creator>Ota, Jeffrey M.</dc:creator>
      <dc:creator>Owens, John D.</dc:creator>
      <dc:creator>Muyan-Ozcelik, Pinar</dc:creator>
    </item>
    <item>
      <title>Scalable Breadth-First Search on a GPU Cluster</title>
      <link>https://escholarship.org/uc/item/9bd842z6</link>
      <description>On a GPU cluster, the ratio of high computing power to communication bandwidth makes scaling breadth-first search (BFS) on a scale-free graph extremely challenging. By separating high and low out-degree vertices, we present an implementation with scalable computation and a model for scalable communication for BFS and direction-optimized BFS. Our communication model uses global reduction for high-degree vertices, and point-to-point transmission for low-degree vertices. Leveraging the characteristics of degree separation, we reduce the graph size to one third of the conventional edge list representation. With several other optimizations, we observe linear weak scaling as we increase the number of GPUs, and achieve 259.8 GTEPS on a scale-33 Graph500 RMAT graph with 124 GPUs on the latest CORAL early access system.</description>
      <guid isPermaLink="true">https://escholarship.org/uc/item/9bd842z6</guid>
      <pubDate>Tue, 13 Mar 2018 00:00:00 +0000</pubDate>
      <dc:creator>Pan, Yuechao</dc:creator>
      <dc:creator>Pearce, Roger</dc:creator>
      <dc:creator>Owens, John D.</dc:creator>
    </item>
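    <!--
      A single-node sketch of the direction-optimized BFS the item above scales across a
      GPU cluster: level-synchronous BFS that switches from top-down frontier expansion to
      bottom-up parent search when the frontier gets heavy. The switch threshold alpha and
      the dict-of-sets graph format are assumptions; the degree-separated communication
      model for multi-GPU runs is not represented.

      def direction_optimized_bfs(adj, src, alpha=4.0):
          """adj: dict vertex to set of neighbors (undirected); returns vertex depths."""
          depth, frontier, level = {src: 0}, {src}, 0
          while frontier:
              level += 1
              unvisited = set(adj) - set(depth)
              if sum(len(adj[v]) for v in frontier) > alpha * len(unvisited):
                  # bottom-up: each unvisited vertex scans for any frontier neighbor
                  nxt = {v for v in unvisited if any(u in frontier for u in adj[v])}
              else:
                  # top-down: expand every frontier vertex's edge list
                  nxt = {u for v in frontier for u in adj[v]} - set(depth)
              for v in nxt:
                  depth[v] = level
              frontier = nxt
          return depth

      On scale-free graphs the bottom-up phase avoids rescanning the huge edge lists of
      high-degree vertices, which is also why the paper treats high- and low-degree
      vertices differently for communication.
    -->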
    <item>
      <title>GPU LSM: A Dynamic Dictionary Data Structure for the GPU</title>
      <link>https://escholarship.org/uc/item/65t741zg</link>
      <description>We develop a dynamic dictionary data structure for the GPU, supporting fast insertions and deletions, based on the Log Structured Merge tree (LSM). Our implementation on an NVIDIA K40c GPU has an average update (insertion or deletion) rate of 225 M elements/s, 13.5x faster than merging items into a sorted array. The GPU LSM supports the retrieval operations of lookup, count, and range query at average rates of 75 M, 32 M, and 23 M queries/s, respectively. The trade-off for the dynamic updates is that the sorted array is almost twice as fast on retrievals. We believe that our GPU LSM is the first dynamic general-purpose dictionary data structure for the GPU.</description>
      <guid isPermaLink="true">https://escholarship.org/uc/item/65t741zg</guid>
      <pubDate>Tue, 13 Mar 2018 00:00:00 +0000</pubDate>
      <dc:creator>Ashkiani, Saman</dc:creator>
      <dc:creator>Li, Shengren</dc:creator>
      <dc:creator>Farach-Colton, Martin</dc:creator>
      <dc:creator>Amenta, Nina</dc:creator>
      <dc:creator>Owens, John D.</dc:creator>
    </item>
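    <!--
      The dictionary above is built on the Log Structured Merge idea: batched updates go
      into sorted runs that are merged geometrically, trading some lookup work for fast
      updates. A tiny sequential sketch of that core idea; the class and method names are
      invented, and deletion (which the GPU LSM handles with tombstones) is omitted.

      from bisect import bisect_left, bisect_right

      class TinyLSM:
          def __init__(self):
              self.runs = []                     # sorted lists, largest first

          def insert_batch(self, keys):
              run = sorted(keys)
              # merge equal-or-smaller runs, like carries in binary addition
              while self.runs and len(self.runs[-1]) <= len(run):
                  run = sorted(run + self.runs.pop())
              self.runs.append(run)

          def lookup(self, key):
              for r in self.runs:
                  i = bisect_left(r, key)
                  if i < len(r) and r[i] == key:
                      return True
              return False

          def count_range(self, lo, hi):
              return sum(bisect_right(r, hi) - bisect_left(r, lo) for r in self.runs)

      The sorted-array trade-off in the abstract shows up here too: every run must be
      probed on lookup, whereas a single sorted array needs one binary search.
    -->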
    <item>
      <title>Parallel Algorithms and Dynamic Data Structures on the Graphics Processing Unit: a warp-centric approach</title>
      <link>https://escholarship.org/uc/item/5qd0r4ws</link>
      <description>&lt;p&gt;Graphics Processing Units (GPUs) are massively parallel processors with thousands of active threads originally designed for throughput-oriented tasks. In order to get as much performance as possible given the hardware characteristics of GPUs, it is extremely important for programmers to not only design an efficient algorithm with good asymptotic complexity, but also to take into account the hardware limitations and preferences. In this work, we focus our design on two high-level abstractions: work assignment and processing. The former denotes the task assigned by the programmer to each thread or group of threads. The latter encapsulates the actual execution of assigned tasks.&lt;/p&gt;&lt;p&gt;Previous work conflates work assignment and processing into similar granularities. The most traditional way is to have per-thread work assignment followed by per-thread processing of that assigned work. Each thread sequentially processes a part of the input and then the results are combined appropriately....</description>
      <guid isPermaLink="true">https://escholarship.org/uc/item/5qd0r4ws</guid>
      <pubDate>Tue, 13 Mar 2018 00:00:00 +0000</pubDate>
      <dc:creator>Ashkiani, Saman</dc:creator>
    </item>
    <item>
      <title>Quotient Filters: Approximate Membership Queries on the GPU</title>
      <link>https://escholarship.org/uc/item/3v12f7dn</link>
      <description>&lt;p&gt;In this paper, we present our GPU implementation of the quotient filter, a compact data structure designed to implement approximate membership queries. The quotient filter is similar to the more well-known Bloom filter; however, in addition to set insertion and membership queries, the quotient filter also supports deletions and merging filters without requiring rehashing of the data set. Furthermore, the quotient filter can be extended to include counters without increasing the memory footprint. This paper describes our GPU implementation of two types of quotient filters: the standard quotient filter and the rank-and-select-based quotient filter. We describe the parallelization of all filter operations, including a comparison of the four different methods we devised for parallelizing quotient filter construction. In solving this problem, we found that we needed an operation similar to a parallel scan, but for non-associative operators. One outcome of this work is a variety...</description>
      <guid isPermaLink="true">https://escholarship.org/uc/item/3v12f7dn</guid>
      <pubDate>Tue, 13 Mar 2018 00:00:00 +0000</pubDate>
      <dc:creator>Geil, Afton</dc:creator>
      <dc:creator>Farach-Colton, Martin</dc:creator>
      <dc:creator>Owens, John D.</dc:creator>
    </item>
    <item>
      <title>A Dynamic Hash Table for the GPU</title>
      <link>https://escholarship.org/uc/item/2p48q0zg</link>
      <description>A Dynamic Hash Table for the GPU</description>
      <guid isPermaLink="true">https://escholarship.org/uc/item/2p48q0zg</guid>
      <pubDate>Tue, 13 Mar 2018 00:00:00 +0000</pubDate>
      <dc:creator>Ashkiani, Saman</dc:creator>
      <dc:creator>Farach-Colton, Martin</dc:creator>
      <dc:creator>Owens, John D.</dc:creator>
    </item>
    <item>
      <title>High Efficiency Micromachined Sub-THz Channels for Low Cost Interconnect for Planar Integrated Circuits</title>
      <link>https://escholarship.org/uc/item/9vm907zv</link>
      <description>High Efficiency Micromachined Sub-THz Channels for Low Cost Interconnect for Planar Integrated Circuits</description>
      <guid isPermaLink="true">https://escholarship.org/uc/item/9vm907zv</guid>
      <pubDate>Tue, 19 Jan 2016 00:00:00 +0000</pubDate>
      <dc:creator>Yu, Bo</dc:creator>
      <dc:creator>Liu, Yuhao</dc:creator>
      <dc:creator>Ye, Yu</dc:creator>
      <dc:creator>Liu, Xiaoguang</dc:creator>
      <dc:creator>Gu, Qun</dc:creator>
    </item>
    <item>
      <title>Extension of the Hot-Switching Reliability of RF-MEMS Switches Using a Series Contact Protection Technique</title>
      <link>https://escholarship.org/uc/item/8413c9hb</link>
      <description>Extension of the Hot-Switching Reliability of RF-MEMS Switches Using a Series Contact Protection Technique</description>
      <guid isPermaLink="true">https://escholarship.org/uc/item/8413c9hb</guid>
      <pubDate>Tue, 19 Jan 2016 00:00:00 +0000</pubDate>
      <dc:creator>Liu, Yuhao</dc:creator>
      <dc:creator>Bey, Yusha</dc:creator>
      <dc:creator>Liu, Xiaoguang "Leo"</dc:creator>
    </item>
  </channel>
</rss>
