<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0">
  <channel>
    <docs>http://www.rssboard.org/rss-specification</docs>
    <atom:link rel="self" type="application/rss+xml" href="https://escholarship.org/uc/coe_ece/rss"/>
    <ttl>720</ttl>
    <title>Recent coe_ece items</title>
    <link>https://escholarship.org/uc/coe_ece/rss</link>
    <description>Recent eScholarship items from Electrical &amp; Computer Engineering</description>
    <pubDate>Fri, 15 May 2026 04:53:17 +0000</pubDate>
    <item>
      <title>Towards Universal Performance Modeling for Machine Learning Training on Multi-GPU Platforms</title>
      <link>https://escholarship.org/uc/item/5mv1s1gg</link>
      <description>Characterizing and predicting the training performance of modern machine learning (ML) workloads on compute systems with compute and communication spread between CPUs, GPUs, and network devices is not only the key to optimization and planning but also a complex goal to achieve. The primary challenges include the complexity of synchronization and load balancing between CPUs and GPUs, the variance in input data distribution, and the use of different communication devices and topologies (e.g., NVLink, PCIe, network cards) that connect multiple compute devices, coupled with the desire for flexible training configurations. Built on top of our prior work for single-GPU platforms, we address these challenges and enable multi-GPU performance modeling by incorporating (1) data-distribution-aware performance models for embedding table lookup, and (2) data movement prediction of communication collectives, into our upgraded performance modeling pipeline equipped with inter- and intra-rank...</description>
      <guid isPermaLink="true">https://escholarship.org/uc/item/5mv1s1gg</guid>
      <pubDate>Tue, 10 Jun 2025 00:00:00 +0000</pubDate>
      <dc:creator>Lin, Zhongyi</dc:creator>
      <dc:creator>Sun, Ning</dc:creator>
      <dc:creator>Bhattacharya, Pallab</dc:creator>
      <dc:creator>Feng, Xizhou</dc:creator>
      <dc:creator>Feng, Louis</dc:creator>
      <dc:creator>Owens, John D.</dc:creator>
    </item>
    <item>
      <title>Optimized GPU Implementation of Grid Refinement in Lattice Boltzmann Method</title>
      <link>https://escholarship.org/uc/item/0x86w4w1</link>
      <description>Optimized GPU Implementation of Grid Refinement in Lattice Boltzmann Method</description>
      <guid isPermaLink="true">https://escholarship.org/uc/item/0x86w4w1</guid>
      <pubDate>Tue, 02 Apr 2024 00:00:00 +0000</pubDate>
      <dc:creator>Mahmoud, Ahmed H.</dc:creator>
      <dc:creator>Salehipour, Hesam</dc:creator>
      <dc:creator>Meneghin, Massimiliano</dc:creator>
    </item>
    <item>
      <title>Dynamic Mesh Processing on the GPU</title>
      <link>https://escholarship.org/uc/item/1sm051d2</link>
      <description>We propose a system for dynamic triangle mesh processing entirely on the GPU. Our system offers an efficient data structure that allows fast updates of the underlying mesh connectivity and attributes. Our data structure partitions the mesh into small patches which allows processing all dynamic updates for each patch within the GPU's fast shared memory. This allows us to rely on &lt;em&gt;speculative processing&lt;/em&gt; for conflict handling, which has low rollback cost while maximizing parallelism and reducing the cost of locking. Our system also introduces a new programming model for dynamic mesh processing. The programming model offers concise semantics for dynamic updates, relieving the user from having to worry about conflicting updates in the context of parallel execution. Our programming model relies on the &lt;em&gt;cavity operator&lt;/em&gt;, which is a general mesh update operator that formulates any dynamic operation as an element reinsertion by removing a set of mesh elements and inserting...</description>
      <guid isPermaLink="true">https://escholarship.org/uc/item/1sm051d2</guid>
      <pubDate>Mon, 29 Jan 2024 00:00:00 +0000</pubDate>
      <dc:creator>Mahmoud, Ahmed H.</dc:creator>
      <dc:creator>Porumbescu, Serban D.</dc:creator>
      <dc:creator>Owens, John D.</dc:creator>
    </item>
    <item>
      <title>Maximum Clique Enumeration on the GPU</title>
      <link>https://escholarship.org/uc/item/7j96s061</link>
      <description>We present an iterative breadth-first approach to maximum clique enumeration on the GPU. The memory required to store all of the intermediate clique candidates poses a significant challenge. To mitigate this issue, we employ a variety of strategies to prune away non-maximum candidates and present a thorough examination of the performance and memory benefits of each of these options. We also explore a windowing strategy as a middle-ground between breadth-first and depth-first approaches, and investigate the resulting tradeoff between parallel efficiency and memory usage. Our results demonstrate that when we are able to manage the memory requirements, our approach achieves high throughput for large graphs, indicating this approach is a good choice for GPU performance. We demonstrate an average speedup of 1.9x over previous parallel work, and obtain our best performance on graphs with low average degree.</description>
      <guid isPermaLink="true">https://escholarship.org/uc/item/7j96s061</guid>
      <pubDate>Fri, 03 Nov 2023 00:00:00 +0000</pubDate>
      <dc:creator>Geil, Afton</dc:creator>
      <dc:creator>Porumbescu, Serban D.</dc:creator>
      <dc:creator>Owens, John D.</dc:creator>
    </item>
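    <!--
      The abstract above describes a level-synchronous, breadth-first expansion of clique
      candidates. The following Python is an illustrative CPU-side sketch of that idea only,
      without the paper's pruning or windowing strategies; the function name and the
      adjacency-dict input format are assumptions, not the authors' GPU implementation.

      def maximum_cliques(adj):
          """adj: dict mapping an int vertex to its set of neighbors; returns all maximum cliques."""
          frontier = [frozenset([v]) for v in adj]   # level 1: all single-vertex cliques
          best = frontier
          while frontier:
              next_level = set()
              for clique in frontier:
                  # candidates adjacent to every member; extend only past the max id
                  # so each (k+1)-clique is generated exactly once
                  cand = set.intersection(*(adj[v] for v in clique))
                  for u in cand:
                      if u > max(clique):
                          next_level.add(clique | {u})
              if next_level:
                  best = list(next_level)
              frontier = list(next_level)
          return best

      # e.g. maximum_cliques({0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}})
      # returns [frozenset({0, 1, 2})]

      Storing every intermediate level is exactly the memory pressure the abstract highlights;
      the paper's pruning options exist to shrink that frontier.
    -->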
    <item>
      <title>Building a Performance Model for Deep Learning Recommendation Model Training on GPUs</title>
      <link>https://escholarship.org/uc/item/6rt535s6</link>
      <description>We devise a performance model for GPU training of Deep Learning Recommendation Models (DLRM), whose GPU utilization is low compared to other well-optimized CV and NLP models. We show that both the device active time (the sum of kernel runtimes) and the device idle time are important components of the overall device time. We therefore tackle them separately by (1) flexibly adopting heuristic-based and ML-based kernel performance models for operators that dominate the device active time, and (2) categorizing operator overheads into five types to determine quantitatively their contribution to the device idle time. Combining these two parts, we propose a critical-path-based algorithm to predict the per-batch training time of DLRM by traversing its execution graph. We achieve less than 10% geometric mean average error (GMAE) in all kernel performance modeling, and 4.61% and 7.96% geomean errors for GPU active time and overall E2E per-batch training time prediction with overheads...</description>
      <guid isPermaLink="true">https://escholarship.org/uc/item/6rt535s6</guid>
      <pubDate>Tue, 29 Nov 2022 00:00:00 +0000</pubDate>
      <dc:creator>Lin, Zhongyi</dc:creator>
      <dc:creator>Feng, Louis</dc:creator>
      <dc:creator>Ardestani, Ehsan K.</dc:creator>
      <dc:creator>Lee, Jaewon</dc:creator>
      <dc:creator>Lundell, John</dc:creator>
      <dc:creator>Kim, Changkyu</dc:creator>
      <dc:creator>Kejariwal, Arun</dc:creator>
      <dc:creator>Owens, John D.</dc:creator>
    </item>
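    <!--
      The item above predicts per-batch time by traversing an execution graph along its
      critical path. Below is a minimal sketch of that final step, assuming per-operator
      costs have already been predicted by the kernel performance models; the operator
      names, cost numbers, and function signature are hypothetical.

      from functools import lru_cache

      def predict_batch_time(cost, succ, sources):
          """Longest (critical) path through a DAG of operators.
          cost: op to predicted runtime; succ: op to list of dependent ops."""
          @lru_cache(maxsize=None)
          def finish(op):
              nexts = succ.get(op, ())
              return cost[op] + (max(finish(n) for n in nexts) if nexts else 0.0)
          return max(finish(s) for s in sources)

      # toy DAG: embedding lookup and bottom MLP proceed concurrently, then interact
      cost = {"emb": 0.40, "mlp": 0.25, "interact": 0.10, "top": 0.30}
      succ = {"emb": ["interact"], "mlp": ["interact"], "interact": ["top"]}
      predict_batch_time(cost, succ, ["emb", "mlp"])   # 0.80, set by the "emb" branch

      This sketch folds idle gaps into the per-operator costs; the paper models active time
      and overhead-driven idle time separately before combining them.
    -->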
    <item>
      <title>Towards Flexible and Compiler-Friendly Layer Fusion for CNNs on Multicore CPUs</title>
      <link>https://escholarship.org/uc/item/9v75738g</link>
      <description>Towards Flexible and Compiler-Friendly Layer Fusion for CNNs on Multicore CPUs</description>
      <guid isPermaLink="true">https://escholarship.org/uc/item/9v75738g</guid>
      <pubDate>Thu, 07 Jul 2022 00:00:00 +0000</pubDate>
      <dc:creator>Lin, Zhongyi</dc:creator>
      <dc:creator>Georganas, Evangelos</dc:creator>
      <dc:creator>Owens, John D.</dc:creator>
    </item>
    <item>
      <title>Neon: A Multi-GPU Programming Model for Grid-based Computations</title>
      <link>https://escholarship.org/uc/item/9fz7k633</link>
      <description>We present Neon, a new programming model for grid-based computation with an intuitive, easy-to-use interface that allows domain experts to take full advantage of single-node multi-GPU systems. Neon decouples data structure from computation and back-end configurations, allowing the same user code to operate on a variety of data structures and devices. Neon relies on a set of hierarchical abstractions that allow the user to write their applications as if they were sequential applications, while the runtime handles distribution across multiple GPUs and performs optimizations such as overlapping computation and communication without user intervention. We evaluate our programming model on several applications: a Lattice Boltzmann fluid solver, a finite-difference Poisson solver, and a finite-element linear elastic solver. We show that these applications can be implemented concisely and scale well with the number of GPUs—achieving more than 99% of ideal efficiency.</description>
      <guid isPermaLink="true">https://escholarship.org/uc/item/9fz7k633</guid>
      <pubDate>Fri, 11 Feb 2022 00:00:00 +0000</pubDate>
      <dc:creator>Meneghin, Massimiliano</dc:creator>
      <dc:creator>Mahmoud, Ahmed H.</dc:creator>
      <dc:creator>Jayaraman, Pradeep Kumar</dc:creator>
      <dc:creator>Morris, Nigel J. W.</dc:creator>
    </item>
    <item>
      <title>RXMesh: A GPU Mesh Data Structure</title>
      <link>https://escholarship.org/uc/item/8r5848vp</link>
      <description>We propose a new static high-performance mesh data structure for triangle surface meshes on the GPU. Our data structure is carefully designed for parallel execution while capturing mesh locality and confining data access, as much as possible, within the GPU's fast shared memory. We achieve this by subdividing the mesh into &lt;em&gt;patches&lt;/em&gt; and representing these patches compactly using a matrix-based representation. Our patching technique is decorated with &lt;em&gt;ribbons&lt;/em&gt;, thin mesh strips around patches that eliminate the need to communicate between different computation thread blocks, resulting in consistent high throughput. We call our data structure RXMesh: Ribbon-matriX Mesh. We hide the complexity of our data structure behind a flexible but powerful programming model that helps deliver high performance by inducing load balance even in highly irregular input meshes. We show the efficacy of our programming model on common geometry processing applications—mesh...</description>
      <guid isPermaLink="true">https://escholarship.org/uc/item/8r5848vp</guid>
      <pubDate>Thu, 13 May 2021 00:00:00 +0000</pubDate>
      <dc:creator>Mahmoud, Ahmed H.</dc:creator>
      <dc:creator>Porumbescu, Serban D.</dc:creator>
      <dc:creator>Owens, John D.</dc:creator>
    </item>
    <item>
      <title>Dynamic Graphs on the GPU</title>
      <link>https://escholarship.org/uc/item/48j4k7np</link>
      <description>We present a fast dynamic graph data structure for the GPU. Our dynamic graph structure uses one hash table per vertex to store adjacency lists and achieves 3.4–14.8x faster insertion rates over the state of the art across a diverse set of large datasets, as well as deletion speedups up to 7.8x. The data structure supports queries and dynamic updates through both edge and vertex insertion and deletion. In addition, we define a comprehensive evaluation strategy based on operations, workloads, and applications that we believe better characterize and evaluate dynamic graph data structures.</description>
      <guid isPermaLink="true">https://escholarship.org/uc/item/48j4k7np</guid>
      <pubDate>Tue, 11 Feb 2020 00:00:00 +0000</pubDate>
      <dc:creator>Awad, Muhammad A.</dc:creator>
      <dc:creator>Ashkiani, Saman</dc:creator>
      <dc:creator>Porumbescu, Serban D.</dc:creator>
      <dc:creator>Owens, John D.</dc:creator>
    </item>
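    <!--
      The data structure above keeps one hash table per vertex for its adjacency list. Here
      is a CPU-side Python sketch of that layout, with a built-in set standing in for each
      per-vertex GPU hash table; the class and method names are illustrative, and none of
      the GPU concurrency machinery is modeled.

      class DynamicGraph:
          def __init__(self):
              self.adj = {}                      # vertex to its "hash table" of neighbors

          def insert_edge(self, u, v):
              self.adj.setdefault(u, set()).add(v)
              self.adj.setdefault(v, set()).add(u)

          def delete_edge(self, u, v):
              self.adj.get(u, set()).discard(v)
              self.adj.get(v, set()).discard(u)

          def delete_vertex(self, v):
              for u in self.adj.pop(v, set()):
                  self.adj[u].discard(v)

          def has_edge(self, u, v):
              return v in self.adj.get(u, set())

      Per-vertex tables localize updates: an edge insertion touches only the two endpoint
      tables, which is what lets many updates proceed in parallel on the GPU.
    -->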
    <item>
      <title>Benchmarking Deep Learning Frameworks and Investigating FPGA Deployment for Traffic Sign Classification and Detection</title>
      <link>https://escholarship.org/uc/item/4sk284kw</link>
      <description>We benchmark several widely used deep learning frameworks and investigate the FPGA deployment for performing traffic sign classification and detection. We evaluate the training speed and inference accuracy of these frameworks on the GPU by training FPGA-deployment-suitable models with various input sizes on GTSRB, a traffic sign classification dataset. Then, selected trained classification models and various object detection models that we train on GTSRB's detection counterpart (i.e., GTSDB) are evaluated with inference speed, accuracy, and FPGA power efficiency by varying different parameters such as floating-point precisions, batch sizes, etc. We discover that Neon and MXNet deliver the best training speed and classification accuracy on the GPU in general for all test cases, while TensorFlow is always among the frameworks with the highest inference accuracies. We observe that with the current OpenVINO release, the performance of lightweight models (e.g., MobileNet-v1-SSD, etc.)...</description>
      <guid isPermaLink="true">https://escholarship.org/uc/item/4sk284kw</guid>
      <pubDate>Mon, 08 Jul 2019 00:00:00 +0000</pubDate>
      <dc:creator>Lin, Zhongyi</dc:creator>
      <dc:creator>Yih, Matthew</dc:creator>
      <dc:creator>Ota, Jeffrey M.</dc:creator>
      <dc:creator>Owens, John D.</dc:creator>
      <dc:creator>Muyan-Ozcelik, Pinar</dc:creator>
    </item>
    <item>
      <title>High-Performance Linear Algebra-based Graph Framework on the GPU</title>
      <link>https://escholarship.org/uc/item/37j8j27d</link>
      <description>High-Performance Linear Algebra-based Graph Framework on the GPU</description>
      <guid isPermaLink="true">https://escholarship.org/uc/item/37j8j27d</guid>
      <pubDate>Mon, 08 Jul 2019 00:00:00 +0000</pubDate>
      <dc:creator>Yang, Carl Y.</dc:creator>
    </item>
    <item>
      <title>Graph Coloring on the GPU</title>
      <link>https://escholarship.org/uc/item/6kp4p18t</link>
      <description>We design and implement parallel graph coloring algorithms on the GPU using two different abstractions—one data-centric (Gunrock), the other linear-algebra-based (GraphBLAS). We analyze the impact of variations of a baseline independent-set algorithm on quality and runtime. We study how optimizations such as hashing, avoiding atomics, and a max-min heuristic affect performance. Our Gunrock graph coloring implementation has a peak 2x speed-up, a geomean speed-up of 1.3x, and produces 1.6x more colors than previous hardwired state-of-the-art implementations on real-world datasets. Our GraphBLAS implementation of Luby’s algorithm produces 1.9x fewer colors than the previous state-of-the-art parallel implementation at the cost of 3x extra runtime, and 1.014x fewer colors than a greedy, sequential algorithm with a geomean speed-up of 2.6x.</description>
      <guid isPermaLink="true">https://escholarship.org/uc/item/6kp4p18t</guid>
      <pubDate>Sat, 01 Jun 2019 00:00:00 +0000</pubDate>
      <dc:creator>Osama, Muhammad</dc:creator>
      <dc:creator>Truong, Minh</dc:creator>
      <dc:creator>Yang, Carl</dc:creator>
      <dc:creator>Buluç, Aydın</dc:creator>
      <dc:creator>Owens, John D.</dc:creator>
    </item>
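    <!--
      A minimal sequential sketch of the baseline independent-set coloring the abstract
      analyzes (Luby/Jones-Plassmann style): each round, vertices whose random priority
      beats all uncolored neighbors form an independent set and take the smallest free
      color. The names and priority scheme are illustrative; the hashing, atomics, and
      max-min variants studied in the paper are not shown.

      import random

      def independent_set_coloring(adj, seed=0):
          """adj: dict vertex to set of neighbors; returns dict vertex to color."""
          rng = random.Random(seed)
          prio = {v: rng.random() for v in adj}
          color = {}
          uncolored = set(adj)
          while uncolored:
              winners = [v for v in uncolored
                         if all(prio[v] > prio[u] for u in adj[v] if u in uncolored)]
              for v in winners:       # winners are mutually non-adjacent
                  used = {color[u] for u in adj[v] if u in color}
                  color[v] = next(c for c in range(len(adj)) if c not in used)
              uncolored.difference_update(winners)
          return color

      Each round colors a whole independent set at once, which is the source of the
      parallelism on the GPU; the color count depends on how priorities are drawn.
    -->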
    <item>
      <title>Benchmarking Deep Learning Frameworks with FPGA-suitable Models on a Traffic Sign Dataset</title>
      <link>https://escholarship.org/uc/item/7dc8d5vb</link>
      <description>We benchmark several widely used deep-learning frameworks for performing deep-learning-related automotive tasks (e.g., traffic sign recognition) that need to achieve real-time and high-accuracy results with limited resources available on embedded platforms such as FPGAs. In our benchmarks, we use various input image sizes on models that are suitable for FPGA deployment, and investigate the training speed and inference accuracy of selected frameworks for these different sizes on a popular traffic sign recognition dataset. We report results by running the frameworks solely on the CPU as well as by turning on GPU acceleration. We also provide optimizations we apply to fine-tune the performance of the frameworks. We discover that Neon and MXNet deliver the best training speed and inference accuracy in general for all our test cases, while TensorFlow is always among the frameworks with the highest inference accuracies. We also observe that on the particular dataset we tested on (i.e.,...</description>
      <guid isPermaLink="true">https://escholarship.org/uc/item/7dc8d5vb</guid>
      <pubDate>Thu, 10 May 2018 00:00:00 +0000</pubDate>
      <dc:creator>Lin, Zhongyi</dc:creator>
      <dc:creator>Ota, Jeffrey M.</dc:creator>
      <dc:creator>Owens, John D.</dc:creator>
      <dc:creator>Muyan-Ozcelik, Pinar</dc:creator>
    </item>
    <item>
      <title>Scalable Breadth-First Search on a GPU Cluster</title>
      <link>https://escholarship.org/uc/item/9bd842z6</link>
      <description>On a GPU cluster, the ratio of high computing power to communication bandwidth makes scaling breadth-first search (BFS) on a scale-free graph extremely challenging. By separating high and low out-degree vertices, we present an implementation with scalable computation and a model for scalable communication for BFS and direction-optimized BFS. Our communication model uses global reduction for high-degree vertices, and point-to-point transmission for low-degree vertices. Leveraging the characteristics of degree separation, we reduce the graph size to one third of the conventional edge list representation. With several other optimizations, we observe linear weak scaling as we increase the number of GPUs, and achieve 259.8 GTEPS on a scale-33 Graph500 RMAT graph with 124 GPUs on the latest CORAL early access system.</description>
      <guid isPermaLink="true">https://escholarship.org/uc/item/9bd842z6</guid>
      <pubDate>Tue, 13 Mar 2018 00:00:00 +0000</pubDate>
      <dc:creator>Pan, Yuechao</dc:creator>
      <dc:creator>Pearce, Roger</dc:creator>
      <dc:creator>Owens, John D.</dc:creator>
    </item>
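    <!--
      A single-node sketch of the direction-optimized BFS the item above scales across a
      GPU cluster: level-synchronous BFS that switches from top-down frontier expansion to
      bottom-up parent search when the frontier gets heavy. The switch threshold alpha and
      the dict-of-sets graph format are assumptions; the degree-separated communication
      model for multi-GPU runs is not represented.

      def direction_optimized_bfs(adj, src, alpha=4.0):
          """adj: dict vertex to set of neighbors (undirected); returns vertex depths."""
          depth, frontier, level = {src: 0}, {src}, 0
          while frontier:
              level += 1
              unvisited = set(adj) - set(depth)
              if sum(len(adj[v]) for v in frontier) > alpha * len(unvisited):
                  # bottom-up: each unvisited vertex scans for any frontier neighbor
                  nxt = {v for v in unvisited if any(u in frontier for u in adj[v])}
              else:
                  # top-down: expand every frontier vertex's edge list
                  nxt = {u for v in frontier for u in adj[v]} - set(depth)
              for v in nxt:
                  depth[v] = level
              frontier = nxt
          return depth

      On scale-free graphs the bottom-up phase avoids rescanning the huge edge lists of
      high-degree vertices, which is also why the paper treats high- and low-degree
      vertices differently for communication.
    -->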
    <item>
      <title>GPU LSM: A Dynamic Dictionary Data Structure for the GPU</title>
      <link>https://escholarship.org/uc/item/65t741zg</link>
      <description>We develop a dynamic dictionary data structure for the GPU, supporting fast insertions and deletions, based on the Log Structured Merge tree (LSM). Our implementation on an NVIDIA K40c GPU has an average update (insertion or deletion) rate of 225 M elements/s, 13.5x faster than merging items into a sorted array. The GPU LSM supports the retrieval operations of lookup, count, and range query at average rates of 75 M, 32 M, and 23 M queries/s, respectively. The trade-off for the dynamic updates is that the sorted array is almost twice as fast on retrievals. We believe that our GPU LSM is the first dynamic general-purpose dictionary data structure for the GPU.</description>
      <guid isPermaLink="true">https://escholarship.org/uc/item/65t741zg</guid>
      <pubDate>Tue, 13 Mar 2018 00:00:00 +0000</pubDate>
      <dc:creator>Ashkiani, Saman</dc:creator>
      <dc:creator>Li, Shengren</dc:creator>
      <dc:creator>Farach-Colton, Martin</dc:creator>
      <dc:creator>Amenta, Nina</dc:creator>
      <dc:creator>Owens, John D.</dc:creator>
    </item>
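    <!--
      The dictionary above is built on the Log Structured Merge idea: batched updates go
      into sorted runs that are merged geometrically, trading some lookup work for fast
      updates. A tiny sequential sketch of that core idea; the class and method names are
      invented, and deletion (which the GPU LSM handles with tombstones) is omitted.

      from bisect import bisect_left, bisect_right

      class TinyLSM:
          def __init__(self):
              self.runs = []                     # sorted lists, largest first

          def insert_batch(self, keys):
              run = sorted(keys)
              # merge equal-or-smaller runs, like carries in binary addition
              while self.runs and len(self.runs[-1]) <= len(run):
                  run = sorted(run + self.runs.pop())
              self.runs.append(run)

          def lookup(self, key):
              for r in self.runs:
                  i = bisect_left(r, key)
                  if i < len(r) and r[i] == key:
                      return True
              return False

          def count_range(self, lo, hi):
              return sum(bisect_right(r, hi) - bisect_left(r, lo) for r in self.runs)

      The sorted-array trade-off in the abstract shows up here too: every run must be
      probed on lookup, whereas a single sorted array needs one binary search.
    -->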
    <item>
      <title>Parallel Algorithms and Dynamic Data Structures on the Graphics Processing Unit: a warp-centric approach</title>
      <link>https://escholarship.org/uc/item/5qd0r4ws</link>
      <description>&lt;p&gt;Graphics Processing Units (GPUs) are massively parallel processors with thousands of active threads originally designed for throughput-oriented tasks. In order to get as much performance as possible given the hardware characteristics of GPUs, it is extremely important for programmers to not only design an efficient algorithm with good asymptotic complexity, but also to take into account the hardware limitations and preferences. In this work, we focus our design on two high-level abstractions: work assignment and processing. The former denotes the task assigned by the programmer to each thread or group of threads. The latter encapsulates the actual execution of assigned tasks.&lt;/p&gt;&lt;p&gt;Previous work conflates work assignment and processing into similar granularities. The most traditional way is to have per-thread work assignment followed by per-thread processing of that assigned work. Each thread sequentially processes a part of the input and then the results are combined appropriately....</description>
      <guid isPermaLink="true">https://escholarship.org/uc/item/5qd0r4ws</guid>
      <pubDate>Tue, 13 Mar 2018 00:00:00 +0000</pubDate>
      <dc:creator>Ashkiani, Saman</dc:creator>
    </item>
    <item>
      <title>Quotient Filters: Approximate Membership Queries on the GPU</title>
      <link>https://escholarship.org/uc/item/3v12f7dn</link>
      <description>&lt;p&gt;In this paper, we present our GPU implementation of the quotient filter, a compact data structure designed to implement approximate membership queries. The quotient filter is similar to the more well-known Bloom filter; however, in addition to set insertion and membership queries, the quotient filter also supports deletions and merging filters without requiring rehashing of the data set. Furthermore, the quotient filter can be extended to include counters without increasing the memory footprint. This paper describes our GPU implementation of two types of quotient filters: the standard quotient filter and the rank-and-select-based quotient filter. We describe the parallelization of all filter operations, including a comparison of the four different methods we devised for parallelizing quotient filter construction. In solving this problem, we found that we needed an operation similar to a parallel scan, but for non-associative operators. One outcome of this work is a variety...</description>
      <guid isPermaLink="true">https://escholarship.org/uc/item/3v12f7dn</guid>
      <pubDate>Tue, 13 Mar 2018 00:00:00 +0000</pubDate>
      <dc:creator>Geil, Afton</dc:creator>
      <dc:creator>Farach-Colton, Martin</dc:creator>
      <dc:creator>Owens, John D.</dc:creator>
    </item>
    <item>
      <title>A Dynamic Hash Table for the GPU</title>
      <link>https://escholarship.org/uc/item/2p48q0zg</link>
      <description>A Dynamic Hash Table for the GPU</description>
      <guid isPermaLink="true">https://escholarship.org/uc/item/2p48q0zg</guid>
      <pubDate>Tue, 13 Mar 2018 00:00:00 +0000</pubDate>
      <dc:creator>Ashkiani, Saman</dc:creator>
      <dc:creator>Farach-Colton, Martin</dc:creator>
      <dc:creator>Owens, John D.</dc:creator>
    </item>
    <item>
      <title>High Efficiency Micromachined Sub-THz Channels for Low Cost Interconnect for Planar Integrated Circuits</title>
      <link>https://escholarship.org/uc/item/9vm907zv</link>
      <description>High Efficiency Micromachined Sub-THz Channels for Low Cost Interconnect for Planar Integrated Circuits</description>
      <guid isPermaLink="true">https://escholarship.org/uc/item/9vm907zv</guid>
      <pubDate>Tue, 19 Jan 2016 00:00:00 +0000</pubDate>
      <dc:creator>Yu, Bo</dc:creator>
      <dc:creator>Liu, Yuhao</dc:creator>
      <dc:creator>Ye, Yu</dc:creator>
      <dc:creator>Liu, Xiaoguang</dc:creator>
      <dc:creator>Gu, Qun</dc:creator>
    </item>
    <item>
      <title>Extension of the Hot-Switching Reliability of RF-MEMS Switches Using a Series Contact Protection Technique</title>
      <link>https://escholarship.org/uc/item/8413c9hb</link>
      <description>Extension of the Hot-Switching Reliability of RF-MEMS Switches Using a Series Contact Protection Technique</description>
      <guid isPermaLink="true">https://escholarship.org/uc/item/8413c9hb</guid>
      <pubDate>Tue, 19 Jan 2016 00:00:00 +0000</pubDate>
      <dc:creator>Liu, Yuhao</dc:creator>
      <dc:creator>Bey, Yusha</dc:creator>
      <dc:creator>Liu, Xiaoguang "Leo"</dc:creator>
    </item>
  </channel>
</rss>
