Customized Computing: Acceleration of Big-Data Applications
eScholarship
Open Access Publications from the University of California

UCLA Electronic Theses and Dissertations


Abstract

Customized computing is gaining ever-increasing popularity in today's data centers as a way to meet the demand for high computational capability and energy efficiency. Among existing customized processors, the field-programmable gate array (FPGA) is one of the most promising candidates due to its flexible re-programmability and low power consumption. However, it remains unclear how to design efficient accelerators that fully exploit the hardware's features and deliver high performance for many big-data applications. This dissertation focuses on using FPGAs to accelerate two fundamental data-center applications, sorting and compression, with the aim of providing high-performance solutions that scale to take full advantage of hardware resources such as on-chip computing resources, off-chip memory bandwidth, and I/O bandwidth under varying application requirements. The first part of the thesis explores how merge-tree sort can be implemented and optimized on FPGAs to work efficiently across problem sizes ranging from megabytes to terabytes and across memory hierarchies including DDR DRAM, high-bandwidth memory (HBM), and solid-state drives (SSDs). On DRAM-based FPGAs, we show that a single merge tree can be optimally configured to saturate the off-chip memory bandwidth while minimizing the number of merge passes. On HBM-based FPGAs, the performance bottleneck shifts to the limited on-chip resources; to alleviate it, we present a two-phase merge-tree sorter that shares resources between the two phases. Finally, for problem sizes that exceed what the DRAM can hold, we couple the merge tree with emerging in-storage computing devices. The proposed in-storage sorter contains two separately optimized phases and can be re-programmed at runtime to switch between them.
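The pass-count arithmetic behind the merge-tree design can be sketched in software. The Python snippet below is an illustrative model only, not the dissertation's hardware implementation: it shows one pass of a k-way merge and how the leaf count k of the merge tree determines the number of passes over the data.

```python
import heapq

def kway_merge(runs):
    # One pass through a k-leaf merge tree: merge k sorted runs into one.
    return list(heapq.merge(*runs))

def merge_passes(num_runs, k):
    # Each pass reduces the number of sorted runs by a factor of k,
    # so a wider tree (larger k) means fewer trips over off-chip memory.
    passes = 0
    while num_runs > 1:
        num_runs = -(-num_runs // k)  # ceiling division
        passes += 1
    return passes
```

Widening the tree cuts the pass count, and with it the total memory traffic, but costs more on-chip comparators and buffering; this is the trade-off that is tuned differently for the DRAM, HBM, and in-storage designs.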
In the second part of the thesis, we investigate FPGA-accelerated lossless compression. We first present a scalable, high-throughput lossless compression accelerator that achieves state-of-the-art compression speed. Its core is a multi-way parallel architecture in which each way is a well-optimized, fully pipelined streaming compression engine. The engines are further equipped with a proper data-feeding mechanism and chained hash tables to avoid degrading compression quality. We also explore accelerating high-quality lossless compression, whose core performs the Burrows-Wheeler Transform (BWT) on a data window containing hundreds of kilobytes of symbols. We find that the anti-sequential suffix-sorting algorithm is a good fit for hardware implementation: we use the FPGA's abundant on-chip memory resources to construct a linked list and propose several optimizations to reduce the time spent searching the list. The proposed design is the first FPGA-based BWT solution that supports BWT window sizes of hundreds of kilobytes.
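For reference, the BWT at the core of the high-quality compressor can be stated compactly in software. The sketch below is a functional model under simplifying assumptions (a sentinel character and a direct suffix sort); the dissertation's hardware instead uses an anti-sequential suffix-sorting scheme over an on-chip linked list, which this snippet does not implement.

```python
def bwt(data: str) -> str:
    # Append a sentinel smaller than any input symbol so all suffixes
    # are distinct (an assumption for this illustration).
    s = data + "\0"
    # Sort suffix start positions lexicographically. This naive sort is
    # the expensive step the hardware replaces with anti-sequential
    # suffix sorting over a linked list.
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    # The transform output is the symbol preceding each sorted suffix.
    return "".join(s[i - 1] for i in sa)
```

Because the transform groups symbols with similar contexts together, the output window is far more compressible than the input, which is why BWT-based compressors accept the cost of the sort; the contribution here is making that sort tractable for windows of hundreds of kilobytes on an FPGA.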
