eScholarship
Open Access Publications from the University of California

UC Riverside

UC Riverside Electronic Theses and Dissertations

On the Acceleration of Database Primitives on FPGAs

Abstract

The decreasing cost of DRAM has made in-memory databases practical and driven their growing use. However, memory latency has not kept pace with increasing processing speed and has become the limiting performance factor. Large cache hierarchies effectively mitigate this latency for regular applications that exhibit spatial and/or temporal locality. Database operations, by their very nature, rely extensively on indirect addressing of sparse data and hence cannot benefit from large cache memories.

This dissertation describes filaments, an approach that focuses on latency masking rather than latency mitigation. Filaments are lightweight hardware threads that issue multiple outstanding memory requests, then switch to a waiting state and yield control to other filaments whose data is ready. Filaments are implemented and evaluated to accelerate hash-join and hash-based group-by aggregation on FPGAs. Content Addressable Memories (CAMs) are used as a synchronization cache that enables processor-side locking.
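The filament idea — issue a memory request, suspend, and let another thread run until the data arrives — can be illustrated in software with generators standing in for hardware threads. This is only a minimal analogy sketch, not the dissertation's hardware design; the scheduler, the `filament` function, and the dictionary used as "memory" are all illustrative assumptions.

```python
from collections import deque

def filament(requests):
    """Illustrative filament: issues one address at a time, then
    suspends (yield) until the scheduler resumes it with the data."""
    total = 0
    for addr in requests:
        value = yield addr      # issue memory request, yield control
        total += value          # resumed here once the data is ready
    return total

def run_filaments(filaments, memory):
    """Round-robin scheduler: while one filament waits on memory,
    others make progress, masking the latency of each access."""
    ready = deque()
    results = {}
    for i, f in enumerate(filaments):
        ready.append((i, f, next(f)))       # start: get first request
    while ready:
        i, f, addr = ready.popleft()
        try:
            next_addr = f.send(memory[addr])  # deliver data, resume
            ready.append((i, f, next_addr))
        except StopIteration as done:
            results[i] = done.value           # filament finished
    return results
```

In hardware, suspension and resumption cost a context switch of a few cycles rather than Python's generator overhead, which is what makes many outstanding requests per processing element worthwhile.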

Sorting is an essential component in a wide variety of applications including databases, data analytics, and graph processing. While comparison-based and distribution-based sorting algorithms are the two common methods of sorting, comparison-based algorithms are the ones most commonly accelerated on FPGAs. As the technology evolves and new hardware architectures and platforms are introduced, a re-evaluation of accelerating distribution-based sorting algorithms, such as radix sort, on FPGAs is justified.

This dissertation describes a novel parallel in-memory Hardware Accelerated Radix Sort (HARS) that does not rely on sorting networks and avoids the performance-limiting merge steps. Dataset size is not restricted by the available on-chip memory, and HARS provides constant throughput.
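For reference, the distribution-based sorting that HARS accelerates follows the classic least-significant-digit radix sort pattern: keys are scattered into buckets by one digit per pass, with no comparisons and no merge step. The sketch below is a generic software formulation under assumed 32-bit keys and 8-bit digits, not the dissertation's hardware implementation.

```python
def radix_sort(keys, key_bits=32, digit_bits=8):
    """LSD radix sort: repeatedly distribute keys into buckets by one
    digit, least to most significant. Comparison-free and merge-free."""
    radix = 1 << digit_bits
    for shift in range(0, key_bits, digit_bits):
        buckets = [[] for _ in range(radix)]
        for k in keys:
            buckets[(k >> shift) & (radix - 1)].append(k)
        # stable concatenation preserves the order of earlier passes
        keys = [k for b in buckets for k in b]
    return keys
```

Because each pass touches every key exactly once, the work per pass is fixed, which is the property that lets a hardware pipeline sustain constant throughput.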

Sorting very large datasets efficiently using limited local memory is a challenging problem. The common method of sorting a large dataset is to divide it into smaller sub-arrays that fit in local memory, sort the sub-arrays, and then merge the resulting sub-arrays until the entire dataset is sorted. The merge steps become memory and performance bottlenecks as the size of the merged sub-arrays grows to exceed the local memory capacity.
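The sort-then-merge method described above can be sketched in a few lines: sort runs that fit a given "local memory" capacity, then k-way merge them. The `capacity` parameter standing in for local memory size is an illustrative assumption; the point is that the final merge must stream across all runs at once, which is where the bottleneck arises.

```python
import heapq

def external_sort(data, capacity):
    """Sort a dataset larger than local memory: sort capacity-sized
    runs independently, then k-way merge the sorted runs."""
    runs = [sorted(data[i:i + capacity])
            for i in range(0, len(data), capacity)]
    # the merge reads from every run simultaneously; once the combined
    # working set exceeds local memory, it becomes the bottleneck
    return list(heapq.merge(*runs))
```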

This dissertation explores a parallel FPGA implementation of sample sort, a cache-oblivious sorting algorithm that enables sorting very large datasets using limited local memory. A set of pivots is used to distribute keys from sorted sub-arrays into buckets that fit in local memory. Sorting the buckets produces the completely sorted set. All steps of sample sort can be performed entirely using the FPGA's local memory, overcoming a limitation of currently available FPGA sorting routines.
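The pivot-and-bucket idea can be sketched generically: choose pivots from a sample of the keys, scatter every key into the bucket its pivots bound, and sort each bucket independently. This sketch samples randomly for simplicity (the dissertation draws pivots from sorted sub-arrays); `num_buckets` and `oversample` are illustrative parameters.

```python
import bisect
import random

def sample_sort(keys, num_buckets, oversample=8):
    """Sample sort sketch: pick pivots from an oversampled random
    sample, scatter keys into pivot-bounded buckets, sort each
    bucket, and concatenate. Each bucket fits in 'local memory'."""
    if not keys:
        return []
    sample = sorted(random.sample(keys,
                                  min(len(keys), num_buckets * oversample)))
    # every oversample-th sample element becomes a pivot
    pivots = sample[oversample::oversample][:num_buckets - 1]
    buckets = [[] for _ in range(len(pivots) + 1)]
    for k in keys:
        buckets[bisect.bisect_right(pivots, k)].append(k)
    out = []
    for b in buckets:
        out.extend(sorted(b))   # buckets are independent: no merge step
    return out
```

Because bucket ranges are disjoint and ordered, concatenating the sorted buckets yields the fully sorted set without any merge pass, which is what lets every step run within local memory.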
