eScholarship
Open Access Publications from the University of California

UC Irvine

UC Irvine Electronic Theses and Dissertations

Enhancing Apache AsterixDB for Efficient Big Data Search and Analytics

Abstract

In a typical minute of a day in 2018, the Internet generates 3,138 terabytes of traffic, Twitter users send 473,000 tweets, and two million snaps are sent on Snapchat. By 2020, it is estimated that 1.7 MB of data will be created every second, on average, for each person on earth. Given these volumes, efficient search methods and analytics are required to explore such data. Thus, there is a clear need for Big Data management systems, such as Apache AsterixDB, that enable users and applications to search and explore Big Data.

Initiated in 2009, the AsterixDB project integrated ideas from three distinct areas - semi-structured data, parallel databases, and data-intensive computing - to create an open-source software platform that scales on large, shared-nothing commodity computing clusters. AsterixDB currently provides various types of indexes, such as B+-tree, R-tree, and inverted indexes, to fetch data efficiently. Because supporting similarity queries has become increasingly important in many applications, AsterixDB also supports similarity query processing using various metrics and provides an efficient similarity join method. In addition, it provides various search and fundamental analytical functions, and it can use external third-party libraries through user-defined functions to augment its functionality. Following the release of the first public open-source version of AsterixDB in 2013, we identified several topics that needed to be explored in depth to enhance AsterixDB further. Those topics are the focus of this thesis.

We first share our memory management experiences in AsterixDB. We describe the original implementation of the system's memory-intensive operations and a set of design flaws (oversights) related to memory management that we found later. We then discuss how we have addressed each of those oversights. We also discuss AsterixDB's memory management at the global level. We believe that future Big Data management system builders can benefit from our memory management experiences. With memory management under control, we next present the design and implementation of index-only query plans in AsterixDB. Use of these plans can boost the performance of an index-based search by several orders of magnitude compared to a scan-based or non-index-only approach. We discuss the challenges that we faced in implementing index-only query plans in AsterixDB and how we addressed them.

Lastly, considering the importance of similarity query processing for Big Data searches and analytics, we evaluate the performance of similarity query processing in AsterixDB. We compare its approach to several other systems and report the efficacy and performance results.
