Query Optimization in the Era of Big Data Management Systems
eScholarship
Open Access Publications from the University of California

UC Riverside Electronic Theses and Dissertations

Abstract

Over the past decade, a number of data-intensive scalable systems have been developed to process extremely large collections of data. The need to process such data as efficiently as possible has given rise to big data query optimization. This thesis concentrates on two research problems within query optimization for big data.

The first part of this work concerns the use of cost-based decisions to efficiently process massive collections of data. Traditional query optimizers are cost-based and "upfront": they use statistical estimates of intermediate result cardinalities to assign costs and pick the best query plan before execution begins. However, such estimates tend to become inaccurate in the presence of filtering conditions, whether due to undetected correlations between multiple predicates local to a single dataset, to predicates with query parameters, or to predicates involving user-defined functions (UDFs). Consequently, traditional query optimizers tend to ignore or mishandle these cases, leading to suboptimal execution plans. Given the volume of today's data, a suboptimal plan can quickly become very inefficient. To address this, we revisit the old idea of runtime dynamic optimization and adapt it to a shared-nothing distributed database system, AsterixDB. The optimization process starts by executing all predicates local to a single dataset and then continues in stages (re-optimization points). The intermediate result materialized at each stage is used to re-optimize the remaining query. This re-optimization approach avoids inaccurate intermediate result cardinality estimates and thus produces much better execution plans. While it introduces overhead for materializing these intermediate results, our experiments with industry benchmark data show that this overhead is relatively small and an acceptable price to pay given the optimization benefits. In fact, our experimental evaluation shows that runtime dynamic optimization leads to much better execution plans than the current default AsterixDB plans, as well as plans produced by static cost-based optimization (i.e., based on the initial dataset statistics) and other state-of-the-art approaches.
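The staged approach described above can be illustrated with a toy sketch (this is illustrative Python, not AsterixDB code; all function and variable names are hypothetical). The key idea is that predicates local to each dataset run first, and the *observed* sizes of the materialized intermediate results, rather than upfront estimates that can be wrong under correlated predicates, drive decisions such as join order at each re-optimization point:

```python
# Toy sketch of runtime dynamic optimization (hypothetical, not AsterixDB's API).

def run_stage(rows, predicate):
    """Execute one stage: apply a dataset-local predicate and materialize the result."""
    return [r for r in rows if predicate(r)]

def hash_join(left, right, key):
    """Simple in-memory hash join on a common key."""
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    return [{**l, **m} for l in left for m in index.get(l[key], [])]

def reoptimize(filtered_inputs):
    """Re-optimization point: order inputs by their true (observed) cardinality,
    so the smallest materialized intermediate result is joined first."""
    return sorted(filtered_inputs, key=lambda item: len(item[1]))

# Toy data with correlated predicates (here, city "LA" implies country "US"),
# exactly the case where independence-based upfront estimates go wrong.
orders = [{"cid": i % 3, "city": "LA", "country": "US"} for i in range(9)]
customers = [{"cid": i, "name": f"c{i}"} for i in range(3)]

# Stage 1: run all dataset-local predicates and materialize the results.
filtered = reoptimize([
    ("orders", run_stage(orders, lambda r: r["city"] == "LA" and r["country"] == "US")),
    ("customers", run_stage(customers, lambda r: True)),
])

# Stage 2: join using the order chosen from the actual cardinalities.
(_, left), (_, right) = filtered[0], filtered[1]
result = hash_join(left, right, "cid")
print(len(result))  # 9
```

The point of the sketch is only the control flow: each stage's output is materialized, its exact size replaces the estimate, and the remainder of the plan is re-chosen accordingly.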

The second part of this thesis focuses on heuristic decisions that can improve the querying of JSON data. With the abundance of web data, JSON has become the de facto data exchange format today. However, querying JSON data is not always efficient, especially for large data repositories. In this part of our work, we integrate the JSONiq extension of the XQuery language specification into an existing query processor (namely, Apache VXQuery) so as to enable the querying of JSON data in parallel. In particular, we have implemented three categories of rewrite rules that efficiently handle path expressions in parallel, along with introducing intra-query parallelism. An evaluation of our approach using a large (803 GB) real dataset of sensor readings shows that the proposed rewrite rules lead to efficient and scalable parallel processing of JSON data.
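One idea behind rewriting path expressions for parallel evaluation can be sketched as follows (an illustrative Python analogy, not VXQuery's actual rewrite rules; the function names and the sample document are hypothetical). Instead of materializing a whole JSON array and then navigating it, the rewritten form unnests the array into a stream of items and projects only the requested field, so items can be processed independently and thus in parallel across partitions:

```python
# Toy analogy of a path-expression rewrite (hypothetical, not VXQuery code).

def naive_path(doc, array_key, field):
    """Naive evaluation: materialize the entire array, then extract the field."""
    array = doc[array_key]          # whole array held in memory at once
    return [item[field] for item in array]

def rewritten_path(doc, array_key, field):
    """Rewritten form: unnest and project item by item as a generator,
    a shape amenable to partitioned, parallel evaluation."""
    for item in doc[array_key]:     # stream items instead of materializing
        yield item[field]

doc = {"readings": [{"sensor": "s1", "value": 1.5},
                    {"sensor": "s2", "value": 2.5}]}

# Both forms produce the same answer; only the evaluation strategy differs.
assert naive_path(doc, "readings", "value") == list(rewritten_path(doc, "readings", "value"))
print(list(rewritten_path(doc, "readings", "value")))  # [1.5, 2.5]
```

The rewrite preserves the query's result while changing its evaluation shape, which is what allows the engine to introduce parallelism without altering semantics.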
