Jiang, Lin

Scalable Data-Parallel Processing of Semi-Structured Data

2021

Advisor(s): Zhao, Zhijia

Creative Commons 'BY' version 4.0 license

Abstract

Semi-structured data, such as JSON, XML, and their derivatives, are essential in modern computing infrastructures, from cloud computing and microservices to NoSQL data stores and the Internet of Things (IoT). However, existing software often fails to process such data in a scalable way due to their nested structures and the lack of effective data-level parallelism. This thesis addresses two fundamental scalability issues in semi-structured data processing. First, how can semi-structured data analytics effectively leverage the abundant hardware parallelism offered by modern computer architectures (e.g., multi-core processors and SIMD operations)? Second, how can semi-structured data analytics improve data access efficiency (i.e., locality) and reduce memory consumption when handling large semi-structured datasets? To answer these questions, this thesis proposes a series of parallelization techniques and streaming computation models dedicated to semi-structured data.

More specifically, this thesis first proposes a grammar-aware parallelization (GAP) scheme for XPath query evaluation, which leverages the data grammar (e.g., a DTD file for XML) to prune unnecessary paths during enumerative execution. Then, it designs a streaming model for querying JSON data, which jointly compiles the JSON grammar and JSONPath queries into a dual-stack pushdown automaton and adopts a customized GAP scheme for its parallelization. Next, to effectively utilize both fine-grained (bitwise and SIMD) parallelism and coarse-grained (multi-core) parallelism, this thesis proposes a new design of bitwise structural index construction that can build leveled bitmaps for a single large JSON record in parallel, using a set of parallelization techniques specialized to JSON structures. Finally, this thesis combines the ideas of bitwise index construction and the stream processing model, yielding a novel on-demand parsing technique that skips parsing substructures of a JSON record that are irrelevant to the given JSON queries. All of the above techniques have been systematically evaluated with real-world datasets and standard JSON/XML queries, and have demonstrated significant performance benefits over existing solutions in terms of both query evaluation time and memory consumption.
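
To make the notion of leveled bitmaps concrete, the following is a minimal sketch and not the thesis's implementation. It scans a JSON record one character at a time and records, for each nesting level, a bitmap of the positions of colons that lie outside string literals. The function names, the use of Python integers as bitmaps, and the choice to count both objects and arrays as nesting levels are assumptions made for illustration; the thesis builds such indexes with bitwise/SIMD operations and in parallel across a single record, which this sequential sketch does not attempt to reproduce.

```python
def build_leveled_colon_bitmaps(record: str, max_levels: int) -> list[int]:
    """Return one integer bitmap per nesting level (hypothetical helper).

    Bit i of bitmaps[k] is set when record[i] is a ':' that lies outside
    any string literal and sits at nesting level k (0 = outermost).
    """
    bitmaps = [0] * max_levels
    depth = -1          # becomes 0 inside the outermost object or array
    in_string = False
    escaped = False
    for i, ch in enumerate(record):
        if in_string:
            # Inside a string, only track escapes and the closing quote.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch in "{[":
            depth += 1
        elif ch in "}]":
            depth -= 1
        elif ch == ":" and 0 <= depth < max_levels:
            bitmaps[depth] |= 1 << i
    return bitmaps


def colon_positions(bitmap: int) -> list[int]:
    """List the set bit positions of a bitmap in increasing order."""
    positions = []
    while bitmap:
        low = bitmap & -bitmap              # isolate the lowest set bit
        positions.append(low.bit_length() - 1)
        bitmap ^= low                       # clear it and continue
    return positions


record = '{"a":{"b":1,"c":"x:y"},"d":[{"e":2}]}'
levels = build_leveled_colon_bitmaps(record, max_levels=3)
for level, bitmap in enumerate(levels):
    print(level, colon_positions(bitmap))
```

With such an index, a query that only needs top-level fields can jump between the level-0 colon positions and never inspect the nested substructures in between, which is the intuition behind skipping irrelevant parts of a record.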

UC Riverside