eScholarship
Open Access Publications from the University of California

UC San Diego

UC San Diego Electronic Theses and Dissertations

Simplifying Data Preparation for Machine Learning on Tabular Data

Abstract

Machine learning (ML) over tabular data has become ubiquitous, with applications in many domains. This success has led to the rise of ML platforms, including automated ML (AutoML) platforms, that manage the end-to-end ML workflow. The tedious grunt work involved in data preparation (prep) reduces data scientist productivity and slows down the ML development lifecycle, which makes automating data prep all the more critical. While many works have studied feature engineering and model selection in end-to-end ML workflows, little attention has been paid to understanding data prep and its utility for ML. Moreover, automating data prep remains challenging for several reasons, such as semantic gaps and the lack of ways to objectively measure accuracy. In this dissertation, we take a step towards addressing these challenges, using database schema management and ML techniques to simplify, better automate, and understand the utility of ML data prep. We create new benchmark datasets, develop methodology for benchmarking and automating ML data prep, and devise novel empirical analyses to characterize the significance of critical data prep steps. Our work presents several artifacts that not only provide a systematic approach to reducing grunt work and improving the productivity of ML practitioners but can also help establish the science of building (Auto)ML platforms. It also opens up several new research directions at the intersection of ML, data management, and ML system design.
