UC San Diego Electronic Theses and Dissertations

Competitive and Universal Learning

Abstract

Modern data science calls for statistical inference algorithms that are both data-efficient and computationally efficient. We design and analyze methods that 1) outperform existing algorithms in the high-dimensional setting and 2) are as simple as possible while remaining efficiently computable.

One line of our work aims to bridge different fields and offer unified solutions to multiple questions. Consider the following canonical statistical inference tasks: distribution estimation, functional estimation, and property testing, all of which share the same model: sample access to an unknown discrete distribution. In a recent paper in NeurIPS '19, we showed that a single, simple, and unified estimator, profile maximum likelihood (PML), together with its near-linear-time computable variant, is sample-optimal for estimating multiple distribution attributes. The result covers 1) any appropriately Lipschitz and additively separable functional; 2) the sorted distribution; 3) Rényi entropy; 4) the $\ell_2$ distance to the uniform distribution, yielding an optimal tester for the closeness of distributions. This work makes PML the first unified sample- and time-optimal method for the learning tasks mentioned above. A single algorithm with such broad applicability is \emph{universal}.
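For concreteness, here is one standard way to state the PML estimator; the notation is ours, and the dissertation's formulation may differ in details. Given a sample $x^n$ of size $n$, let $\varphi(x^n)$ denote its profile, the multiset of multiplicities of the observed symbols. The PML distribution maximizes the probability of observing this profile,
$$ p_{\mathrm{PML}}(x^n) \;\in\; \arg\max_{p} \; \Pr_{X^n \sim p}\bigl(\varphi(X^n) = \varphi(x^n)\bigr), $$
and an attribute such as an additively separable functional $f$ is then estimated by the plug-in value $f\bigl(p_{\mathrm{PML}}(x^n)\bigr)$.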

Another line of our work focuses on instance-optimal learning, which designs algorithms with near-optimal guarantees for every possible data input. A flagship problem is distribution estimation over discrete or continuous domains, where ordering and geometry play an essential role. Going beyond worst-case guarantees, researchers have designed algorithms that compete with a genie estimator that knows the actual distribution but is reasonably restricted. To obtain state-of-the-art algorithms for both settings, we leveraged the simple but nontrivial idea of ``data binning''. For discrete settings, we group symbols that appear the same number of times; for continuous settings, we partition the real domain and separate symbols according to pre-designed local quantiles (a minimal sketch of the discrete case follows). The respective algorithms run in near-linear time, achieve the best-known estimation guarantees relative to the genie estimators, and appear in ICML '19 and NeurIPS '20. A genie-like algorithm adaptive to almost every data sample is \emph{competitive}.
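As a rough illustration of the discrete binning step, the sketch below groups symbols by their multiplicity and assigns every symbol in a bin a common probability. The per-bin value here uses the classical Good–Turing rule purely as a hypothetical stand-in; the competitive estimators in the dissertation apply more refined per-bin smoothing.

from collections import Counter

def bin_by_multiplicity(sample):
    # Group observed symbols by their multiplicity (number of appearances).
    counts = Counter(sample)
    bins = {}
    for symbol, r in counts.items():
        bins.setdefault(r, []).append(symbol)
    return bins

def binned_estimate(sample):
    # Assign every symbol in the same multiplicity bin a common probability.
    # The Good-Turing rule below is only an illustrative choice of per-bin value,
    # not the dissertation's estimator.
    n = len(sample)
    bins = bin_by_multiplicity(sample)
    phi = {r: len(symbols) for r, symbols in bins.items()}  # number of symbols per bin
    estimate = {}
    for r, symbols in bins.items():
        # Good-Turing: total mass of symbols seen r times is about (r+1) * phi_{r+1} / n.
        bin_mass = (r + 1) * phi.get(r + 1, 0) / n
        if bin_mass == 0:
            # Fall back to the empirical mass for bins with no larger neighbor.
            bin_mass = r * len(symbols) / n
        for s in symbols:
            estimate[s] = bin_mass / len(symbols)
    return estimate

For instance, on the string ``abracadabra'', the symbols ``c'' and ``d'' fall in the same bin (multiplicity 1) and receive identical estimates, as do ``b'' and ``r'' (multiplicity 2).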

We present a comprehensive understanding of universal and competitive algorithms for multiple fundamental learning problems. Our ideas and techniques may shed light on key challenges in modern data science and numerous applications beyond the scope of this dissertation.
