Learning from Subsampled Data: Active and Randomized Strategies

Abstract

In modern statistical applications, we are often faced with situations where there is either too little or too much data. Both extremes can be troublesome: interesting models can only be learnt when sufficient amounts of data are available, yet these models tend to become intractable when data is abundant. An important thread of research addresses these difficulties by subsampling the data prior to learning a model. Subsampling can be active (i.e., active learning) or randomized. While both of these techniques have a long history, a direct application to novel situations is in many cases problematic. This dissertation addresses some of these issues.

We begin with an active learning strategy for spectral clustering when the cost of assessing individual similarities is substantial or prohibitive. We give an active spectral clustering algorithm which iteratively adds similarities based on information gleaned from a partial clustering and which improves over common alternatives.
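
As a rough illustration of the iterative scheme, the sketch below queries similarity entries one at a time and re-embeds after each answer. The spectral embedding is standard, but the closest-pair query heuristic, the budget interface, and all function names are assumptions made for exposition, not the dissertation's algorithm.

    import numpy as np
    from scipy.linalg import eigh

    # Illustrative sketch of active similarity querying for spectral
    # clustering. The closest-pair heuristic and all names are assumptions
    # for exposition, not the dissertation's algorithm.

    def spectral_embedding(S, k):
        """Top-k eigenvectors of the symmetrically normalized affinity matrix."""
        d = S.sum(axis=1)
        d[d == 0] = 1.0
        D = np.diag(1.0 / np.sqrt(d))
        _, vecs = eigh(D @ S @ D)
        return vecs[:, -k:]

    def active_spectral_clustering(query_similarity, n, k, budget):
        S = np.eye(n)                                # known self-similarities
        known = np.eye(n, dtype=bool)
        for _ in range(budget):
            X = spectral_embedding(S, k)
            best, best_dist = None, np.inf
            for i in range(n):                       # pick the most ambiguous
                for j in range(i + 1, n):            # unqueried pair
                    if not known[i, j]:
                        dist = np.linalg.norm(X[i] - X[j])
                        if dist < best_dist:
                            best, best_dist = (i, j), dist
            if best is None:
                break
            i, j = best
            S[i, j] = S[j, i] = query_similarity(i, j)   # pay the query cost
            known[i, j] = known[j, i] = True
        return S                                     # cluster on the final S

    # Toy usage: two well-separated point clouds; similarities on demand.
    rng = np.random.default_rng(0)
    pts = np.vstack([rng.normal(0, 0.3, (10, 2)), rng.normal(3, 0.3, (10, 2))])

    def sim(i, j):
        return np.exp(-np.sum((pts[i] - pts[j]) ** 2))

    S = active_spectral_clustering(sim, n=20, k=2, budget=60)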

Next, we consider active learning in Bayesian models. Complex Bayesian models often require an MCMC-based method for inference, which makes a naive application of common active learning strategies intractable. We propose an approximate active learning method which reuses samples from an existing MCMC chain in order to speed up the computations.
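
A minimal sketch of the reuse idea, under assumptions not taken from the dissertation: for each imagined answer to a candidate query, existing MCMC samples are reweighted by the likelihood of that answer (self-normalized importance sampling) rather than regenerated by a fresh chain. The Bayesian logistic model, the predictive-entropy proxy, and the probe point are all illustrative choices.

    import numpy as np

    # Hedged sketch of reusing MCMC samples for Bayesian active learning:
    # reweight existing posterior samples by the likelihood of an imagined
    # label instead of re-running the chain. Model and criterion are
    # illustrative, not the dissertation's exact method.

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def binary_entropy(p):
        p = np.clip(p, 1e-12, 1.0 - 1e-12)
        return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))

    def expected_entropy_after_query(x, thetas, probe):
        """E over imagined labels y of the reweighted predictive entropy."""
        p1 = sigmoid(thetas @ x)                  # p(y=1 | x, theta_s)
        total = 0.0
        for lik in (p1, 1.0 - p1):                # y = 1, then y = 0
            w = lik / lik.sum()                   # importance weights
            q = np.sum(w * sigmoid(thetas @ probe))
            total += lik.mean() * binary_entropy(q)
        return total

    def select_query(candidates, thetas, probe):
        """Choose the input whose imagined answer most reduces uncertainty."""
        scores = [expected_entropy_after_query(x, thetas, probe)
                  for x in candidates]
        return int(np.argmin(scores))

    # Toy usage with fake "posterior samples" standing in for an MCMC chain:
    rng = np.random.default_rng(0)
    thetas = rng.normal(size=(500, 3))
    candidates = [rng.normal(size=3) for _ in range(20)]
    print(select_query(candidates, thetas, probe=np.ones(3)))

In a real system one would average the proxy over many probe points or use a proper information-gain criterion; the point of the sketch is only that every candidate is scored from the same existing chain.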

Our third contribution looks at the effects of randomized subsampling on Gaussian process models that make predictions about outliers and rare events. Randomized subsampling risks making outliers even rarer, which, in the context of Gaussian process models, can lead to overfitting. We show that heavy-tailed stochastic processes can be used to improve the robustness of regression and classification estimators to such outliers by selectively shrinking them more strongly in sparse regions than in dense regions.
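
The robustness mechanism can be illustrated with a heavy-tailed noise model: large residuals receive small weights, so an isolated outlier is shrunk toward the fit. The kernel ridge regression with Student-t reweighting below is a simplified stand-in for the heavy-tailed processes studied in the dissertation, not its actual estimator.

    import numpy as np

    # Illustration of robustness via heavy tails: Student-t residual weights
    # downweight large errors, so outliers are shrunk toward the fit. This
    # IRLS kernel ridge regression is a simplified stand-in, not the
    # dissertation's heavy-tailed process estimator.

    def rbf_kernel(X1, X2, lengthscale=1.0):
        d2 = (X1[:, None] - X2[None, :]) ** 2
        return np.exp(-0.5 * d2 / lengthscale ** 2)

    def robust_kernel_regression(X, y, nu=3.0, sigma2=0.1, lam=1e-2, iters=20):
        K = rbf_kernel(X, X)
        w = np.ones_like(y)                          # per-point weights
        for _ in range(iters):
            W = np.diag(w)
            alpha = np.linalg.solve(W @ K + lam * np.eye(len(y)), W @ y)
            r = y - K @ alpha                        # current residuals
            w = (nu + 1.0) / (nu + r ** 2 / sigma2)  # t-model E-step weights
        return alpha

    # Toy usage: smooth signal plus one gross outlier.
    rng = np.random.default_rng(0)
    X = np.sort(rng.uniform(0.0, 6.0, 40))
    y = np.sin(X) + 0.1 * rng.normal(size=40)
    y[5] += 4.0                                      # inject one outlier
    alpha = robust_kernel_regression(X, y)
    print(rbf_kernel(X[5:6], X) @ alpha, "vs. corrupted value", y[5])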

Finally, we turn to a theoretical evaluation of randomized subsampling for the purpose of inferring rankings of objects. We present two simple algorithms that predict a total order over n objects from a randomized subsample of binary comparisons. In expectation, the algorithms match an Ω(n) lower bound on the sample complexity for predicting a permutation with fixed expected Kendall tau distance. Furthermore, we show that given O(n log n) samples, one algorithm recovers the true ranking with uniform quality, while the other predicts the ranking more accurately near the top than the bottom. Due to their simple form, the algorithms can be easily extended to online and distributed settings.
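
In that spirit, one simple comparison-based ranker sorts items by their empirical win rate over randomly sampled pairs (a Borda-count-style rule). The sketch below is an illustrative stand-in; the dissertation's two algorithms and their guarantees are not reproduced here.

    import numpy as np

    # Illustrative ranking from a random subsample of pairwise comparisons:
    # sort items by empirical win rate (a Borda-count-style rule). The
    # sampling scheme, noise model, and names are assumptions for exposition.

    def rank_from_comparisons(n, compare, num_samples, rng):
        wins = np.zeros(n)
        trials = np.full(n, 1e-9)                  # avoid division by zero
        for _ in range(num_samples):
            i, j = rng.choice(n, size=2, replace=False)
            wins[compare(i, j)] += 1               # record the winner
            trials[i] += 1
            trials[j] += 1
        return np.argsort(-(wins / trials))        # best item first

    # Toy usage: true order is by index, with 10% comparison noise.
    rng = np.random.default_rng(0)
    n = 100

    def noisy_compare(i, j):
        best = min(i, j)                           # smaller index is better
        return best if rng.random() > 0.1 else i + j - best

    ranking = rank_from_comparisons(n, noisy_compare, 5 * n, rng)
    print(ranking[:10])                            # roughly the top items

Because each comparison only increments per-item counters, the same rule runs online (update counts as comparisons arrive) or distributed (sum counts across workers).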
