eScholarship
Open Access Publications from the University of California

UC San Diego Electronic Theses and Dissertations

A Principled Approach to Trustworthy Machine Learning

Abstract

Traditional machine learning operates under the assumption that training and testing data are drawn independently from the same distribution. However, this assumption does not always hold. In this thesis, we take a principled approach toward three major challenges that arise when it fails: i) robustness to adversarial inputs, ii) handling unseen examples at test time, and iii) avoiding learning spurious correlations.

We first study what happens when small adversarial perturbations are made to the inputs. We begin with neural networks, which frequently operate on natural datasets such as images. We find that in these datasets, differently labeled examples are often far apart from each other. Under this separation condition, we prove that a perfectly robust and accurate classifier exists, suggesting that there is no intrinsic tradeoff between adversarial robustness and accuracy on these datasets. Next, we turn to non-parametric classifiers, which often operate on datasets without such a separation condition. We design a defense algorithm, adversarial pruning, that successfully improves the robustness of many non-parametric classifiers, including k-nearest neighbors, decision trees, and random forests. Adversarial pruning can also be seen as a finite-sample approximation to the classifier with the highest accuracy under robustness constraints. Finally, we connect robustness and interpretability in decision trees by designing an algorithm that is guaranteed to achieve good accuracy, robustness, and interpretability when the data is linearly separable.
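
To illustrate the idea behind adversarial pruning, the sketch below greedily removes training points that lie within a chosen robustness radius of a differently labeled point and then fits a standard k-nearest-neighbors classifier on what remains. This is a minimal sketch, not the thesis's formulation, which poses pruning as an optimal removal problem; the radius r, the greedy heuristic, and the synthetic data are assumptions made purely for illustration.

```python
# Illustrative sketch of adversarial pruning for a k-NN classifier.
# Assumptions: the robustness radius r, the greedy conflict-removal
# heuristic, and the synthetic data are chosen for illustration only;
# the thesis frames pruning as an optimal subset-removal problem.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def adversarial_prune(X, y, r):
    """Remove points until no two differently labeled points are within 2r."""
    keep = np.ones(len(X), dtype=bool)
    changed = True
    while changed:
        changed = False
        idx = np.where(keep)[0]
        D = np.linalg.norm(X[idx, None, :] - X[None, idx, :], axis=-1)
        conflicts = (D < 2 * r) & (y[idx, None] != y[None, idx])
        counts = conflicts.sum(axis=1)
        if counts.max() > 0:
            # Greedily drop the point involved in the most conflicts.
            keep[idx[counts.argmax()]] = False
            changed = True
    return X[keep], y[keep]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    # Labels are noisy near the decision boundary, creating conflicting pairs.
    y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)
    Xp, yp = adversarial_prune(X, y, r=0.1)
    clf = KNeighborsClassifier(n_neighbors=3).fit(Xp, yp)
    print(f"kept {len(Xp)} of {len(X)} points; accuracy on original data = {clf.score(X, y):.2f}")
```

The pruned set contains no pair of differently labeled points closer than 2r, which is the separation property the defense is built around.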

Next, we explore what happens when examples drawn from outside the training distribution, i.e., out-of-distribution (OOD) examples, are given to a model as input at test time. We find that neural networks tend to predict an OOD input as the label of its closest training example, and that adversarially robust networks amplify this behavior. These findings shed light on several long-standing questions surrounding generalization, including how adversarially robust training methods change the decision boundary, why adversarially robust networks perform better on corrupted data, and when OOD examples are hard to detect.
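
One simple way to probe this finding, assuming access to a trained classifier and to the training and OOD inputs as arrays (the names model, X_train, y_train, and X_ood below are hypothetical), is to measure how often the model's prediction on an OOD input matches the label of its nearest training example. Distances are taken in input space here purely for simplicity.

```python
# Sketch: estimate how often a model's prediction on an OOD input matches
# the label of the nearest training example. `model`, `X_train`, `y_train`,
# and `X_ood` are assumed to be supplied by the surrounding experiment.
import numpy as np

def nn_label_agreement(model, X_train, y_train, X_ood):
    preds = model.predict(X_ood)                     # model's labels on the OOD inputs
    agree = 0
    for x, p in zip(X_ood, preds):
        nearest = np.argmin(np.linalg.norm(X_train - x, axis=1))
        agree += int(p == y_train[nearest])          # match with closest training label?
    return agree / len(X_ood)
```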

Finally, we investigate the case where a few training examples of a certain class contain a spurious feature. We find that as few as three such spurious examples can cause the network to learn a spurious correlation. This result suggests that neural networks are highly sensitive to small amounts of training data; while this sensitivity enables efficient learning, it also leads to the rapid learning of spurious correlations.
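
As a rough sketch of the kind of setup this refers to, the snippet below plants a small spurious patch on just three training images of one class. The patch size and location, the (N, H, W, C) image layout, and the helper name add_spurious_examples are illustrative assumptions rather than the thesis's exact protocol.

```python
# Sketch: inject a spurious feature (a small bright corner patch) into a
# handful of training images of one target class. The patch size/position,
# the number of poisoned examples (3), and the float (N, H, W, C) image
# layout are assumptions for illustration.
import numpy as np

def add_spurious_examples(images, labels, target_class, n_spurious=3, patch=4):
    images = images.copy()
    cls_idx = np.where(labels == target_class)[0][:n_spurious]
    for i in cls_idx:
        images[i, :patch, :patch, :] = 1.0   # paint a small white patch in the corner
    return images, cls_idx
```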
