Skip to main content
eScholarship
Open Access Publications from the University of California

UC Berkeley

UC Berkeley Electronic Theses and Dissertations bannerUC Berkeley

A Study of High-dimensional Clustering and Statistical Inference on Networks

Abstract

Clustering is an important unsupervised classification technique. In supervised classification, we are provided with a collection of labeled (pre-classified) patterns and the problem is to label a newly encountered, yet unlabeled, pattern. At first, we consider clustering in Euclidean space in large dimensions. Then, we delve into the discrete setting of networks. We go into the issues related to network modeling and then into a specific method of clustering in networks.

In the first chapter, we consider the problem of estimation and deriving theoretical properties of the estimators for the elliptical distributions. The class of elliptical distributions have distributions with varied tail behavior. So, estimation under class of elliptic distributions lead to automatic robust estimators. The goal of the chapter is to propose efficient and adaptive regularized estimators for the nonparametric component, mean and covariance matrix of the elliptical distributions in both high and fixed dimensional situations. An algorithm for regularized estimation of mixture of elliptical distributions will also lead to an algorithm for finding elliptical clusters in high dimensional space and such an approach is also given in the chapter. In clustering, one of the main challenges is the detection of number of clusters. Most clustering algorithms need the number of clusters to be specified beforehand. In chapter two, we propose a new method of selecting number of clusters, based on hypothesis testing.

The study of networks has received increased attention recently not only from the social sciences and statistics but also from physicists,computer scientists and mathematicians. But a proper statistical analysis of features of different stochastic models of networks is still underway. In chapter three, we give an account of different network models and then we analyze a specific nonparametric model for networks. We consider the nonparametric estimate of link probabilities in dense social graphs in the context of network modeling and exploratory statistics.

In chapter four, we also propose bootstrap methods for finding empirical distribution of count features or `moments' and smooth functions of these for the networks. Using these methods, we can not only estimate variance of count features but also get good estimates of such feature counts, which are usually expensive to compute numerically in large networks. In our paper, we prove theoretical properties of the bootstrap variance estimates of the count features as well as show their efficacy through simulation. We also use the method on publicly available Facebook network data for estimate of variance and expectation of some count features.

In chapter five, we propose a clustering or community detection scheme for networks. One of the principal problem in networks is community detection. Many algorithms have been proposed for community finding but most of them do not have have theoretical guarantee for sparse networks and networks close to phase transition boundary proposed by physicists. There are some exceptions but all have incomplete theoretical basis. Here we propose an algorithm based on the graph distance of vertices in the network. We give theoretical guarantees that our method works in identifying communities for sparse block models. We illustrate on a network of political blogs, Facebook networks and some other networks.

Main Content
For improved accessibility of PDF content, download the file to your device.
Current View