eScholarship
Open Access Publications from the University of California

UCLA Electronic Theses and Dissertations

Statistical Inference for Large and Complex Data

No data is associated with this publication.
Abstract

Statistical inference aims to quantify the uncertainty in parameters or functions estimated by a statistical procedure and lies at the heart of modern decision-making. However, when data sets become large and high-dimensional, as is the case in many modern health-related applications (electronic health records, multiomics, imaging data, etc.), classical statistical inference tools fail due to computational and methodological issues. The problem is further exacerbated when data sets also exhibit dependency structures or nonignorable missingness due to censoring. This dissertation summarizes our efforts to address some of these challenges.

Specifically, Chapter 1 provides a method based on the bag of little bootstraps (BLB) for conducting statistical inference with linear mixed models on massive and distributed longitudinal data sets such as electronic health records. For statistical inference on variance component parameters, our software package MixedModelsBLB.jl achieves a 200-fold speedup at the scale of 1 million subjects (20 million total observations) and is currently the only available tool that can handle more than 10 million subjects (200 million total observations) on a desktop computer.
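The core BLB idea can be illustrated outside the mixed-model setting. The sketch below, a minimal Python analogue (the dissertation's actual implementation is the Julia package MixedModelsBLB.jl, and the function name and defaults here are illustrative), estimates the standard error of a sample mean: each small subset of size b = n^gamma is "inflated" to a size-n resample via multinomial weights, so only b distinct points are ever touched.

```python
import numpy as np

def blb_stderr(x, gamma=0.6, n_subsets=5, n_boot=50, rng=None):
    """Bag of little bootstraps estimate of the standard error of the mean.

    Each subset of size b = n**gamma is resampled to full size n using
    multinomial weights, so the cost per bootstrap replicate scales with
    b, not n.
    """
    rng = np.random.default_rng(rng)
    n = len(x)
    b = int(n ** gamma)
    subset_ses = []
    for _ in range(n_subsets):
        sub = rng.choice(x, size=b, replace=False)
        reps = []
        for _ in range(n_boot):
            # multinomial weights emulate a size-n resample of the subset
            w = rng.multinomial(n, np.full(b, 1.0 / b))
            reps.append(np.sum(w * sub) / n)  # weighted mean
        subset_ses.append(np.std(reps, ddof=1))
    # average the per-subset uncertainty estimates
    return float(np.mean(subset_ses))
```

Because each subset's resamples are independent, the outer loop parallelizes naturally across machines, which is what makes the approach attractive for distributed data.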

Chapter 2 provides an extremely flexible and general framework, proximal Markov chain Monte Carlo (ProxMCMC), for conducting statistical inference on constrained or regularized estimation procedures, which are indispensable for analyzing high-dimensional data but whose inference has long been considered difficult. Many frequently encountered statistical learning tasks, such as the constrained lasso, the graphical lasso, matrix completion, and sparse low-rank matrix regression, fall into this category.
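To give a flavor of the proximal idea, the following sketch (a simplified, unadjusted proximal Langevin sampler, not the dissertation's ProxMCMC implementation; the function names and step sizes are illustrative) targets a lasso-type posterior exp(-f(x) - alpha*||x||_1). The nonsmooth l1 term is replaced by its Moreau-Yosida envelope, whose gradient is available in closed form through the soft-thresholding proximal operator.

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of t*||.||_1 (soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def prox_langevin(grad_f, alpha, x0, step=0.01, lam=0.1, n_iter=5000, rng=None):
    """Unadjusted proximal Langevin sampler (a ProxMCMC-style sketch).

    The gradient of the Moreau-Yosida envelope of alpha*||x||_1 is
    (x - prox(x)) / lam, so the Langevin update never needs a
    subgradient of the nonsmooth penalty.
    """
    rng = np.random.default_rng(rng)
    x = np.asarray(x0, dtype=float).copy()
    samples = []
    for _ in range(n_iter):
        grad_env = (x - soft_threshold(x, lam * alpha)) / lam
        x = (x - step * (grad_f(x) + grad_env)
             + np.sqrt(2 * step) * rng.standard_normal(x.shape))
        samples.append(x.copy())
    return np.array(samples)
```

The same recipe applies whenever the constraint or penalty has a cheap proximal map, which is what makes the framework cover tasks like the graphical lasso and matrix completion.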

Chapter 3 provides tools for the estimation of, and inference with, heteroscedastic linear models for analyzing censored data using synthetic variables. Our motivating applications are adjusting for treatment effects in studies of quantitative traits and variance quantitative trait loci (vQTL) analysis, both of which arise frequently in genetic and epidemiological studies. Our method, however, is general and computationally scalable enough to be applied in other fields where censored data arise, for example, from measurements that fall below the limit of detection.
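One classical synthetic-variable construction (shown here only as background intuition; the dissertation's method is more general) is the inverse-probability-of-censoring-weighted transform Y* = delta * Y / S_C(Y), which has the same conditional mean as the uncensored response, so ordinary least squares on Y* remains consistent under right censoring. The sketch below assumes the censoring survival function S_C is known; in practice it would be estimated, e.g., by Kaplan-Meier.

```python
import numpy as np

def ipcw_synthetic(y_obs, delta, surv_c):
    """Synthetic responses for right-censored regression.

    y_obs  : observed value min(Y, C)
    delta  : 1 if uncensored (Y <= C), 0 otherwise
    surv_c : survival function of the censoring time C, assumed known
             here for simplicity (an assumption of this sketch)

    Since E[delta / S_C(Y) | Y] = 1, the synthetic response
    Y* = delta * Y / S_C(Y) satisfies E[Y* | X] = E[Y | X].
    """
    y_obs = np.asarray(y_obs, dtype=float)
    delta = np.asarray(delta, dtype=float)
    return delta * y_obs / surv_c(y_obs)
```

After the transform, any standard (or heteroscedastic) linear-model machinery can be run on (X, Y*) as if the data were fully observed, at the cost of some added variance from the weighting.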


This item is under embargo until December 12, 2024.