eScholarship
Open Access Publications from the University of California

UC Berkeley

UC Berkeley Electronic Theses and Dissertations

Statistical Learning Methods to Identify Latent Patterns in Sequential and High-dimensional Biological Data

No data is associated with this publication.
Abstract

Rapid advances in biological technology have produced a wealth of data. One intriguing class of problems combines the large-scale data of the current era with traditional small-scale data: when laboratory experiments cannot generate true labels as fast as data are generated, new statistical methods are needed to detect latent patterns and to predict labels for large volumes of unlabelled data using the information in the relatively small set of existing labels. We studied two problems of this kind. First, to extract latent patterns and identify functionally similar polymer sequences in unsupervised sequential datasets, we developed a scalable probabilistic graphical model and efficient stochastic variational methods. The model yields results on proton transport performance that are consistent with wet lab experiments, and shows potential to aid in predicting the performance of synthesized polymers. Second, to extract latent patterns and identify functionally similar genes in semi-supervised high-dimensional datasets, we proposed GeneFishing, a computational framework involving dimension reduction, clustering, sub-sampling and bagging-inspired aggregation. It successfully identified genes relevant to certain biological processes in a context-specific manner. This thesis represents an effort to integrate classic statistical concepts with modern large-scale machine learning techniques to estimate and predict from rapidly growing datasets, and to extract hidden patterns from a variety of data types and problem formulations.

Main Content

This item is under embargo until February 28, 2026.