eScholarship
Open Access Publications from the University of California

UC Irvine
UC Irvine Electronic Theses and Dissertations

Toward a More Accurate Genome: Algorithms for the Analysis of High-Throughput Sequencing Data

Abstract

High-throughput sequencing enables basic and translational biology to query the mechanics of both life and disease at single-nucleotide resolution and with breadth that spans the genome. This revolutionary technology is a major tool in biomedical research, shaping our understanding of life's most basic mechanics and influencing human health and medicine. Unfortunately, it produces very large, error-prone datasets that require substantial computational processing before experimental conclusions can be drawn. Because errors and hidden biases in the data may influence empirically derived conclusions, accurate algorithms and models of the data are critical. This thesis focuses on the development of statistical models for high-throughput sequencing data that handle errors and are built to reflect biological realities.

First, we increase the fraction of the genome that can be reliably queried in biological experiments using high-throughput sequencing by expanding analysis into repeat regions of the genome. Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) allows partial observation of gene regulatory network topology through the identification of transcription factor binding sites. Binding site clustering, or "peak-calling", can be frustrated by the complex, repetitive nature of genomes. Traditionally, these regions are censored from any interpretation, but we re-enable their interpretation using a probabilistic method for realigning problematic DNA reads.
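To illustrate the general idea of probabilistically realigning ambiguous reads, the sketch below fractionally assigns multi-mapped reads among their candidate loci with a simple EM-style loop, weighting each locus by its current estimated coverage. This is a minimal, hypothetical example; the function and variable names, the pseudocount, and the iteration scheme are assumptions for illustration, not the model developed in the dissertation.

```python
# Hypothetical EM-style fractional assignment of multi-mapped reads.
# Names and constants are illustrative, not taken from the dissertation.
from collections import defaultdict

def realign_multireads(unique_cov, multireads, n_iter=20):
    """Fractionally assign multi-mapped reads to their candidate loci.

    unique_cov : dict mapping locus -> read count from uniquely mapped reads
    multireads : list of lists; each inner list holds the candidate loci
                 for one multi-mapped read
    Returns a dict of locus -> expected total coverage.
    """
    # Start from coverage given by unique reads only (plus a pseudocount).
    cov = defaultdict(lambda: 1.0)
    for locus, count in unique_cov.items():
        cov[locus] += count

    for _ in range(n_iter):
        # E-step: weight each candidate locus by current estimated coverage.
        weights = []
        for loci in multireads:
            total = sum(cov[l] for l in loci)
            weights.append({l: cov[l] / total for l in loci})

        # M-step: recompute coverage as unique reads plus the expected
        # contributions of the multi-mapped reads.
        cov = defaultdict(lambda: 1.0)
        for locus, count in unique_cov.items():
            cov[locus] += count
        for w in weights:
            for locus, p in w.items():
                cov[locus] += p
    return dict(cov)

# Toy usage: the repeat copy with strong unique-read support attracts
# most of the ambiguous reads.
unique = {"chr1:1000": 30, "chr1:5000": 2}
multi = [["chr1:1000", "chr1:5000"]] * 10
print(realign_multireads(unique, multi))
```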

Second, we leverage high-throughput sequencing data for the empirical discovery of underlying epigenetic cell state, enabled by analyzing combinations of histone marks. We use a novel probabilistic model to perform spatial and temporal clustering of histone marks and to capture mark combinations that correlate well with cell activity. In a first for epigenetic modeling with high-throughput sequencing data, we not only pool information across cell types but directly model the relationships between them, improving predictive power across several datasets.
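As a rough illustration of clustering genomic bins by their histone-mark combinations, the sketch below fits a Bernoulli mixture model by EM to binarized mark signals. It is a deliberate simplification of the kind of model described above: it omits the spatial chain along the genome and the explicit coupling between cell types, and all names and parameter choices are assumptions for illustration.

```python
# Minimal sketch: cluster genomic bins by histone-mark combinations with a
# Bernoulli mixture fit by EM. Omits spatial and cross-cell-type structure.
import numpy as np

def fit_bernoulli_mixture(X, k, n_iter=50, seed=0):
    """X: (bins x marks) binary matrix of mark presence; k: number of states."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(k, 1.0 / k)                 # state frequencies
    theta = rng.uniform(0.25, 0.75, (k, d))  # P(mark present | state)

    for _ in range(n_iter):
        # E-step: posterior probability of each state for each bin.
        log_lik = (X @ np.log(theta).T) + ((1 - X) @ np.log(1 - theta).T)
        log_post = np.log(pi) + log_lik
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)

        # M-step: update state frequencies and mark emission probabilities.
        nk = post.sum(axis=0)
        pi = nk / n
        theta = np.clip((post.T @ X) / nk[:, None], 1e-3, 1 - 1e-3)
    return pi, theta, post

# Toy usage: two synthetic "states" with distinct mark combinations.
rng = np.random.default_rng(1)
X = np.vstack([rng.random((100, 4)) < [0.9, 0.8, 0.1, 0.1],
               rng.random((100, 4)) < [0.1, 0.1, 0.9, 0.7]]).astype(float)
pi, theta, post = fit_bernoulli_mixture(X, k=2)
print(np.round(theta, 2))
```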

Third, we develop a scalable approach to genome assembly from high-throughput sequencing reads. While several assembly solutions exist, most do not scale well to large datasets, requiring computers with copious memory to assemble large genomes. As throughput continues to increase, the large datasets available today and in the near future will require truly scalable methods. We present a promising distributed method for genome assembly that partitions the de Bruijn graph across many computers and seamlessly spills to disk when main memory is insufficient. We also present novel graph-cleaning algorithms that should handle the increased errors of large datasets better than traditional cleaning based on graph structure alone.
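The sketch below illustrates the underlying partitioning idea: hashing each k-mer to a shard so that the de Bruijn graph can be spread across machines (or streamed to disk) rather than held in one process's memory, with a simple coverage filter standing in for graph cleaning. The k-mer size, partition count, and minimum-count threshold are illustrative assumptions, not the dissertation's actual design.

```python
# Hypothetical sketch of partitioning a de Bruijn graph by hashing k-mers.
# Only nodes are stored; edges are implicit in (k-1)-base overlaps.
import zlib
from collections import Counter

K = 5             # k-mer length (tiny for the example)
N_PARTITIONS = 4  # stand-in for the number of machines / disk shards

def kmers(read, k=K):
    return (read[i:i + k] for i in range(len(read) - k + 1))

def partition_of(kmer):
    # A stable hash decides which machine/shard owns this k-mer.
    return zlib.crc32(kmer.encode()) % N_PARTITIONS

def build_partitioned_graph(reads, min_count=2):
    # Count k-mers per partition; each partition could live on its own node
    # or be spilled to disk when memory runs low.
    counts = [Counter() for _ in range(N_PARTITIONS)]
    for read in reads:
        for km in kmers(read):
            counts[partition_of(km)][km] += 1

    # Simple coverage-based cleaning: drop low-count k-mers as likely
    # sequencing errors (a stand-in for more elaborate cleaning).
    graph = [set() for _ in range(N_PARTITIONS)]
    for p, counter in enumerate(counts):
        for km, c in counter.items():
            if c >= min_count:
                graph[p].add(km)
    return graph

reads = ["ACGTACGTGA", "CGTACGTGAT", "ACGTACGTGA"]
print([sorted(part) for part in build_partitioned_graph(reads)])
```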

High-throughput sequencing plays an important role in biomedical research and has already affected human health and medicine. Future experimental procedures will continue to rely on statistical methods to provide crucial error and bias correction, in addition to modeling expected outcomes. Thus, further development of robust statistical models is critical to the future of high-throughput sequencing, ensuring a strong foundation for correct biological conclusions.
