eScholarship
Open Access Publications from the University of California

UC San Diego

UC San Diego Electronic Theses and Dissertations

Information-theoretic and hypothesis-based clustering in bioinformatics

Abstract

Many machine learning problems in biology involve clustering data generated in complex or incompletely understood ways. Processes such as protein and viral evolution are difficult to model, involving complex mechanisms and constraints at multiple levels. This thesis presents a family of clustering algorithms, based on the Information Bottleneck method, to cluster such datasets by imposing constraints related to statistical tests of their known properties. The first algorithm clusters continuous data; we apply it to amino acid profiles to derive a compact discrete representation that preserves much of their information. This discretization yields an easily interpretable textual representation of amino acid profiles. It also greatly improves the speed of profile-profile alignment, and makes it possible to index large profile databases. The second algorithm clusters discrete sequences while constraining mutual information between sequence positions within each cluster. We apply it to the problem of finding population substructure in viral and human SNP data, showing it to be competitive with or superior to current approaches. Biological datasets often strain the limits of modern computers, and advances in biotechnology promise to generate even more data in the future as computational power increases. We therefore present a randomized clustering algorithm for discrete sequences that is similar to the previous algorithm but scalable to much larger datasets. This clustering algorithm relies on statistical tests to perform structure learning, an approach that has the added benefit of naturally limiting model complexity. We use this algorithm to produce detailed phylogenies of large DNA mobile element families. Our results provide a more detailed picture of their history, and their important role in genomic evolution.
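
The quantity constrained by the second algorithm, mutual information between sequence positions within a cluster, can be illustrated with a small sketch. This is not the thesis's algorithm: the function name `mutual_information`, the toy sequences, and the hand-picked cluster split are all assumptions made for illustration. It only shows why population substructure appears as dependence between positions in pooled data and can vanish once the sequences are split into clusters.

```python
from collections import Counter
from math import log2


def mutual_information(seqs, i, j):
    """Empirical mutual information (in bits) between positions i and j.

    `seqs` is a list of equal-length discrete sequences (hypothetical toy data).
    """
    n = len(seqs)
    joint = Counter((s[i], s[j]) for s in seqs)  # counts of symbol pairs
    count_i = Counter(s[i] for s in seqs)        # marginal counts at position i
    count_j = Counter(s[j] for s in seqs)        # marginal counts at position j
    mi = 0.0
    for (a, b), c in joint.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts
        mi += (c / n) * log2(c * n / (count_i[a] * count_j[b]))
    return mi


# Pooled toy population: the two positions are perfectly correlated.
pooled = ["AA", "AA", "CC", "CC"]
# Hand-picked split into two clusters (an assumption, not a learned clustering).
cluster_1 = ["AA", "AA"]
cluster_2 = ["CC", "CC"]

print(mutual_information(pooled, 0, 1))     # 1.0 bit in the pooled data
print(mutual_information(cluster_1, 0, 1))  # 0.0 within the first cluster
print(mutual_information(cluster_2, 0, 1))  # 0.0 within the second cluster
```

In the pooled data the two positions share one full bit of information, while within each cluster they share none, which is the kind of within-cluster condition the abstract describes constraining.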
