eScholarship
Open Access Publications from the University of California

UC San Diego

UC San Diego Electronic Theses and Dissertations

Exploitation of Metadata in Molecular Genomics Studies

Abstract

There is a great deal of interest in analyzing very large data sets in the biomedical sciences. This interest is driven by the availability of high-throughput assays, such as DNA sequencing technologies and high-resolution imaging devices, by advances in data storage and high-performance computing, and by analytic techniques rooted in artificial intelligence and machine learning. However, many modern data sets are constructed from individual component data sets, which creates issues for data harmonization and scientific integration. ‘Metadata,’ i.e., data about the data within component data sets, can be used to facilitate integration and the drawing of inferences from the combined data sets, but doing so requires care and is sensitive to how those data can be used. Metadata also arise in situations in which the combination of data sets is more subtle, such as the analysis of species differences in evolutionary studies, where the species data are often collected independently with different techniques. In such cases it is important to know which specific protocols and techniques were used in order to organize and enable relevant comparisons and to avoid batch effects, false positives, and other phenomena associated with heterogeneous data sets. I describe the application of statistical methods in four different contexts in which metadata are available. First, I describe an analysis involving the classification of emotions recorded as part of a digital therapeutic implemented in a smartphone app designed to reduce stress. Metadata arise when considering the sources and settings of individual data collections. Second, I consider an analysis relating fibroblast transcriptomes to longevity across 49 avian species, where each species has a unique genome but only a subset of species have available reference genomes. Third, I describe studies exploring variation in single-cell gene expression patterns from studies of the human brain, using expression profiles that were generated with different protocols and have different quality control profiles. Fourth, I consider the analysis of genetically mediated drug targets for longevity, in which information from different sources is used to make more compelling and comprehensive statements about the candidacy of any one gene for drug development. I also consider general themes about the use of metadata in contemporary biomedical sciences and discuss areas for future research.
