Recent cbmb items

Variants in the CDKN2B and RTEL1 regions are associated with high-grade glioma susceptibility

Wed, 8 Jul 2009 00:00:00 +0000

Identification of yeast transcriptional regulation networks using

Wed, 13 May 2009 00:00:00 +0000

The recent availability of whole-genome scale data sets that investigate complementary and diverse aspects of transcriptional regulation has spawned an increased need for new and effective computational approaches to analyze and integrate these large scale assays. Here, we propose a novel algorithm, based on random forest methodology, to relate gene expression (as derived from expression microarrays) to sequence features residing in gene promoters (as derived from DNA motif data) and transcription factor binding to gene promoters (as derived from tiling microarrays). We extend the random forest approach to model a multivariate response as represented, for example, by time-course gene expression measures. An analysis of the multivariate random forest output reveals complex regulatory networks, which consist of cohesive, condition-dependent regulatory cliques. Each regulatory clique features homogeneous gene expression profiles and common motifs or synergistic motif groups. We...

A Novel Topology for Representing Protein Folds

Tue, 31 Mar 2009 00:00:00 +0000

Various topologies for representing three dimensional protein structures have been advanced for purposes ranging from prediction of folding rates to ab initio structure prediction. Examples include relative contact order, Delaunay tessellations, and backbone torsion angle distributions. Here we introduce a new topology based on a novel means for operationalizing three dimensional proximities with respect to the underlying chain. The measure involves first interpreting a rank-based representation of the nearest neighbors of each residue as a permutation, then determining how perturbed this permutation is relative to an unfolded chain. We show that the resultant topology provides improved association with folding and unfolding rates determined for a set of two-state proteins under standardized conditions. Furthermore, unlike existing topologies, the proposed geometry exhibits fine scale structure with respect to sequence position along the chain, potentially providing insights...

On E-values for Tandem MS Scoring Schemes

Tue, 12 Aug 2008 00:00:00 +0000

In a recent article in this journal, Khatun, Hamlett, and Giddings (2008) (KHG) advance a new scoring scheme for use in conjunction with tandem mass spectrometry (MS/MS) based peptide identification. As they note, such identifications are fundamental to much proteomics research but, due to MS/MS data complexity and the scale of attendant database searches, their accuracy is limited. The scoring technique they propose, which employs a hidden Markov model (HMM) over a set of states that represent key features of MS/MS data, is convincingly motivated and exhibits good performance. The purpose of this brief note is to critique the method chosen for calibrating the HMM scores, rather than the genesis of the scores themselves.

Re-Cracking the Nucleosome Positioning Code

Tue, 12 Aug 2008 00:00:00 +0000

Nucleosomes, the fundamental repeating subunits of all eukaryotic chromatin, are responsible for packaging DNA into chromosomes inside the cell nucleus and controlling gene expression. While it has been well established that nucleosomes exhibit higher affinity for select DNA sequences, until recently it was unclear whether such preferences exerted a significant, genome-wide effect on nucleosome positioning in vivo. This question was seemingly and recently resolved in the affirmative: a wide-ranging series of experimental and computational analyses provided extensive evidence that the instructions for wrapping DNA around nucleosomes are contained in the DNA itself. This subsequently labelled second genetic code was based on data-driven, structural, and biophysical considerations. It was subjected to an extensive suite of validation procedures, with one conclusion being that intrinsic, genome-encoded, nucleosome organization explains ~50% of in vivo nucleosome positioning. Here,...

Selective Genotyping and Phenotyping Strategies in a Complex Trait Context

Mon, 11 Aug 2008 00:00:00 +0000

Selective genotyping and phenotyping strategies can reduce the cost of QTL (quantitative trait loci) experiments. We analyze selective genotyping and phenotyping strategies in the context of multi-locus models, and non-normal phenotypes. Our approach is based on calculations of the expected information of the experiment under different strategies. Our central conclusions are the following. (1) Selective genotyping is effective for detecting linked and epistatic QTL as long as no locus has a large effect. When one or more loci have large effects, the effectiveness of selective genotyping is unpredictable – it may be heightened or diminished relative to the small effects case. (2) Selective phenotyping efficiency decreases as the number of unlinked loci used for selection increases, and approaches random selection in the limit. However, when phenotyping is expensive, and a small fraction can be phenotyped, the efficiency of selective phenotyping is high compared to random sampling,...

A multi-array multi-SNP genotyping algorithm for Affymetrix SNP microarrays

Wed, 11 Jul 2007 00:00:00 +0000

Motivation: Modern strategies for mapping disease loci require efficient genotyping of a large number of known polymorphic sites in the genome. The sensitive and high-throughput nature of hybridization-based DNA microarray technology provides an ideal platform for such an application by interrogating up to hundreds of thousands of single nucleotide polymorphisms (SNPs) in a single assay. Similar to the development of expression arrays, these genotyping arrays pose many data analytic challenges that are often platform specific. Affymetrix SNP arrays, for example, use multiple sets of short oligonucleotide probes for each known SNP, and require effective statistical methods to combine these probe intensities in order to generate reliable and accurate genotype calls.

Results: We developed an integrated multi-SNP, multi-array genotype calling algorithm for Affymetrix SNP arrays, MAMS, that combines single-array, multi-SNP (SAMS) and multi-array, single-SNP (MASS) calls to improve...

Validation in Genomics: CpG Island Methylation Revisited

Wed, 10 Jan 2007 00:00:00 +0000

In a recent article in PLoS Genetics, Bock et al. (2006) undertake an extensive computational epigenetics analysis of the ability of DNA sequence-derived features, capturing attributes such as tetramer frequencies, repeats and predicted structure, to predict the methylation status of CpG islands. Their suite of analyses appears highly rigorous with regard to accompanying validation procedures, employing stringent Bonferroni corrections, stratified cross-validation, and follow-up experimental verification. Here, however, we showcase concerns with the validation steps, in part ascribable to the genome scale of the investigation, that serve as a cautionary note and indicate the heightened need for careful selection of analytic and companion validation methods. A series of new analyses of the same CpG island methylation data helps illustrate these issues, not just for this particular study, but also analogous investigations involving high-dimensional predictors with complex between-feature...

R/qtlDesign: Inbred Line Cross Experimental Design

Wed, 1 Nov 2006 00:00:00 +0000

An investigator planning a QTL (quantitative trait locus) experiment has to choose which strains to cross, the type of cross, genotyping strategies, and the number of progeny to raise and phenotype. To help make such choices, we have developed an interactive program for power and sample size calculations for QTL experiments, R/qtlDesign. Our software includes support for selective genotyping strategies, variable marker spacing, and tools to optimize information content subject to cost constraints, for backcross, intercross, and recombinant inbred lines from two parental strains. We review the impact of experimental design choices on the variance attributable to a segregating locus, the residual error variance, and the effective sample size. We give examples of software usage in real-life settings. The software is available at http://www.biostat.ucsf.edu/sen/software.html.

Chess, Chance and Conspiracy

Mon, 28 Aug 2006 00:00:00 +0000

Chess and chance are seemingly strange bedfellows. Luck and/or randomness have no apparent role in move selection when the game is played at the highest levels. However, when competition is at the ultimate level, that of the World Chess Championship (WCC), chess and conspiracy are not strange bedfellows, there being a long and colorful history of accusations levied between participants. One such accusation, frequently repeated, was that all the games in the 1985 WCC (Karpov vs Kasparov) were fixed and pre-arranged move-by-move. That this claim was advanced by a former World Champion, Bobby Fischer, argues that it ought to be investigated. That the only published, concrete basis for this claim consists of an observed run of particular moves allows this investigation to be performed using probabilistic and statistical methods. In particular, we employ imbedded finite Markov chains to evaluate run statistic distributions. Further, we demonstrate how both chess computers and game...
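
The idea of evaluating a run statistic with an imbedded finite Markov chain can be illustrated with a minimal sketch. It assumes independent trials with a fixed success probability p, which is an illustrative simplification and not the model described in the abstract; the function name prob_run_at_least is hypothetical. The chain's state is the length of the current success run, and an absorbing state records that a run of length k has occurred.

```python
def prob_run_at_least(n, k, p):
    """P(at least one run of k consecutive successes in n independent trials).

    Assumption (illustrative only): trials are independent with success
    probability p. State i (0 <= i < k) is the current run length; state k
    is absorbing and means a run of length k has already been seen.
    """
    if k <= 0:
        return 1.0
    # dist[i] = probability of being in state i after the trials seen so far
    dist = [1.0] + [0.0] * k
    for _ in range(n):
        new = [0.0] * (k + 1)
        new[k] = dist[k]                  # absorbing state keeps its mass
        for i in range(k):
            new[i + 1] += dist[i] * p     # a success extends the current run
            new[0] += dist[i] * (1 - p)   # a failure resets the run to zero
        dist = new
    return dist[k]
```

Comparing an observed run length with this distribution gives a tail probability under the assumed chance model; a realistic analysis would need move probabilities that depend on the position.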

Prediction of Genomewide Conserved Epitope Profiles of HIV-1: Classifier Choice and Peptide Representation

Fri, 7 Jul 2006 00:00:00 +0000

Identification of peptides binding to Major Histocompatibility Complex (MHC) molecules is important for accelerating vaccine development and improving immunotherapy. Accordingly, a wide variety of prediction methods have been applied in this context. In this paper, we introduce (tree-based) ensemble classifiers for such problems and contrast their predictive performance with forefront existing methods for both MHC class I and class II molecules. In addition, we investigate the impact of differing peptide representation schemes on performance. Finally, classifier predictions are used to conduct genomewide scans of a diverse collection of HIV-1 strains, enabling assessment of epitope conservation. We investigated all combinations of six classification methods (classification trees, artificial neural networks, support vector machines, as well as the more recently devised ensemble methods (bagging, random forests, boosting)) with four peptide representation schemes (amino acid...

Cluster Computing: When Many Hands Make Light Work

Fri, 15 Jul 2005 00:00:00 +0000

Many computations in biomedical research, such as simulations, bootstrapping, database searches (such as BLAST), and many Monte Carlo algorithms, are embarrassingly parallel. This means that the computation can be split up into smaller computations, each of which can be performed in a parallel thread that does not need to interact with the others. Computations with this feature can be easily distributed (that is, run on different computer processors), with a gain in speed that is approximately proportional to the number of processors. In this note we introduce some of the concepts behind distributed computing, give examples where it has been used, and lay out scenarios where it may be useful for biomedical researchers in the future.
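
As a minimal sketch of an embarrassingly parallel computation, the bootstrap below runs each replicate as an independent task with its own seeded random generator, so no task needs to communicate with any other. The function names and the choice of the sample mean as the statistic are illustrative assumptions, not taken from the note.

```python
import random
import statistics
from concurrent.futures import ThreadPoolExecutor


def bootstrap_mean(task):
    """One independent task: resample the data and return the resample mean."""
    data, seed = task
    rng = random.Random(seed)            # private generator, no shared state
    resample = [rng.choice(data) for _ in data]
    return statistics.fmean(resample)


def parallel_bootstrap(data, n_replicates, max_workers=4):
    """Run the replicates as independent tasks and collect results in order."""
    tasks = [(data, seed) for seed in range(n_replicates)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(bootstrap_mean, tasks))
```

Threads are used here only to keep the sketch short. In standard CPython, CPU-bound pure-Python work generally needs separate processes (for example concurrent.futures.ProcessPoolExecutor) or separate machines to obtain the near-proportional speedup described above.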

Stepwise Normalization of Two-Channel Spotted Microarrays

Thu, 24 Mar 2005 00:00:00 +0000

Intensity measurements of spotted microarrays embody many undesirable systematic variations. Very commonly, varying amounts and types of such variations are observed in different arrays. Although various normalization methods have been proposed to remove such systematic effects, it has not been well studied how to assess or select the most appropriate method for different arrays and data sets. To address this issue, we present a novel normalization technique, STEPNORM, for data-dependent and adaptive normalization of two-channel spotted microarrays. STEPNORM performs a stepwise interrogation of a range of different normalization models and selects the appropriate method based on formal model selection criteria. In addition, we evaluate the effectiveness of STEPNORM and other commonly used normalization methods utilizing a set of specially constructed splicing arrays.

Identifying differentially expressed genes from microarray experiments via statistic synthesis

Thu, 24 Mar 2005 00:00:00 +0000

Motivation: A common objective of microarray experiments is the detection of differential gene expression between samples obtained under different conditions. The task of identifying differentially expressed genes consists of two aspects: ranking and selection. Numerous statistics have been proposed to rank genes in order of evidence for differential expression. However, no one statistic is universally optimal and there is seldom any basis or guidance that can direct toward a particular statistic of choice.

Results: Our new approach, which addresses both ranking and selection of differentially expressed genes, integrates differing statistics via a distance synthesis scheme. Using a set of (Affymetrix) spike-in data sets, in which differentially expressed genes are known, we demonstrate that our method compares favorably with the best individual statistics, while achieving robustness properties lacked by the individual statistics. We further evaluate performance on one...

Analysis of a Splice Array Experiment Elucidates Roles of Chromatin Elongation Factor Spt4-5 in Splicing

Thu, 24 Mar 2005 00:00:00 +0000

Background: Splicing is important for regulation of gene expression in eukaryotes, and it has important functional links to other steps of gene expression. Two examples of these linkages include Ceg1, a component of the mRNA capping enzyme, and the chromatin elongation factors Spt4-5, both of which have recently been shown to play a role in the normal splicing of several genes in the yeast S. cerevisiae.

Principal Findings: Using a genomic approach to characterize the roles of Spt4-5 in splicing, we extended our observations of splicing defects in ceg1, spt4 and spt5 mutants to the entire collection of intron-containing genes, employing splicing-sensitive DNA microarrays. In the context of the complex experiment design, highlighted by 22 dye-swap array hybridizations comprised of both biological and technical replications, we applied four ANOVA mixed models and a semiparametric hierarchical mixture model. To refine selection of differentially expressed genes whose...

QTL Study Design from an Information Perspective

Wed, 2 Feb 2005 00:00:00 +0000

We examine the efficiency of different genotyping and phenotyping strategies in inbred line crosses from an information perspective. This provides a mathematical framework for the statistical aspects of QTL experimental design, while guiding our intuition. Our central result is a simple formula that quantifies the fraction of missing information of any genotyping strategy in a backcross. It includes the special case of selectively genotyping only the phenotypic extreme individuals. The formula is a function of the square of the phenotype, and the uncertainty in our knowledge of the genotypes at a locus. This result is used to answer a variety of questions. First, we examine the cost-information tradeoff varying the density of markers, and the proportion of extreme phenotypic individuals genotyped. Then we evaluate the information content of selective phenotyping designs and the impact of measurement error in phenotyping. A simple formula quantifies the information content of...

Microarray Gene Expression Data with Linked Survival Phenotypes: Diffuse Large-B-Cell Lymphoma Revisited

Tue, 25 Jan 2005 00:00:00 +0000

Diffuse large-B-cell lymphoma (DLBCL) is an aggressive malignancy of mature B lymphocytes and is the most common type of lymphoma in adults. While treatment advances have been substantial in what was formerly a fatal disease, less than 50% of patients achieve lasting remission. In an effort to predict treatment success and explain disease heterogeneity, clinical features have been employed for prognostic purposes, but have yielded only modest predictive performance. This has spawned a series of high profile microarray-based gene expression studies of DLBCL, in the hope that molecular level information could be used to refine prognosis. The intent of this paper is to reevaluate these microarray-based prognostic assessments, and extend the statistical methodology that has been used in this context.

Methodological challenges arise in using patients’ gene expression profiles to predict survival endpoints on account of the large number of genes and their complex interdependence....

Functional Empirical Bayes Methods for Identifying Genes with Different Time-course Expression Profiles

Wed, 10 Nov 2004 00:00:00 +0000

Time course studies of gene expression are essential in biomedical research to understand biological phenomena that evolve in a temporal fashion. Microarray technology makes it possible to study genome-wide temporal differences in gene expression profiles between different experimental conditions/groups. In this paper, we introduce a functional hierarchical model and empirical Bayes approach to model gene expression trajectories over time and to detect temporally differentially expressed (TDE) genes. A Monte Carlo EM algorithm is developed for estimating both the gene-specific parameters and the hyperparameters. We use the posterior probability based false discovery rate (FDR) criterion to identify the TDE genes in order to control the overall FDR. We illustrate the methods by using both simulated data sets and a data set from a microarray based gene expression time course study of C. elegans developmental processes. Simulation results suggested that the procedure has low...

Machine Learning Benchmarks and Random Forest Regression

Wed, 14 Apr 2004 00:00:00 +0000

Breiman (2001a,b) has recently developed an ensemble classification and regression approach that displayed outstanding performance with regard to prediction error on a suite of benchmark datasets. As the base constituents of the ensemble are tree-structured predictors, and since each of these is constructed using an injection of randomness, the method is called ‘random forests’. That the exceptional performance is attained with seemingly only a single tuning parameter, to which sensitivity is minimal, makes the methodology all the more remarkable. The individual trees comprising the forest are all grown to maximal depth. While this helps with regard to bias, there is the familiar tradeoff with variance. However, these variability concerns were potentially obscured because of an interesting feature of those benchmarking datasets extracted from the UCI machine learning repository for testing: all these datasets are hard to overfit using tree-structured methods. This raises issues about...

Penalized Cox Regression Analysis in the High-Dimensional and Low-sample Size Settings, with Applications to Microarray Gene Expression Data

Fri, 19 Mar 2004 00:00:00 +0000

An important application of microarray technology is to relate gene expression profiles to various clinical phenotypes of patients. Success has been demonstrated in molecular classification of cancer in which the gene expression data serve as predictors and different types of cancer serve as a categorical outcome variable. However, there has been less research in linking gene expression profiles to censored survival data such as patients' overall survival time or time to cancer relapse. Due to the large variability among patients in time to a given clinical event, studying possibly censored survival phenotypes can be more informative than treating the phenotypes as categorical variables. We propose to use the L1 penalized estimation for the Cox model to select genes that are relevant to patients' survival and to build a predictive model for future prediction. The computational difficulty associated with the estimation in the high-dimensional and low-sample size settings can...

Relating HIV-1 Sequence Variation to Replication Capacity via Trees and Forests

Thu, 11 Mar 2004 00:00:00 +0000

The problem of relating genotype (as represented by amino acid sequence) to phenotypes is distinguished from standard regression problems by the nature of sequence data. Here we investigate an instance of such a problem where the phenotype of interest is HIV-1 replication capacity and contiguous segments of protease and reverse transcriptase sequence constitute the genotype. A variety of data analytic methods have been proposed in this context. Shortcomings of select techniques are contrasted with the advantages afforded by tree-structured methods. However, tree-structured methods, in turn, have been criticized on grounds of only enjoying modest predictive performance. A number of ensemble approaches (bagging, boosting, random forests) have recently emerged, devised to overcome this deficiency. We evaluate random forests as applied in this setting, and detail why prediction gains obtained in other situations are not realized. Other approaches including logic regression, support...

Partial Cox Regression Analysis for High-Dimensional Microarray Gene Expression Data

Mon, 1 Mar 2004 00:00:00 +0000

An important application of microarray technology is to predict various clinical phenotypes based on the gene expression profile. Success has been demonstrated in molecular classification of cancer in which different types of cancer serve as a categorical outcome variable. However, there has been less research in linking gene expression profiles to censored survival outcomes such as patients' overall survival time or time to cancer relapse. In this paper, we develop a partial Cox regression method for constructing mutually uncorrelated components based on microarray gene expression data for predicting the survival of future patients. The proposed partial Cox regression method involves constructing predictive components by repeated least squares fitting of residuals and Cox regression fitting. The key difference from the standard principal components Cox regression analysis is that in constructing the predictive components, our method utilizes the observed survival/censoring information....

Dimension Reduction Methods for Microarrays with Application to Censored Survival Data

Tue, 24 Feb 2004 00:00:00 +0000

Recent research has shown that gene expression profiles can potentially be used for predicting phenotypes such as cancer types and survival time in biomedical research. Microarray technology, which simultaneously measures expression values of thousands of genes, provides a powerful tool as well as new challenges in relating gene expression profiles to phenotypes. Expression data are often very high-dimensional, which makes statistical modeling more difficult and complex, especially when phenotypes such as time to death or cancer recurrence are subject to right censoring. We consider in this paper a model-free sufficient dimension reduction technique to reduce the dimension of microarray data in the context of analyzing censored survival data. We propose a dimension reduction technique which does not assume a particular model for survival time given gene expression values. After dimension reduction, the constructed gene expression components are used as covariates for predicting...

A Hidden Markov Modeling Approach for Admixture Mapping Based on Case-Control Haplotype Data

Thu, 11 Dec 2003 00:00:00 +0000

Admixture mapping is potentially a powerful method for mapping genes for complex human diseases, when the disease frequency due to a particular disease susceptibility gene is different between founding populations of different ethnicity. The method tests for genetic linkage by detecting association of the allele ancestry with the disease. Since the markers used to define ancestral populations are not fully informative for the ancestry status, a direct test of such association is not possible. In this paper, we develop a hidden Markov model (HMM) framework for estimating the unobserved ancestry haplotypes across a chromosomal region based on marker haplotypes. The HMM efficiently utilizes all the marker data to infer the latent ancestry states at the putative disease locus. In this modeling framework, we consider a likelihood based approach for detecting genetic linkage based on case-control data. We evaluate by simulations how several factors affect the power of admixture mapping,...

Ascertainment-Adjusted Maximum Likelihood Estimation for the Additive Genetic Gamma Frailty Model

Thu, 11 Dec 2003 00:00:00 +0000

The additive genetic gamma frailty model has been proposed for genetic linkage analysis for complex diseases to account for variable age of onset and possible covariate effects. To avoid ascertainment biases in parameter estimates, retrospective likelihood ratio tests are often used, which may result in loss of efficiency due to conditioning. This paper considers the case in which sibships are ascertained by having at least two sibs affected with the disease before a given age, and provides two approaches for estimating the parameters in the additive gamma frailty model. One approach is based on the likelihood function conditioning on the ascertainment event; the other is based on maximizing a full ascertainment-adjusted likelihood. Explicit forms for these likelihood functions are derived. Simulation studies indicate that when the baseline hazard function can be correctly pre-specified, both approaches give accurate estimates of the model parameters. However, when the baseline hazard...

A note on the mating scheme used by the Mutagenesis Project

Wed, 28 May 2003 00:00:00 +0000

This is a short note on the mating scheme used by the Mutagenesis Project. We present a probabilistic analysis of the distribution of the number of mutants in the G3 generation. It is shown to be a function of the number of G2 mothers and litter sizes. A computer program is provided to make the calculation. We quantify the odds of a G2 mother being a mutation carrier given that none of its progeny are mutants. Finally we analyze some data from the project; we find the data to be consistent with theory.
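
The carrier calculation mentioned above can be sketched with Bayes' rule. The sketch rests on loudly stated assumptions that are not taken from the note: the mutation is recessive, a G2 mother is a carrier with prior probability 1/2, a carrier mother yields a mutant G3 pup with probability 1/4 independently per pup, and a non-carrier mother never yields a mutant. The function name carrier_posterior is hypothetical, and this is not the computer program the note refers to.

```python
def carrier_posterior(n_progeny, prior=0.5, p_mutant=0.25):
    """P(G2 mother is a carrier | none of her n_progeny G3 pups is a mutant).

    Assumptions (illustrative, not from the note): prior carrier probability
    1/2, mutant probability 1/4 per pup for a carrier mother, independence
    between pups, and no mutants from a non-carrier mother.
    """
    like_carrier = (1 - p_mutant) ** n_progeny   # carrier, yet no mutants seen
    like_noncarrier = 1.0                        # non-carrier never gives mutants
    numerator = prior * like_carrier
    return numerator / (numerator + (1 - prior) * like_noncarrier)
```

Under these assumptions the posterior starts at the prior when there are no pups and shrinks toward zero as more non-mutant pups are observed; the corresponding odds are the posterior divided by one minus the posterior.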

An algorithm for detecting phenotypic mutants for the JAX neuroscience mutagenesis facility

Wed, 28 May 2003 00:00:00 +0000

The Mutagroup at Jackson Labs is interested in generating new mouse models for studying neurological disease by producing mutations in mice by injecting them with ENU. The group proposes to produce large numbers of potential mutants and screen them for phenotypic anomalies. In this report we propose a statistical algorithm to flag phenotypic deviants. We have applied the algorithm to a pilot data set collected by Dr. Kevin Seburn on mice placed in cages equipped with monitoring devices. Aiming for a 5% false positive rate, the algorithm was able to detect 18 of the 27 mutant mice it was presented.

Predicting Progress in Shotgun Sequencing with Paired Ends

Wed, 28 May 2003 00:00:00 +0000

Paired-end shotgun sequencing has become widely used for large-scale sequencing projects in recent years, including whole genome shotgun sequencing and map-based BAC clone sequencing. Under this scheme, sequences from both ends of random clones are determined and assembled into sequence contigs. The sequence data and their linking information are used to construct clone maps in the form of scaffolds. In order to plan a cost-effective sequencing project utilizing such an approach, it is crucial to have knowledge of the expected project progress in relation to parameters such as insert size, clone lengths and redundancy. There has been a lack of theoretical analysis for the paired-end sequencing strategy due to the difficulty of correlated ends. Here we present a mathematical analysis for the progress of a sequencing project employing such a scheme. Formulae for various measures of the expected progress such as expected number and size of scaffolds are derived and assessed by...

Regression Approaches for Microarray Data Analysis

Wed, 26 Mar 2003 00:00:00 +0000

A variety of new procedures have been devised to handle the two sample comparison (e.g., tumor versus normal tissue) of gene expression values as measured with microarrays. Such new methods are required in part because of some defining characteristics of microarray-based studies: (i) the very large number of genes contributing expression measures which far exceeds the number of samples (observations) available, and (ii) the fact that by virtue of pathway/network relationships, the gene expression measures tend to be highly correlated. These concerns are exacerbated in the regression setting, where the objective is to relate gene expression, simultaneously for multiple genes, to some external outcome or phenotype. Correspondingly, several methods have been recently proposed for addressing these issues. We briefly critique some of these methods prior to a detailed evaluation of gene harvesting. This reveals that gene harvesting, without additional constraints, can yield artifactual...

Clustering of translocation breakpoints

Wed, 26 Mar 2003 00:00:00 +0000

Translocation, a physical movement of genetic material from one chromosome to another, can aberrantly juxtapose portions of two cellular genes. This type of fusion may disrupt cellular function by producing novel, biologically-active fused genes, or by the activation of normally quiescent growth-associated genes. Either of these mechanisms provides a putative oncogenic stimulus and, indeed, several gene fusions from translocations have been identified in leukemias, lymphomas, and sarcomas. While the biological activity of the oncogenic effects of genes involved in translocations is under intensive study, little is known regarding the formation of translocation fusions themselves. The locations of these fusions are typically independent of the resultant oncogenic protein as long as they take place within certain bounded regions within the genes. Because of this independence, a patterned, in particular clustered, distribution of fusion breakpoints within a given region will potentially...

Relating amino acid sequence to phenotype: Analysis of peptide binding data

Wed, 26 Mar 2003 00:00:00 +0000

We illustrate data analytic concerns that arise in the context of relating “genotype”, as represented by amino acid sequence, to phenotypes (outcomes). The present application examines whether peptides that bind to a particular major histocompatibility complex (MHC) class I molecule have characteristic amino acid sequences. However, the concerns identified and addressed are considerably more general. It is recognized that simple rules for predicting binding based solely on preferences for specific amino acids in certain (anchor) positions of the peptide’s amino acid sequence are generally inadequate and that binding is potentially influenced by all sequence positions as well as between-position interactions. The desire to elucidate these more complex prediction rules has spawned various modeling attempts, the shortcomings of which provide motivation for the methods adopted here. Because of (i) this need to model between-position interactions, (ii) amino acids constituting a...