eScholarship
Open Access Publications from the University of California

Department of Biostatistics

Open Access Policy Deposits

This series is automatically populated with publications deposited by UCLA Fielding School of Public Health Department of Biostatistics researchers in accordance with the University of California’s open access policies. For more information see Open Access Policy Deposits and the UC Publication Management System.

Applications of nature-inspired metaheuristic algorithms for tackling optimization problems across disciplines.

(2024)

Nature-inspired metaheuristic algorithms are important components of artificial intelligence, and are increasingly used across disciplines to tackle various types of challenging optimization problems. This paper demonstrates the usefulness of such algorithms for solving a variety of challenging optimization problems in statistics using a nature-inspired metaheuristic algorithm called competitive swarm optimizer with mutated agents (CSO-MA). This algorithm was proposed by one of the authors, and its superior performance relative to many of its competitors was demonstrated in earlier work and is demonstrated again in this paper. The main goal of this paper is to show that a typical nature-inspired metaheuristic algorithm, like CSO-MA, is efficient for tackling many different types of optimization problems in statistics. Our applications are new and include finding maximum likelihood estimates of parameters in a single-cell generalized trend model to study pseudotime in bioinformatics, estimating parameters in the commonly used Rasch model in education research, finding M-estimates for a Cox regression in a Markov renewal model, performing matrix completion tasks to impute missing data in a two-compartment model, and selecting variables optimally in an ecology problem in China. To further demonstrate the flexibility of metaheuristics, we also find an optimal design for a car refueling experiment in the auto industry using a logistic model with multiple interacting factors. In addition, we show that metaheuristics can sometimes outperform optimization algorithms commonly used in statistics.
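
The core competitive-swarm mechanics are compact enough to sketch. Below is a minimal, illustrative Python implementation of a competitive swarm optimizer with a simple mutation step, based on the general CSO idea (pairwise competitions in which losers learn from winners) rather than on the authors' actual CSO-MA code; all function names and parameter choices here are our own assumptions.

```python
import numpy as np

def cso_ma(f, dim, n_agents=40, n_iter=200, bounds=(-5.0, 5.0),
           phi=0.1, p_mutate=0.05, seed=0):
    """Minimal competitive-swarm optimizer with mutated agents (illustrative sketch).

    f: objective to minimize, mapping a 1-D array of length `dim` to a float.
    """
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    x = rng.uniform(lo, hi, size=(n_agents, dim))   # agent positions
    v = np.zeros((n_agents, dim))                   # agent velocities
    for _ in range(n_iter):
        fx = np.apply_along_axis(f, 1, x)
        mean_x = x.mean(axis=0)                     # swarm mean attracts losers
        perm = rng.permutation(n_agents)
        for i, j in zip(perm[0::2], perm[1::2]):    # random pairwise competitions
            win, lose = (i, j) if fx[i] <= fx[j] else (j, i)
            r1, r2, r3 = rng.random((3, dim))
            # only the loser moves: toward the winner and the swarm mean
            v[lose] = r1 * v[lose] + r2 * (x[win] - x[lose]) + phi * r3 * (mean_x - x[lose])
            x[lose] = np.clip(x[lose] + v[lose], lo, hi)
        # mutation step: re-seed a few randomly chosen coordinates of random agents
        mask = rng.random((n_agents, dim)) < p_mutate
        x[mask] = rng.uniform(lo, hi, size=mask.sum())
    fx = np.apply_along_axis(f, 1, x)
    best = fx.argmin()
    return x[best], fx[best]

# Example: minimize the sphere function in 10 dimensions.
xbest, fbest = cso_ma(lambda z: float(np.sum(z**2)), dim=10)
```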

The genomic evolutionary dynamics and global circulation patterns of respiratory syncytial virus.

(2024)

Respiratory syncytial virus (RSV) is a leading cause of acute lower respiratory tract infection in young children and the second leading cause of infant death worldwide. While global circulation has been extensively studied for respiratory viruses such as seasonal influenza, and more recently also in great detail for SARS-CoV-2, a lack of global multi-annual sampling of complete RSV genomes limits our understanding of RSV molecular epidemiology. Here, we capitalise on the genomic surveillance by the INFORM-RSV study and apply phylodynamic approaches to uncover how selection and neutral epidemiological processes shape RSV diversity. Using complete viral genome sequences, we show similar patterns of site-specific diversifying selection among RSVA and RSVB and recover the imprint of non-neutral epidemic processes on their genealogies. Using a phylogeographic approach, we provide evidence for air travel governing the global patterns of RSVA and RSVB spread, which results in a considerable degree of phylogenetic mixing across countries. Our findings highlight the potential of systematic global RSV genomic surveillance for transforming our understanding of global RSV spread.

Scalable gradients enable Hamiltonian Monte Carlo sampling for phylodynamic inference under episodic birth-death-sampling models.

(2024)

Birth-death models play a key role in phylodynamic analysis for their interpretation in terms of key epidemiological parameters. In particular, models with piecewise-constant rates varying at different epochs in time, which we refer to as episodic birth-death-sampling (EBDS) models, are valuable for their reflection of changing transmission dynamics over time. A persistent challenge with current time-varying model inference procedures, however, is their lack of computational efficiency. This limitation hinders the full utilization of these models in large-scale phylodynamic analyses, especially when dealing with high-dimensional parameter vectors that exhibit strong correlations. We present here a linear-time algorithm to compute the gradient of the birth-death model sampling density with respect to all time-varying parameters, and we implement this algorithm within a gradient-based Hamiltonian Monte Carlo (HMC) sampler to alleviate the computational burden of conducting inference under a wide variety of structures of, as well as priors for, EBDS processes. We assess this approach using three real-world data examples: the HIV epidemic in Odesa, Ukraine; seasonal influenza A/H3N2 virus dynamics in New York state, USA; and the Ebola outbreak in West Africa. HMC sampling exhibits a substantial efficiency boost, delivering a 10- to 200-fold increase in minimum effective sample size per unit time in comparison to a Metropolis-Hastings-based approach. Additionally, we show the robustness of our implementation both in allowing for flexible prior choices and in modeling the transmission dynamics of various pathogens by accurately capturing the changing trend of the viral effective reproductive number.
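
The efficiency argument rests on pairing a fast gradient with HMC. As a hedged illustration of the sampler side only, here is a generic HMC step with leapfrog integration in Python; it stands in for any model whose log-density and gradient are available, not for the EBDS-specific gradient derived in the paper, and the function and argument names are ours.

```python
import numpy as np

def hmc_step(theta, log_post, grad_log_post, step_size=0.05, n_leapfrog=20, rng=None):
    """One Hamiltonian Monte Carlo step (generic sketch, not the paper's EBDS sampler)."""
    rng = rng or np.random.default_rng()
    p = rng.standard_normal(theta.shape)            # sample momentum
    theta_new, p_new = theta.copy(), p.copy()
    # leapfrog integration of the Hamiltonian dynamics
    p_new += 0.5 * step_size * grad_log_post(theta_new)
    for _ in range(n_leapfrog - 1):
        theta_new += step_size * p_new
        p_new += step_size * grad_log_post(theta_new)
    theta_new += step_size * p_new
    p_new += 0.5 * step_size * grad_log_post(theta_new)
    # Metropolis accept/reject on the Hamiltonian (potential + kinetic energy)
    h_old = -log_post(theta) + 0.5 * p @ p
    h_new = -log_post(theta_new) + 0.5 * p_new @ p_new
    return theta_new if np.log(rng.random()) < h_old - h_new else theta

# Example: sample a 5-dimensional standard normal.
rng = np.random.default_rng(1)
theta = np.zeros(5)
draws = []
for _ in range(1000):
    theta = hmc_step(theta, lambda t: -0.5 * t @ t, lambda t: -t, rng=rng)
    draws.append(theta)
```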

Interaction models matter: an efficient, flexible computational framework for model-specific investigation of epistasis.

(2024)

PURPOSE: Epistasis, the interaction between two or more genes, is integral to the study of genetics and is present throughout nature. Yet it is seldom fully explored, as most approaches primarily focus on single-locus effects, partly because analyzing all pairwise and higher-order interactions requires significant computational resources. Furthermore, existing methods for epistasis detection only consider a Cartesian (multiplicative) model for interaction terms. This is likely limiting, as epistatic interactions can evolve to produce varied relationships between genetic loci, some complex and not linearly separable. METHODS: We present new algorithms for computing the interaction coefficients of standard regression models for epistasis; these algorithms accommodate many varied models for the interaction terms between loci and use memory efficiently. The algorithms are given for two-way and three-way epistasis and may be generalized to higher-order epistasis. Statistical tests for the interaction coefficients are also provided. We also present an efficient matrix-based algorithm for permutation testing for two-way epistasis. We offer a proof and experimental evidence that methods that look for epistasis only at loci that have main effects may not be justified. Given the computational efficiency of the algorithm, we applied the method to a rat data set and a mouse data set, each with at least 10,000 loci and 1,000 samples, using the standard Cartesian model and the XOR model to explore body mass index. RESULTS: This study reveals that although many of the loci found to exhibit significant statistical epistasis overlap between models in rats, the pairs are mostly distinct. Further, the XOR model found greater evidence for statistical epistasis in many more pairs of loci in both data sets, with almost all significant epistasis in mice identified using XOR. In the rat data set, loci involved in epistasis under the XOR model are enriched for biologically relevant pathways. CONCLUSION: Our results in both species show that many biologically relevant epistatic relationships would have gone undetected if only one interaction model were applied, providing evidence that varied interaction models should be implemented to explore the epistatic interactions that occur in living systems.
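
To make the distinction between interaction models concrete, the toy sketch below contrasts a Cartesian (multiplicative) interaction term with an XOR-style term for two loci coded 0/1/2, and shows why a Cartesian-only analysis can miss an XOR-type signal. The encodings and the statsmodels-based test are our own illustrative choices, not the authors' algorithms or statistical tests.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 2000
g1 = rng.integers(0, 3, n)          # genotypes at locus 1, coded 0/1/2
g2 = rng.integers(0, 3, n)          # genotypes at locus 2

# Two candidate interaction encodings (toy versions; the paper's models may differ):
cart = g1 * g2                                       # Cartesian / multiplicative term
xor = (g1 > 0).astype(int) ^ (g2 > 0).astype(int)    # XOR of carrier status

# Simulate a phenotype driven by an XOR-type interaction with no main effects.
y = 0.5 * xor + rng.standard_normal(n)

def interaction_pvalue(term):
    """Fit y ~ g1 + g2 + interaction term and return the interaction p-value."""
    X = sm.add_constant(np.column_stack([g1, g2, term]))
    fit = sm.OLS(y, X).fit()
    return fit.pvalues[-1]

print("Cartesian model p:", interaction_pvalue(cart))   # typically non-significant
print("XOR model p:      ", interaction_pvalue(xor))    # typically highly significant
```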

Many-core algorithms for high-dimensional gradients on phylogenetic trees.

(2024)

MOTIVATION: Advancements in high-throughput genomic sequencing are delivering genomic pathogen data at an unprecedented rate, positioning statistical phylogenetics as a critical tool to monitor infectious diseases globally. This rapid growth spurs the need for efficient inference techniques, such as Hamiltonian Monte Carlo (HMC) in a Bayesian framework, to estimate parameters of these phylogenetic models, where the dimension of the parameters increases with the number of sequences N. HMC requires repeated calculation of the gradient of the data log-likelihood with respect to (wrt) all branch-length-specific (BLS) parameters, which traditionally takes O(N²) operations using the standard pruning algorithm. A recent study proposes an approach to calculate this gradient in O(N), enabling researchers to take advantage of gradient-based samplers such as HMC. The CPU implementation of this approach makes the calculation of the gradient computationally tractable for nucleotide-based models but falls short in performance for models with larger state spaces, such as Markov-modulated and codon models. Here, we describe novel massively parallel algorithms to calculate the gradient of the log-likelihood wrt all BLS parameters that take advantage of graphics processing units (GPUs) and achieve many-fold speedups over previous CPU implementations. RESULTS: We benchmark these GPU algorithms on three computing systems using three evolutionary inference examples exploring complete genomes from 997 dengue viruses, 62 carnivore mitochondria and 49 yeasts, and observe a >128-fold speedup over the CPU implementation for codon-based models and a >8-fold speedup for nucleotide-based models. As a practical demonstration, we also estimate the timing of the first introduction of West Nile virus into the continental United States under a codon model with a relaxed molecular clock from 104 full viral genomes, an inference task previously intractable. AVAILABILITY AND IMPLEMENTATION: We provide an implementation of our GPU algorithms in BEAGLE v4.0.0 (https://github.com/beagle-dev/beagle-lib), an open-source library for statistical phylogenetics that enables parallel calculations on multi-core CPUs and GPUs. We employ this BEAGLE implementation through the Bayesian phylogenetics framework BEAST (https://github.com/beast-dev/beast-mcmc).
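
The calculus behind branch-length gradients is the identity dP/db = Q·exp(Qb) for a continuous-time Markov chain with rate matrix Q and transition matrix P(b) = exp(Qb). The sketch below verifies the resulting log-likelihood gradient for a single branch of a two-leaf Jukes-Cantor tree against a finite difference; it illustrates the identity only, not the paper's O(N) many-core algorithm, and the helper functions are our own.

```python
import numpy as np
from scipy.linalg import expm

# Jukes-Cantor rate matrix on {A, C, G, T}: off-diagonal 1/4, rows sum to zero.
Q = 0.25 * (np.ones((4, 4)) - 4 * np.eye(4))
pi = np.full(4, 0.25)                     # stationary distribution

def site_likelihood(b1, b2, a1, a2):
    """Likelihood of leaf states a1, a2 at the tips of a two-branch cherry."""
    P1, P2 = expm(Q * b1), expm(Q * b2)
    return float(pi @ (P1[:, a1] * P2[:, a2]))

def grad_b1(b1, b2, a1, a2):
    """Analytic d(log-likelihood)/d(b1), using dP/db = Q @ expm(Q*b)."""
    dP1 = Q @ expm(Q * b1)
    P2 = expm(Q * b2)
    return float(pi @ (dP1[:, a1] * P2[:, a2])) / site_likelihood(b1, b2, a1, a2)

b1, b2, a1, a2 = 0.3, 0.7, 0, 2           # branch lengths; leaf states A and G
eps = 1e-6
fd = (np.log(site_likelihood(b1 + eps, b2, a1, a2))
      - np.log(site_likelihood(b1 - eps, b2, a1, a2))) / (2 * eps)
print(grad_b1(b1, b2, a1, a2), fd)        # analytic and numerical gradients agree
```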

A Bayesian Hierarchical Spatial Longitudinal Model Improves Estimation of Local Macular Rates of Change in Glaucomatous Eyes.

(2024)

PURPOSE: To demonstrate that a novel Bayesian hierarchical spatial longitudinal (HSL) model improves estimation of local macular ganglion cell complex (GCC) rates of change compared to simple linear regression (SLR) and a conditional autoregressive (CAR) model. METHODS: We analyzed GCC thickness measurements within 49 macular superpixels in 111 eyes (111 patients) with four or more macular optical coherence tomography scans and two or more years of follow-up. We compared superpixel-patient-specific estimates and their posterior variances derived from the latest version of a recently developed Bayesian HSL model, CAR, and SLR. We performed a simulation study to compare the accuracy of intercept and slope estimates in individual superpixels. RESULTS: HSL identified a significantly higher proportion of significant negative slopes in 13/49 superpixels and a significantly lower proportion of significant positive slopes in 21/49 superpixels than SLR. In the simulation study, the median (tenth, ninetieth percentile) ratio of mean squared error of SLR [CAR] over HSL for intercepts and slopes were 1.91 (1.23, 2.75) [1.51 (1.05, 2.20)] and 3.25 (1.40, 10.14) [2.36 (1.17, 5.56)], respectively. CONCLUSIONS: A novel Bayesian HSL model improves estimation accuracy of patient-specific local GCC rates of change. The proposed model is more than twice as efficient as SLR for estimating superpixel-patient slopes and identifies a higher proportion of deteriorating superpixels than SLR while minimizing false-positive detection rates. TRANSLATIONAL RELEVANCE: The proposed HSL model can be used to model macular structural measurements to detect individual glaucoma progression earlier and more efficiently in clinical and research settings.
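
The intuition for why a hierarchical model beats per-superpixel SLR is classical shrinkage: pooling information across units stabilizes noisy unit-level slopes. The toy simulation below compares per-unit least-squares slopes with simple empirical-Bayes shrinkage toward the across-unit mean; it is a deliberately simplified stand-in for the paper's Bayesian HSL model, which additionally models spatial correlation across superpixels, and all quantities here are simulated.

```python
import numpy as np

rng = np.random.default_rng(2)
n_units, n_visits = 49, 6                      # e.g. superpixels and scans per eye
t = np.arange(n_visits, dtype=float)

true_slopes = rng.normal(-0.5, 0.3, n_units)   # unit-level rates of change
y = true_slopes[:, None] * t + rng.normal(0, 2.0, (n_units, n_visits))

# Per-unit ordinary least-squares slopes (the SLR comparator).
tc = t - t.mean()
sxx = tc @ tc
ols = (y - y.mean(axis=1, keepdims=True)) @ tc / sxx
se2 = 2.0**2 / sxx                             # sampling variance of each OLS slope
                                               # (noise SD of 2.0 is known by design)

# Empirical-Bayes shrinkage toward the across-unit mean slope.
tau2 = max(ols.var() - se2, 1e-8)              # estimated between-unit variance
w = tau2 / (tau2 + se2)
shrunk = w * ols + (1 - w) * ols.mean()

print("MSE, per-unit OLS :", np.mean((ols - true_slopes) ** 2))
print("MSE, shrunk slopes:", np.mean((shrunk - true_slopes) ** 2))
```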

Efficacy of Smoothing Algorithms to Enhance Detection of Visual Field Progression in Glaucoma.

(2024)

PURPOSE: To evaluate and compare the effectiveness of nearest neighbor (NN)- and variational autoencoder (VAE)-smoothing algorithms to reduce variability and enhance the performance of glaucoma visual field (VF) progression models. DESIGN: Longitudinal cohort study. SUBJECTS: 7150 eyes (4232 patients) with ≥ 5 years of follow-up and ≥ 6 visits. METHODS: Visual field thresholds were smoothed with the NN and VAE algorithms. The mean total deviation (mTD) and VF index rates, pointwise linear regression (PLR), permutation of PLR (PoPLR), and the glaucoma rate index were applied to the unsmoothed and smoothed data. MAIN OUTCOME MEASURES: The proportion of progressing eyes and the conversion to progression were compared between the smoothed and unsmoothed data. A simulation series of noiseless VFs with various patterns of glaucoma damage was used to evaluate the specificity of the smoothing models. RESULTS: The mean values of age and follow-up time were 62.8 (standard deviation: 12.6) years and 10.4 (standard deviation: 4.7) years, respectively. The proportion of progression was significantly higher for the NN and VAE smoothed data compared with the unsmoothed data. VF progression occurred significantly earlier with both smoothed data sets compared with unsmoothed data based on mTD rates, PLR, and PoPLR methods. The ability to detect the progressing eyes was similar for the unsmoothed and smoothed data in the simulated data. CONCLUSIONS: Smoothing VF data with NN and VAE algorithms improves the signal-to-noise ratio for detection of change, results in earlier detection of VF progression, and could help monitor glaucoma progression more effectively in the clinical setting. FINANCIAL DISCLOSURES: Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of this article.
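
As a rough illustration of the nearest-neighbor idea (not the study's actual NN or VAE algorithms), the sketch below spatially smooths a toy grid of visual-field thresholds by averaging each test point with its immediate neighbors before fitting pointwise linear regressions over visits; the grid size, noise level, and progression pattern are all invented for the example.

```python
import numpy as np
from scipy.ndimage import uniform_filter

rng = np.random.default_rng(3)
n_visits, rows, cols = 10, 8, 9          # toy 8x9 grid of VF test points

# Simulate thresholds (dB): a localized progressing region plus test-retest noise.
base = np.full((rows, cols), 30.0)
slopes = np.zeros((rows, cols))
slopes[2:5, 2:5] = -1.0                  # dB/visit loss in one region
vf = np.stack([base + slopes * v + rng.normal(0, 2.5, (rows, cols))
               for v in range(n_visits)])

# Nearest-neighbor smoothing: replace each point by the mean of its 3x3 neighborhood.
vf_smooth = np.stack([uniform_filter(v, size=3, mode="nearest") for v in vf])

def pointwise_slopes(series):
    """Pointwise linear regression: per-location slope across visits."""
    t = np.arange(n_visits, dtype=float)
    tc = t - t.mean()
    y = series.reshape(n_visits, -1)
    return ((y - y.mean(0)).T @ tc / (tc @ tc)).reshape(rows, cols)

raw, smooth = pointwise_slopes(vf), pointwise_slopes(vf_smooth)
# Smoothing trades a little spatial resolution for much less slope noise.
print(raw.std(), smooth.std())
```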

Statistical method scDEED for detecting dubious 2D single-cell embeddings and optimizing t-SNE and UMAP hyperparameters

(2024)

Two-dimensional (2D) embedding methods are crucial for single-cell data visualization. Popular methods such as t-distributed stochastic neighbor embedding (t-SNE) and uniform manifold approximation and projection (UMAP) are commonly used for visualizing cell clusters; however, it is well known that t-SNE and UMAP's 2D embeddings might not reliably inform the similarities among cell clusters. Motivated by this challenge, we present a statistical method, scDEED, for detecting dubious cell embeddings output by a 2D-embedding method. By calculating a reliability score for every cell embedding based on the similarity between the cell's 2D-embedding neighbors and pre-embedding neighbors, scDEED identifies the cell embeddings with low reliability scores as dubious and those with high reliability scores as trustworthy. Moreover, by minimizing the number of dubious cell embeddings, scDEED provides intuitive guidance for optimizing the hyperparameters of an embedding method. We show the effectiveness of scDEED on multiple datasets for detecting dubious cell embeddings and optimizing the hyperparameters of t-SNE and UMAP.
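
A minimal version of the reliability-score idea can be written in a few lines: compare each cell's neighbors in the 2D embedding with its neighbors in the pre-embedding space. The sketch below uses a simple neighbor-set overlap with scikit-learn; the published scDEED statistic and its null-distribution calibration are more involved, so treat this as a schematic, with the function name and the overlap measure being our own choices.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def reliability_scores(X_pre, X_2d, k=50):
    """Per-cell overlap between k-nearest neighbors before and after embedding.

    X_pre: cells x features matrix in the pre-embedding space (e.g. top PCs).
    X_2d:  cells x 2 matrix of t-SNE or UMAP coordinates.
    """
    nn_pre = NearestNeighbors(n_neighbors=k + 1).fit(X_pre)
    nn_2d = NearestNeighbors(n_neighbors=k + 1).fit(X_2d)
    idx_pre = nn_pre.kneighbors(X_pre, return_distance=False)[:, 1:]  # drop self
    idx_2d = nn_2d.kneighbors(X_2d, return_distance=False)[:, 1:]
    return np.array([len(set(a) & set(b)) / k
                     for a, b in zip(idx_pre, idx_2d)])

# Cells whose 2D neighborhoods poorly match their pre-embedding neighborhoods
# would be flagged as dubious; embedding hyperparameters (e.g. t-SNE perplexity)
# could then be tuned to reduce the count of such cells.
```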