eScholarship
Open Access Publications from the University of California

UCLA

UCLA Electronic Theses and Dissertations

Gene Selection Methods for Single-cell Sequencing Data Analysis

Abstract

Since their advent around 15 years ago, single-cell RNA sequencing (scRNA-seq) technologies have become a powerful tool for characterizing cell-to-cell heterogeneity within cell populations in various biological systems, and they have revolutionized transcriptomic studies. A typical scRNA-seq dataset contains thousands to tens of thousands of genes; however, a subset of genes is usually sufficient to represent the underlying biological variation among cells that is relevant to researchers’ interests. Selecting such a subset serves three purposes: (1) highlighting and enhancing biological signals, (2) improving the interpretability of analysis results, and (3) reducing the number of genes to save computational or human resources. Hence, a number of gene selection methods have been developed for various tasks, for instance, informative gene selection for cell clustering and post-clustering identification of differentially expressed genes for cell type annotation. However, existing efforts have not fully addressed the problem: among the genes selected by existing methods, many are irrelevant, redundant, or insignificant. Gene selection with biologically meaningful interpretation and statistical rigor remains challenging for certain single-cell analysis tasks. This dissertation aims to address these challenges in two projects.

My first project focuses on informative gene selection in general scRNA-seq data analysis and extends it to guide the design of targeted gene profiling experiments. Unlike scRNA-seq, targeted gene profiling requires a limited number of genes (often no more than a few hundred) to be specified before sequencing. In Chapter 2, we propose the single-cell Projective Non-negative Matrix Factorization (scPNMF) method, which builds on the PNMF algorithm and adds a basis selection step. scPNMF outperforms existing informative gene selection methods in that the limited number of genes it selects better distinguishes cell types. It also enables the alignment of new targeted gene profiling data with reference data in a low-dimensional space, which facilitates the prediction of cell types in the new data.

My second project addresses post-clustering differentially expressed (DE) gene identification for cell-type annotation. Here the selected genes serve as potential cell-type marker genes: by being matched against canonical marker genes, they are crucial in determining the cell types in single-cell sequencing data. The typical analysis workflow has two steps: first, clustering; second, finding DE genes between cell clusters. Despite its popularity, this workflow suffers from an issue known as “double dipping” (the same data is used twice, once to define cell clusters and once to find DE genes), which leads to false-positive cell-type marker genes when the cell clusters are spurious. To overcome this challenge, in Chapter 3, we propose ClusterDE, a post-clustering DE method that controls the false discovery rate (FDR) of identified DE genes regardless of clustering quality and that can work as an add-on to popular pipelines such as Seurat. The core idea of ClusterDE is to generate real-data-based synthetic null data containing only one cluster, as a contrast to the real data, for evaluating the whole procedure of clustering followed by a DE test. ClusterDE is fast, transparent, and adaptive to a wide range of clustering algorithms and DE tests. Beyond scRNA-seq data, ClusterDE is generally applicable to post-clustering DE analysis, including single-cell multi-omics data analysis.
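
The double-dipping issue described above can be illustrated with a small simulation. The sketch below is not ClusterDE and does not use Seurat; it is a minimal illustration, under stated assumptions, of why the naive two-step workflow produces false positives. The assumptions are: expression values are independent standard normal noise (so there is truly one cluster and no DE gene), clustering is a plain two-cluster k-means written in NumPy, and the DE test is a per-gene Welch statistic compared with a normal cutoff of 1.96. All function and variable names are hypothetical.

```python
import numpy as np


def two_means(X, rng, n_iter=50):
    """Plain Lloyd's algorithm with k=2; returns a 0/1 label per cell."""
    # Initialize the two centers at two distinct, randomly chosen cells.
    centers = X[rng.choice(len(X), size=2, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Squared Euclidean distance from every cell to each center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for k in range(2):
            if np.any(labels == k):  # keep the old center if a cluster empties
                centers[k] = X[labels == k].mean(axis=0)
    return labels


def fraction_called_de(X, labels, z_cutoff=1.96):
    """Fraction of genes whose two-group Welch statistic exceeds the cutoff."""
    a, b = X[labels == 0], X[labels == 1]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    z = (a.mean(axis=0) - b.mean(axis=0)) / se
    return float(np.mean(np.abs(z) > z_cutoff))


rng = np.random.default_rng(0)
n_cells, n_genes = 1000, 20

# Assumed null data: one homogeneous population, no gene is truly DE.
X = rng.normal(size=(n_cells, n_genes))

# Naive workflow: cluster the data, then test the same data between clusters.
cluster_labels = two_means(X, rng)
naive_rate = fraction_called_de(X, cluster_labels)

# Reference: labels assigned at random, independently of the data.
random_labels = rng.integers(0, 2, size=n_cells)
random_rate = fraction_called_de(X, random_labels)

print(f"genes called DE after clustering:  {naive_rate:.2f}")
print(f"genes called DE with random labels: {random_rate:.2f}")
```

Every gene called DE here is a false positive, because the data contain no real clusters. The random-label run is included only as a reference point: it shows what the same test does when the grouping does not depend on the data being tested.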
