eScholarship
Open Access Publications from the University of California

UC Berkeley

UC Berkeley Electronic Theses and Dissertations

Robust Semiparametric Regression Estimation Using Targeted Maximum Likelihood with Application to Biomarker Discovery and Epidemiology

Abstract

In many scientific studies, the goal is to determine the effect of a particular feature or variable on a given outcome in order to help understand, identify, and quantify the driving factors behind a particular phenomenon. This type of analysis is commonly referred to as variable importance analysis. Parametric methods used to estimate these effects are prone to bias. This bias is often the result of incorrect model specification and improper inference for the parameter of interest. Alternative machine learning techniques, such as Random Forest, often result in abstract measures of importance whose inference depends on a computationally intensive bootstrap analysis. In this thesis, robust estimators for variable importance based on targeted maximum likelihood methodology are presented and developed for three types of outcome: (1) univariate continuous, (2) multivariate continuous, and (3) binary. These estimators are specifically designed to target the effect of a variable of interest on an outcome while adjusting for confounders when the variable of interest is of general form (i.e., continuous or discrete). When the outcome is continuous (cases 1 and 2), the effect is on an additive scale. When the outcome is binary (case 3), the effect is on a multiplicative scale, and the importance measure is a relative risk. The estimators are developed under a flexible semiparametric model, in which only components related to the variable of interest must be fully specified, and effect modification can be easily incorporated. Based on targeted maximum likelihood theory, the presented estimators are double robust and locally efficient, and correct inference for the parameter of interest is available using the corresponding influence curve.
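
As a rough sketch of what such a semiparametric model can look like, the block below writes Y for the outcome, A for the variable of interest, and W for the confounders. This notation, the function m, and the example with W_1 are assumptions introduced for illustration; the abstract does not state them. The idea is that only a parametric form m(A, W | β) with m(0, W | β) = 0 is specified, while the dependence of the outcome on W at A = 0 is left unspecified.

```latex
% Continuous outcome: effect of A on an additive scale
E[Y \mid A, W] - E[Y \mid A = 0, W] = m(A, W \mid \beta)

% Binary outcome: effect of A on a multiplicative scale (relative risk)
\log \frac{P(Y = 1 \mid A, W)}{P(Y = 1 \mid A = 0, W)} = m(A, W \mid \beta)

% Hypothetical choice of m allowing effect modification by a covariate W_1
m(A, W \mid \beta) = \beta_0 A + \beta_1 A W_1
```

Under this reading, β is the variable importance parameter, and a nonzero β_1 in the last line would indicate that the effect of A varies with W_1.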

In this thesis, the three estimators relating to the three outcomes are derived from targeted maximum likelihood methodology and implemented by adapting standard statistical regression software. These estimators are evaluated in simulation and applied to real data. In a simulated biomarker discovery analysis, the robustness of the estimator for a univariate continuous outcome is compared to other common methods of variable importance under increasing correlation among the covariates. In a repeated measures setting, the double robust property of the estimator for a multivariate continuous outcome is demonstrated in simulation, and the estimator is applied in a transcription factor analysis to determine the activity level of transcription factors during the cell cycle in yeast. For a binary outcome, the estimator for the relative risk is applied to estimate the effect of HIV genetic susceptibility scores on viral response. Effect modification is also explored, and model selection methodology is introduced.
