UCLA Electronic Theses and Dissertations

Optimal Visual Representation Engineering and Learning for Computer Vision

Abstract

Estimating an optimal representation from sensor data has been one of the most challenging problems in computer vision research. Given a particular task, an optimal representation should contain exactly the information needed to answer queries related to that task. Specifically, such a representation should be a sufficient statistic of the data that is invariant to nuisance factors that affect the data but are irrelevant to the task. Among all sufficient statistics, we seek the minimal one, which has the lowest complexity; with respect to invariance, we seek the maximal, so that nuisances do not affect inference at test time.
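These desiderata admit a compact information-theoretic statement. The following is a sketch of one standard formalization, not taken verbatim from the dissertation: x denotes the data, y the task variable, g a nuisance transformation acting on the data, and phi the representation.

```latex
% Desiderata for an optimal representation z = \phi(x) of data x for task y,
% with a nuisance g acting on the data (illustrative formalization).
\begin{align*}
  &\text{Sufficiency:}\quad  I(\phi(x); y) = I(x; y) \\
  &\text{Minimality:}\quad   \min_{\phi}\; I(\phi(x); x)
      \;\;\text{subject to sufficiency} \\
  &\text{Invariance:}\quad   \phi(g \cdot x) = \phi(x)
      \;\;\text{for all nuisances } g
\end{align*}
```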

In the first part of the dissertation, we show that it is possible to build such an optimal local descriptor, a minimal sufficient statistic of the data that is maximally invariant to certain nuisance variables, for the problem of establishing feature correspondence. Given a single image, the nuisance group that can be handled is quite restricted, since one view does not afford the ability to distinguish the intrinsic properties of the scene from the extrinsic ones. This restriction is lifted once multiple views of the same underlying scene become available. We propose a theoretical framework that computes an optimal multiple-view local representation in which the domain deformations induced by viewpoint changes are marginalized.

In the second part, we investigate the nuisance-management ability of deep neural networks in the context of image classification and show that an explicit sampling-based marginalization technique, illustrated in the sketch below, can significantly improve classification performance. This is in line with the principle developed in the first part.

Finally, we build a real-time system that estimates a visual-inertial-semantic representation of the 3D scene from both imaging and inertial measurements. Evidence from the imaging and inertial measurements is causally aggregated into the final estimate in a Bayesian filtering framework; the geometric and semantic properties of the scene do not depend on the pose and motion of the camera and are persistent over time.
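As a rough illustration of the sampling-based marginalization mentioned above, the following sketch averages a classifier's softmax outputs over randomly sampled nuisance transformations at test time. This is a minimal sketch under our own assumptions (a PyTorch-style model exposing a hypothetical num_classes attribute, with torchvision crops and flips standing in for the nuisances), not the dissertation's implementation.

```python
import torch
import torchvision.transforms as T

# Hypothetical nuisance sampler: random crops and horizontal flips
# stand in for the domain deformations being marginalized.
nuisance = T.Compose([
    T.RandomResizedCrop(224, scale=(0.7, 1.0)),
    T.RandomHorizontalFlip(),
])

@torch.no_grad()
def marginalized_predict(model, image, num_samples=16):
    """Average class posteriors over sampled nuisance transformations,
    approximating marginalization of the nuisance at test time."""
    model.eval()
    probs = torch.zeros(model.num_classes)  # assumes model exposes num_classes
    for _ in range(num_samples):
        x = nuisance(image).unsqueeze(0)            # sample one nuisance instance
        probs += torch.softmax(model(x)[0], dim=0)  # accumulate class posterior
    return probs / num_samples                      # Monte Carlo average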
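The causal aggregation in the final part can be pictured as a recursive Bayesian update: because the scene's semantic identity is persistent and independent of camera motion, per-frame detection likelihoods can simply be accumulated into a posterior over time. The sketch below is our own simplification (a discrete Bayes filter over class labels for a single static object), not the actual system.

```python
import numpy as np

def bayes_update(prior, likelihood):
    """One causal filtering step: posterior ∝ likelihood × prior.
    The state (object class) is static, so no motion model is needed."""
    posterior = likelihood * prior
    return posterior / posterior.sum()

# Hypothetical stream of per-frame class likelihoods from a detector,
# e.g. over the classes ("chair", "table", "door").
frames = [
    np.array([0.5, 0.3, 0.2]),
    np.array([0.6, 0.3, 0.1]),
    np.array([0.7, 0.2, 0.1]),
]

belief = np.full(3, 1.0 / 3.0)  # uniform prior over the classes
for lik in frames:
    belief = bayes_update(belief, lik)  # evidence aggregates causally, in order
print(belief)  # the posterior sharpens as consistent evidence accumulates
```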
