Skip to main content
eScholarship
Open Access Publications from the University of California

UC San Diego

UC San Diego Electronic Theses and Dissertations bannerUC San Diego

Learning Toward Object Invariance Across Modalities

Abstract

Humans possess a remarkable ability to identify and pinpoint objects within complex environments, even from partial views or descriptions. For example, when we are asked to ``grab a green mug next to coffee machine", we can easily identify the green mug regardless of what view of the green mug is presented to us and distinguish it from other objects. This requires the model to comprehend the instruction, pick up the correct object, and be invariant with object views.

In this thesis, I start investigating object invariance from systematically analyzing what are the challenging views of an object such that the existing models fail to recognize the object. We show that existing models cannot recognize the object from challenging object views. To address this problem, I proposed a training method ``PIE" to encourage the model to learn a structured feature space that is view-invariant. However, PIE learns the structured feature space in supervised manner, which requires annotation from human. To overcome this limitation, I further proposed VISPE to learn the view-invariant feature in a self-supervised manner for multiview classification and retrieval tasks.

In addition to learning invariance with images, the use of language as an abstraction of invariant object representation is investigated. I first discover that the existing visual language models fails to produce a consistent prediction when the object class name changes to different granularity (e.g. from Bengal tiger to tiger). A novel prompt tuning methods is proposed to regularize model prediction and to better align language and vision.However, it remains a question whether these visual language models can realize the existence of an object. This is investigated by curating a visual question answering dataset that contains ``unanswerable questions", where the referred object is not in the image. An unsupervised approach is proposed to encourage the model to be aware of unanswerable questions. These efforts have improved the understanding of how to learn object invariance across modalities.

Main Content
For improved accessibility of PDF content, download the file to your device.
Current View