UCLA Electronic Theses and Dissertations

Deep 3D Embodied Visual Recognition

Abstract

Deep 3D embodied visual recognition refers to scenarios in which an autonomous agent must accomplish designated tasks (driving, searching, grasping, etc.) in a 3D environment. To achieve the task at hand, the agent must have a good geometric and semantic understanding of the 3D objects and scenes around it, such as the intention of a pedestrian or the pose and shape of an obstacle. In this thesis, I focus on driving-related problems. Methods for developing a self-driving agent can be roughly classified into two types: multi-stage and end-to-end models. 1) A typical multi-stage pipeline involves a series of steps such as perception, planning, and control. My work mostly concerns the perception module and includes studying diverse 3D data representations (e.g., 3D bounding boxes, morphable skeleton graphs, raw point clouds, parametric meshes, deep voxels, and implicit surface functions) as well as supervised and self-supervised 3D representation learning strategies. Different representations and learning strategies trade off against the requirements of the task, the availability of data annotation, and the available modeling tools. 2) The other thread of my research concerns end-to-end learned visual driving models, which use a single deep network to process sensor data (e.g., color images, LiDAR points) and estimate low-level controls such as brake, throttle, and steering angle. The goal is to learn, by imitation, an autonomous agent that generalizes to various driving conditions, maps, and weather.
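As a concrete illustration of the last representation listed above, an implicit surface function encodes a shape as the zero level set of a function over 3D space, so the geometry can be queried at arbitrary points rather than stored as a mesh or voxel grid. The following is a minimal analytic sketch in NumPy; the `sphere_sdf` helper is a hypothetical example for illustration, not a function from the thesis:

```python
import numpy as np

def sphere_sdf(points, center=np.zeros(3), radius=1.0):
    """Signed distance to a sphere: negative inside, zero on the
    surface, positive outside. The surface is the zero level set."""
    return np.linalg.norm(points - center, axis=-1) - radius

# The shape is encoded by the function itself and can be evaluated
# at any continuous 3D location.
queries = np.array([[0.0, 0.0, 0.0],   # inside  -> -1.0
                    [1.0, 0.0, 0.0],   # surface ->  0.0
                    [2.0, 0.0, 0.0]])  # outside ->  1.0
print(sphere_sdf(queries))
```

Learned variants replace the analytic formula with a neural network that regresses the signed distance (or occupancy) from a query coordinate, which is what makes this representation attractive for supervised and self-supervised learning.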
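To make the second thread concrete, below is a minimal behavior-cloning sketch in PyTorch (the framework choice, network architecture, tensor shapes, and data here are illustrative assumptions, not the models studied in the thesis): a single network maps camera frames to low-level controls and is trained to imitate recorded expert actions.

```python
import torch
import torch.nn as nn

class DrivingPolicy(nn.Module):
    """Illustrative image-to-control network: a small CNN encoder
    followed by a regression head for [steering, throttle, brake]."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(32, 3)  # steering, throttle, brake

    def forward(self, images):
        return self.head(self.encoder(images))

policy = DrivingPolicy()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

# One behavior-cloning step: regress the expert's recorded controls.
images = torch.randn(8, 3, 64, 64)    # stand-in camera frames
expert_controls = torch.randn(8, 3)   # stand-in expert labels
loss = nn.functional.mse_loss(policy(images), expert_controls)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

In practice such models are trained on large driving logs, and generalization across conditions, maps, and weather hinges on the diversity of the demonstrations rather than on the loss itself.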
