Lian, Xiaochen

Mining Spatial and Spatio-Temporal ROIs for Action Recognition

2016

Advisor(s): Yuille, Alan Loddon

Abstract

In this paper, we propose an approach to classify action sequences. We observe that in action sequences the critical features for discriminating between actions occur only within sub-regions of the image. Hence deep network approaches that process the entire image are at a disadvantage. This motivates our strategy, which uses static and spatio-temporal visual cues to isolate static and spatio-temporal regions of interest (ROIs). We then use weakly supervised learning to train deep network classifiers that take the ROIs as input. More specifically, we combine multiple instance learning (MIL) with convolutional neural networks (CNNs) to select discriminative action cues. This yields classifiers for static images, using the static ROIs, as well as classifiers for short image sequences (16 frames), using the spatio-temporal ROIs. Extensive experiments on the UCF101 and HMDB51 benchmarks show that both types of classifiers perform well individually and achieve state-of-the-art performance when combined. We also show qualitatively that the ROIs selected by our algorithms capture the most relevant parts of the image sequences.

UCLA
