Discriminative Video Pattern Search for Efficient Action Detection

Junsong Yuan, Zicheng Liu and Ying Wu

 

Abstract

Actions are spatio-temporal patterns. As in sliding-window object detection, action detection finds the re-occurrences of such spatio-temporal patterns through pattern matching. We address two critical issues in pattern matching-based action detection: (1) tolerance of intra-pattern variations of actions, such as variations in performing speed and scale, and (2) the efficiency of action pattern search in videos. First, we propose a discriminative pattern matching criterion for multi-class action categorization, called naive-Bayes based mutual information maximization (NBMIM). Each action is characterized by a collection of spatio-temporal invariant features, and we measure its mutual information with respect to the different action classes. Second, for efficient action detection, we propose a novel branch-and-bound search algorithm that locates the optimal subvolume in the volumetric video space. The proposed method is purely data-driven and does not rely on human detection, tracking, or background subtraction. It handles intra-pattern variations in actions, such as scale and speed variations, and is insensitive to dynamic and cluttered backgrounds and even to partial occlusions. Action categorization experiments on the standard KTH dataset improve on the state-of-the-art results. Cross-dataset experiments on action detection, covering the KTH and CMU action datasets and another multi-action dataset, demonstrate the effectiveness and efficiency of the proposed detection method.

 

 

 

The following MSR action dataset used for the CVPR 09 paper is available for noncommercial research use. Here is the license agreement.

 

MSR action dataset          Ground truth file          Code (coming soon!)               

 

If you use this dataset, please cite the following paper:

Junsong Yuan, Zicheng Liu and Ying Wu, Discriminative Subvolume Search for Efficient Action Detection. IEEE Conf. on Computer Vision and Pattern Recognition, 2009.

 

 

Dataset description:

The test dataset contains 16 video sequences with 63 actions in total: 14 hand clapping, 24 hand waving, and 25 boxing, performed by 10 subjects. Each sequence contains multiple types of actions, and some sequences contain actions performed by different people. There are both indoor and outdoor scenes. All of the video sequences are captured with cluttered and moving backgrounds. Each video has a low resolution of 320 x 240 and a frame rate of 15 frames per second. The sequences are between 32 and 76 seconds long. To evaluate performance, we manually labeled a spatio-temporal bounding box for each action. The ground truth labels can be found in the groundtruth.txt file. Each labeled action is given in the format "X width Y height T length".
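
As a rough illustration of the "X width Y height T length" format, the Python sketch below parses one labeled action into a spatio-temporal bounding box. It rests on several assumptions that are not stated above: that a record is six whitespace-separated integers, that X and Y give the box's starting corner in pixels, and that T and length are counted in frames. The names ActionBox and parse_box and the sample values are hypothetical, and the actual layout of groundtruth.txt may include further fields, so check the file before relying on this.

```python
from dataclasses import dataclass

FPS = 15  # frame rate given in the dataset description


@dataclass(frozen=True)
class ActionBox:
    """One labeled action: fields in the order "X width Y height T length"."""
    x: int
    width: int
    y: int
    height: int
    t: int        # assumed: index of the first frame of the action
    length: int   # assumed: duration of the action in frames

    @property
    def x_end(self) -> int:
        return self.x + self.width

    @property
    def y_end(self) -> int:
        return self.y + self.height

    @property
    def t_end(self) -> int:
        return self.t + self.length

    @property
    def duration_seconds(self) -> float:
        # Only meaningful if `length` really is a frame count.
        return self.length / FPS


def parse_box(record: str) -> ActionBox:
    """Parse a single record of six whitespace-separated integers (assumed layout)."""
    fields = record.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}: {record!r}")
    return ActionBox(*(int(f) for f in fields))


if __name__ == "__main__":
    # Hypothetical sample values, not taken from groundtruth.txt.
    box = parse_box("10 50 20 80 30 45")
    print(box, box.x_end, box.y_end, box.t_end, box.duration_seconds)
```

Keeping the fields in the same order as the format string makes it easy to compare a parsed box against the raw record by eye.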

 

 

Sample results: