Predicting the Interestingness of Videos

A Computational Approach



Figure 1: Two example clips. The left one is considered more interesting.

Overview

The number of videos available on the Web is growing explosively. While some videos are very interesting and receive high ratings from viewers, many others are less interesting or even boring. A measure of video interestingness can be used to improve user satisfaction in many applications. For example, in Web video search, among videos of similar relevance to a query, the more interesting ones could be ranked higher.

Here we conduct, for the first time, a pilot study of how humans perceive video interestingness, and design a computational method to identify more interesting videos. To this end we first construct two datasets, one of Flickr videos and one of YouTube videos. Human judgements of interestingness are collected and used as ground truth for training computational models. On both datasets, we evaluate several off-the-shelf visual and audio features that are potentially useful for predicting interestingness.


Datasets

To support the study we built two benchmark datasets with ground-truth interestingness labels. The first dataset (1,200 videos) was collected from Flickr, which ranks its search results by a criterion called "interestingness". The second dataset (420 videos) was collected from YouTube, which offers no comparable ranking criterion, so we built an online pairwise annotation system and hired 10 human assessors to provide interestingness judgements for the videos. To stimulate further research on this challenging problem, both datasets have been released.
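This page does not spell out how the pairwise judgements are turned into per-video labels. Below is a minimal sketch of one common way to do it, assuming each judgement is stored as a (video_a, video_b, winner) tuple; the function name and data layout are illustrative, not the released datasets' actual format.

```python
from collections import defaultdict

def aggregate_pairwise_judgements(comparisons):
    """Turn pairwise judgements into per-video interestingness scores.

    `comparisons` is an iterable of (video_a, video_b, winner) tuples,
    where `winner` is the id of the clip judged more interesting.
    Returns a dict mapping each video id to its win fraction.
    """
    wins = defaultdict(int)
    appearances = defaultdict(int)
    for video_a, video_b, winner in comparisons:
        appearances[video_a] += 1
        appearances[video_b] += 1
        wins[winner] += 1
    return {vid: wins[vid] / appearances[vid] for vid in appearances}

# Example: three judgements over three hypothetical clips.
scores = aggregate_pairwise_judgements([
    ("clip1", "clip2", "clip1"),
    ("clip1", "clip3", "clip1"),
    ("clip2", "clip3", "clip3"),
])
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```

A win-fraction score like this gives a simple total ordering of the videos; more elaborate aggregation models could be substituted without changing the rest of the pipeline.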


Computational Approach


We designed and implemented a computational system to compare the interestingness of videos, using a variety of features spanning visual, audio, and high-level attribute cues, such as visual SIFT, audio MFCC, and ObjectBank attributes. Given two videos, the system automatically predicts which one is more interesting. The prediction framework and some representative results are shown in Figure 2. Overall, we observed very promising results on both datasets. For more details, please refer to our AAAI 2013 paper.
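The paper's exact learning method is not reproduced here. As an illustration of the pairwise-prediction setup, the following sketch trains a linear SVM on feature-difference vectors, so that the model's decision on the difference between two clips indicates which one is predicted to be more interesting. The feature dictionary, helper names, and scikit-learn usage are assumptions made for this example, not the system's actual implementation.

```python
import numpy as np
from sklearn.svm import LinearSVC

def make_pairwise_training_set(features, pairs):
    """Build a binary classification problem from ordered pairs.

    `features` maps video id -> feature vector (a stand-in for
    concatenated visual/audio/attribute descriptors); `pairs` is a
    list of (more_interesting_id, less_interesting_id) tuples.
    Each pair yields two difference vectors with opposite labels.
    """
    X, y = [], []
    for winner, loser in pairs:
        diff = features[winner] - features[loser]
        X.append(diff)
        y.append(1)
        X.append(-diff)
        y.append(0)
    return np.array(X), np.array(y)

def predict_more_interesting(model, features, vid_a, vid_b):
    """Return the id of the clip the model ranks as more interesting."""
    diff = (features[vid_a] - features[vid_b]).reshape(1, -1)
    return vid_a if model.predict(diff)[0] == 1 else vid_b

# Toy usage with random 100-dimensional features as placeholders.
rng = np.random.default_rng(0)
features = {f"v{i}": rng.normal(size=100) for i in range(6)}
pairs = [("v0", "v1"), ("v2", "v3"), ("v4", "v5")]
X, y = make_pairwise_training_set(features, pairs)
model = LinearSVC().fit(X, y)
print(predict_more_interesting(model, features, "v0", "v1"))
```

Training on signed feature differences is one standard way to cast pairwise comparison as binary classification; the learned weights can also score unseen videos individually by projecting their feature vectors onto the weight vector.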

     

Figure 2: The prediction framework of our computational system (left) and a subset of the prediction results (right). Visual, audio, and attribute-based features are all useful, and combining multimodal features leads to further improvements. We also report several other interesting findings; see the paper for details.
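One simple way to combine multimodal features, shown as a sketch below, is late fusion: each modality's ranker produces a score per video and the scores are averaged (or weighted). The modality names and score dictionaries here are assumed for illustration; the paper may use a different fusion strategy.

```python
def fuse_modality_scores(scores_by_modality, weights=None):
    """Late fusion: combine per-modality ranking scores for each video.

    `scores_by_modality` maps a modality name (e.g. "sift", "mfcc",
    "objectbank") to a dict of video id -> score from that modality's
    ranker. Higher fused scores mean "predicted more interesting".
    """
    modalities = list(scores_by_modality)
    if weights is None:
        weights = {m: 1.0 / len(modalities) for m in modalities}
    videos = set().union(*(scores_by_modality[m] for m in modalities))
    return {
        vid: sum(weights[m] * scores_by_modality[m].get(vid, 0.0)
                 for m in modalities)
        for vid in videos
    }

# Toy usage: clip1 wins on two of the three modalities.
fused = fuse_modality_scores({
    "sift": {"clip1": 0.8, "clip2": 0.2},
    "mfcc": {"clip1": 0.4, "clip2": 0.6},
    "objectbank": {"clip1": 0.7, "clip2": 0.3},
})
print(max(fused, key=fused.get))
```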