Exploring Aesthetics in Videos


Automatic understanding of viewers' impressions from image or video sequences is a very challenging but interesting research theme. In this paper, we demonstrate a computational approach to automatically evaluating the aesthetics of videos, with particular emphasis on identifying beautiful scenes. Using a standard classification pipeline, we analyze the effectiveness of a comprehensive set of features, ranging from low-level visual features and mid-level semantic attributes to style descriptors. In addition, since little public training data with manual labels of video aesthetics is available, we explore freely available resources under the simple assumption that people tend to share aesthetically appealing works more often than unappealing ones. Our extensive evaluations show that combining multiple features is helpful, and that very promising results can be obtained using the noisy but annotation-free training data. On the NHK Multimedia Challenge dataset, we attain a Spearman's rank correlation coefficient of 0.41.
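The evaluation metric above, Spearman's rank correlation, compares only the relative ordering of predicted and ground-truth aesthetic scores. As a minimal illustration (the scores below are toy values, not from the paper), it can be computed as:

```python
def spearman_rho(xs, ys):
    """Spearman's rank correlation 1 - 6*sum(d^2)/(n*(n^2-1)); assumes no ties."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order, start=1):
            r[i] = rank  # 1-based rank of each element
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical predicted aesthetic scores vs. ground-truth ratings
predicted = [0.9, 0.4, 0.7, 0.1]
truth = [4, 3, 2, 1]
rho = spearman_rho(predicted, truth)  # 0.8: mostly, but not perfectly, consistent ordering
```

Because only ranks matter, the predictor need not match the absolute scale of the human ratings, which suits noisy annotation-free training data.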

Computational Approach

One of the reasons that very few studies have been conducted on video aesthetics is that little training data with manual labels is publicly available. We therefore propose to construct two annotation-free training datasets by assuming that images/videos on DPChallenge and Flickr are mostly beautiful, particularly the highly rated ones, while old Dutch documentary footage is much less pleasing. Our datasets consist of 60,000 images and 3,400 videos in total, as shown in Figure 1.

We adopt a standard χ2 kernel SVM classifier to predict aesthetic quality, owing to its strong performance in many applications. Through building prediction models, we analyze the effectiveness of a large number of features, such as the visual Color Histogram, the attribute-based Classemes, the motion-based Dense Trajectory, and the Style Descriptor. Some of the selected features have been shown to be effective in predicting video interestingness [AAAI'13], and some are chosen particularly for this task based on our intuitions. Figure 2 shows our framework and a subset of the prediction results.
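The χ2 kernel used by the classifier compares histogram-like features such as the Color Histogram. A minimal sketch of the (exponential) χ2 kernel, with toy histograms that are not from the paper, might look like this; the resulting Gram matrix would then be passed to an SVM with a precomputed kernel (e.g. scikit-learn's SVC(kernel='precomputed')):

```python
import math

def chi2_kernel_value(x, y, gamma=1.0):
    """Exponential chi-squared kernel: exp(-gamma * sum_i (x_i - y_i)^2 / (x_i + y_i)).
    Assumes non-negative features (e.g. L1-normalized histograms); bins where
    both entries are zero are skipped to avoid division by zero."""
    s = sum((a - b) ** 2 / (a + b) for a, b in zip(x, y) if a + b > 0)
    return math.exp(-gamma * s)

def chi2_gram_matrix(X, gamma=1.0):
    """Pairwise kernel (Gram) matrix over a list of feature vectors."""
    return [[chi2_kernel_value(x, y, gamma) for y in X] for x in X]

# Two toy L1-normalized color histograms (illustrative values only)
X = [[0.5, 0.3, 0.2],
     [0.1, 0.1, 0.8]]
K = chi2_gram_matrix(X)
# K is symmetric with ones on the diagonal; similar histograms give values near 1
```

The χ2 distance inside the exponential down-weights differences in heavily populated bins relative to sparse ones, which is why this kernel is a common default for histogram features.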

Figure 2. The framework of our system for evaluating the aesthetic quality of videos (Left), and some representative results (Right). The combination of Color Histogram and Classemes achieves the best performance, while some features such as Dense Trajectory, which is very powerful in human action recognition, perform poorly. See the paper for more details.