Assessing the Quality of Actions
Hamed Pirsiavash, Carl Vondrick, Antonio Torralba
Massachusetts Institute of Technology
Abstract. While recent advances in computer vision have provided reliable methods to recognize actions in both images and videos, the problem of assessing how well people perform actions has been largely unexplored in computer vision. Since methods for assessing action quality have many real-world applications in healthcare, sports, and video retrieval, we believe the computer vision community should begin to tackle this challenging problem. To spur progress, we introduce a learning-based framework that takes steps towards assessing how well people perform actions in videos. Our approach works by training a regression model from spatiotemporal pose features to scores obtained from expert judges. Moreover, our approach can provide interpretable feedback on how people can improve their action. We evaluate our method on a new Olympic sports dataset, and our experiments suggest our framework is able to rank the athletes more accurately than a non-expert human. While promising, our method is still a long way from rivaling the performance of expert judges, indicating that there is significant opportunity in computer vision research to improve on this difficult yet important task.
1 Introduction

Recent advances in computer vision have provided reliable methods for recognizing actions in videos and images. However, the problem of automatically quantifying how well people perform actions has been largely unexplored.
We believe the computer vision community should begin to tackle the challenging problem of assessing the quality of people's actions because there are many important, real-world applications. For example, in health care, patients are often monitored and evaluated after hospitalization as they perform daily tasks, which is an expensive undertaking without an automatic assessment method.
In sports, action quality assessments would allow an athlete to practice in front of a camera and receive quality scores in real-time, providing the athlete with rapid feedback and an opportunity to improve their action. In retrieval, a video search engine may want to sort results based on the quality of the action performed instead of only the relevance.

Fig. 1: We introduce a learning framework for assessing the quality of human actions from videos. Since we estimate a model for what constitutes a high quality action, our method can also provide feedback on how people can improve their action. (Figure annotations: "Lower Feet", "Stretch Hands"; Quality of Action: 86.5 / 100.)
However, automatically assessing the quality of actions is not an easy computer vision problem. Human experts for a particular domain, such as coaches or doctors, have typically been trained over many years to develop complex underlying rules to assess action quality. If machines are to assess action quality, then they must discover similar rules as well.
In this paper, we propose a data-driven method to learn how to assess the quality of actions in videos. To our knowledge, we are the first to propose a general framework for learning to assess the quality of human-based actions from videos. Our method works by extracting the spatio-temporal pose features of people, and, with minimal annotation, estimating a regression model that predicts the scores of actions. Fig. 1 shows an example output of our system.
In order to quantify the performance of our methods, we introduce a new dataset for action quality assessment comprised of Olympic sports footage. Although the methods in this paper are general, sports broadcast footage has the advantage that it is freely available and comes already rigorously "annotated" by the Olympic judges. We evaluate our quality assessments on both diving and figure skating competitions. Our results are promising, and suggest that our method is significantly better at ranking people's actions by their quality than non-expert humans. However, our method is still a long way from rivaling the performance of expert judges, indicating that there is significant opportunity in computer vision research to improve on this difficult yet important task.
Moreover, since our method leverages high level pose features to learn a model for action quality, we can use this model to help machines understand people in videos as well. Firstly, we can provide interpretable feedback to performers on how to improve the quality of their action. The red vectors in Fig. 1 are output from our system that instructs the Olympic diver to stretch his hands and lower his feet. Our feedback system works by calculating, for each body joint, the gradient with respect to the learned model that would have maximized the performer's score. Secondly, we can create highlights of videos by finding which segments contributed the most to the action quality, complementing work in video summarization. We hypothesize that further progress in building better quality assessment models can improve both feedback systems and video highlights.
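To make the feedback mechanism concrete, the sketch below shows one way such a gradient could be computed, assuming a linear regression model over the absolute-valued DCT pose features described in Sec. 3.1. The function name joint_feedback and the per-joint weight vector w are illustrative, not part of our released system.

```python
import numpy as np

def joint_feedback(q, w, k):
    """Gradient of a linear score model w.r.t. one joint's trajectory.

    q : (T,) normalized trajectory of one coordinate of one joint
    w : (k,) learned regression weights for this joint's DCT features
    k : number of low-frequency DCT components kept
    Returns a (T,) vector; nudging the joint along it raises the score.
    """
    T = len(q)
    # Type-II DCT matrix: A[f, n] = cos(pi * f * (2n + 1) / (2T)).
    n = np.arange(T)
    A = np.cos(np.pi * np.outer(np.arange(T), 2 * n + 1) / (2 * T))
    A_k = A[:k]                       # keep the k lowest frequencies
    Q = A_k @ q                       # low-frequency DCT coefficients
    # Features are |Q|, so chain through the absolute value via sign(Q).
    return A_k.T @ (w * np.sign(Q))   # d(w . |Q|) / dq
```

Visualized as arrows on the body joints (as in Fig. 1), the largest components of this gradient indicate which joints to move, and in which direction, to most increase the predicted score.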
The three principal contributions of this paper revolve around automatically assessing the quality of people's actions in videos. Firstly, we introduce a general learning-based framework for the quality assessment of human actions using spatiotemporal pose features. Secondly, we describe a system to generate feedback for performers in order to improve their score. Finally, we release a new dataset for action quality assessment in the hopes of facilitating future research on this task. The remainder of this paper describes these contributions in detail.
2 Related Work
This paper builds upon several areas of computer vision. We briefly review related work:
Action Assessment: The problem of action quality assessment has been relatively unexplored in the computer vision community. There have been a few promising efforts to judge how well people perform actions [1–3]; however, these previous works have so far been hand-crafted for specific actions. The motivation for assessing people's actions in healthcare applications has also been discussed before, but the technical method is limited to recognizing actions.
In this paper, we propose a generic learning-based framework with state-of-the-art features for action quality assessment that can be applied to most types of human actions. To demonstrate this generality, we evaluate on two distinct types of actions (diving and figure skating). Furthermore, our system is able to generate interpretable feedback on how performers can improve their action.
Photograph Assessment: There are several works that assess photographs, such as their quality, interestingness, and aesthetics [7, 8]. In this work, we instead focus on assessing the quality of human actions, and not the quality of the video capture or its artistic aspects.
Action Recognition: There is a large body of work studying how to recognize actions in both images [9–13] and videos [14–18], and we refer readers to excellent surveys [19, 20] for a full review. While this paper also studies actions, we are interested in assessing their quality rather than recognizing them.
Features: There are many features for action recognition using spatio-temporal bag-of-words [21, 22], interest points, feature learning, and human pose. However, so far these features have primarily been shown to work for recognition. We found that some of these features, notably the learned hierarchical features and the pose-based ones, can be used with minor adjustments for the quality assessment of actions too.
Video Summarization: This paper complements work in video summarization [26–31]. Rather than relying on saliency features or priors, we instead can summarize videos by discarding segments that did not impact the quality score of an action, thereby creating a “highlights reel” for the video.
3 Assessing Action Quality

We now present our system for assessing the quality of an action from videos. At a high level, we learn a regression model from spatio-temporal features to quality scores.
After presenting our model, we then show how our model can be used to provide feedback to the people in videos to improve their actions. We finally describe how our model can highlight segments of the video that contribute the most to the quality score.
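As a minimal sketch of the overall pipeline, the code below fits a regressor from per-video feature vectors to judge scores. The choice of a linear support vector regressor and the file names are assumptions for illustration, not a specification of our exact training setup.

```python
import numpy as np
from sklearn.svm import LinearSVR

# Hypothetical training data: one feature vector per video (e.g., the
# DCT pose features of Sec. 3.1) and the judges' score for each video.
X_train = np.load("pose_features_train.npy")  # shape (n_videos, n_features)
y_train = np.load("judge_scores_train.npy")   # shape (n_videos,)

# A linear model keeps the learned weights interpretable, which later
# lets us compute per-joint feedback gradients.
model = LinearSVR(C=1.0)
model.fit(X_train, y_train)

X_test = np.load("pose_features_test.npy")
predicted_scores = model.predict(X_test)      # estimated quality per video
```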
3.1 Features

To learn a regression model for action quality, we extract spatio-temporal features from videos. We consider two sets of features: low-level features that capture gradients and velocities directly from pixels, and high-level features based on the trajectory of human pose.
Low Level Features: Since there has been significant progress in developing features for recognizing actions, we tried using them for assessing actions too.
We use a hierarchical feature that obtains state-of-the-art performance in action recognition by learning a filter bank with independent subspace analysis.
The learned filter bank consists of spatio-temporal Gabor-like filters that capture edges and velocities. In our experiments, we use an existing implementation with the network pre-trained on the Hollywood2 dataset.
High Level Pose Features: Since most low-level features capture statistics from pixels directly, they are often difficult to interpret. As we wish to provide feedback on how a performer can improve their actions, we want the feedback to be interpretable. Inspired by actionlets, we now present high level features based on human pose that are interpretable.
Given a video, we assume that we know the pose of the human performer in every frame, obtained either through ground truth or automatic pose estimation.
Let $p^{(j)}(t)$ be the $x$-component of the $j$-th joint in the $t$-th frame of the video. Since we want our features to be translation-invariant, we normalize the joint positions relative to the head position:

$$q^{(j)}(t) = p^{(j)}(t) - p^{(0)}(t)$$

where we have assumed that $p^{(0)}(t)$ refers to the head. Note that $q^{(j)}$ is a function of time, so we can represent it in the frequency domain by the discrete cosine transform (DCT): $Q^{(j)} = A q^{(j)}$, where $A$ is the discrete cosine transformation matrix. We then use the $k$ lowest frequency components to create the feature vector $\phi_j = Q^{(j)}_{1:k} = A_{1:k} q^{(j)}$, where $A_{1:k}$ selects the first $k$ rows of $A$. We found that only using the low frequencies helps remove high frequency noise due to pose estimation errors. We use the absolute value of the frequency coefficients $Q^{(j)}_i$.
We compute $\phi_j$ for every joint for both the $x$- and $y$-components, and concatenate them to create the final feature vector $\phi$. We note that if the video is long, we break it up into segments and concatenate the per-segment features to produce one feature vector for the entire video. This increases the temporal resolution of our features for long videos.
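The feature extraction above fits in a short sketch. The code below assumes pose arrays of shape (T, J, 2) with the head as joint 0; the function name, the default k, and the fixed-length segmentation are illustrative choices rather than our exact settings.

```python
import numpy as np
from scipy.fftpack import dct

def pose_features(p, k=10, seg_len=None):
    """DCT-based pose features for one video (sketch of Sec. 3.1).

    p : (T, J, 2) array of joint positions over T frames; joint 0 is the head.
    k : number of low-frequency DCT coefficients kept per trajectory.
    seg_len : if set, split the video into segments of this many frames
              (each should have at least k frames) and concatenate features.
    """
    q = p - p[:, 0:1, :]   # translation-invariance: subtract the head position
    if seg_len is None:
        segments = [q]
    else:
        segments = [q[t:t + seg_len] for t in range(0, len(q), seg_len)]
    feats = []
    for seg in segments:
        # DCT along time for every joint and both x/y components.
        Q = dct(seg, axis=0, norm='ortho')
        feats.append(np.abs(Q[:k]).ravel())   # keep the k lowest frequencies
    return np.concatenate(feats)              # final feature vector (phi)
```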
The actionlets approach uses a similar method with the Discrete Fourier Transform (DFT) instead. Although there is a close relationship between the DFT and the DCT, we see better results using the DCT. We believe this is the case since the DCT provides a more compact representation. Additionally, DCT coefficients are real numbers instead of complex, so less information is lost in the absolute value operation.
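A small numerical check of this intuition: for a smooth but non-periodic trajectory, reconstructing from a few low-frequency DCT coefficients incurs far less error than reconstructing from the same number of low-frequency DFT bins, because the DFT treats the endpoint mismatch as a discontinuity. The toy signal below is our own illustration.

```python
import numpy as np
from scipy.fftpack import dct, idct

t = np.linspace(0, 1, 64)
x = t + 0.1 * np.sin(2 * np.pi * t)   # smooth, non-periodic toy trajectory
k = 5

# Keep only the k lowest-frequency DCT coefficients and reconstruct.
c = dct(x, norm='ortho')
c[k:] = 0
x_dct = idct(c, norm='ortho')

# Keep the k lowest-frequency DFT bins (DC plus two conjugate pairs).
F = np.fft.fft(x)
mask = np.zeros(len(F))
mask[[0, 1, 2, -2, -1]] = 1
x_dft = np.real(np.fft.ifft(F * mask))

print("DCT error:", np.linalg.norm(x - x_dct))  # substantially smaller
print("DFT error:", np.linalg.norm(x - x_dft))
```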
In order to estimate the joints of the performer throughout the video, $p^{(j)}(t)$, we run a pose estimation algorithm to find the position of the joints in every frame. We estimate the pose using a Flexible Parts Model for each frame independently. Since this model finds the best pose for a single frame using dynamic programming, and we want the best pose across the entire video, we find the N-best pose solutions per frame. We then associate the poses using a dynamic programming algorithm to find the best track in the whole video; the association looks for the single best smooth track covering the whole temporal span of the video. Fig. 2 shows some successes and failures of this pose estimation.

Fig. 2: Pose Estimation Challenges: Some results for human pose estimation on our action quality dataset. Since the performers contort their bodies in unusual configurations, pose estimation is very challenging on our dataset.
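The association step can be sketched as Viterbi-style dynamic programming over the N-best candidates per frame, trading off detector score against a smoothness cost between consecutive frames. The mean-displacement smoothness term and the smooth_weight parameter below are assumptions for illustration, not the exact costs of our system.

```python
import numpy as np

def associate_poses(candidates, scores, smooth_weight=1.0):
    """Pick one pose per frame to form the smoothest high-scoring track.

    candidates : list over frames; candidates[t] is an (N, J, 2) array
                 of N-best pose hypotheses for frame t
    scores     : list over frames; scores[t] is an (N,) array of scores
    Returns the index of the chosen candidate in each frame.
    """
    T = len(candidates)
    cost = [-np.asarray(s, dtype=float) for s in scores]  # lower is better
    back = []
    for t in range(1, T):
        # Smoothness: joint displacement between every candidate pair.
        prev = candidates[t - 1].reshape(len(candidates[t - 1]), -1)
        curr = candidates[t].reshape(len(candidates[t]), -1)
        pair = np.linalg.norm(prev[:, None, :] - curr[None, :, :], axis=2)
        total = cost[t - 1][:, None] + smooth_weight * pair
        back.append(total.argmin(axis=0))      # best predecessor per candidate
        cost[t] = cost[t] + total.min(axis=0)  # accumulated track cost
    # Backtrack the minimum-cost track.
    track = [int(np.argmin(cost[-1]))]
    for t in range(T - 2, -1, -1):
        track.append(int(back[t][track[-1]]))
    return track[::-1]
```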