Abstract. While recent advances in computer vision have provided reliable methods to recognize actions in both images and videos, the problem of ...
We then pose quality assessment as a supervised regression problem. Let Φ_i ∈ R^{k×n} be the pose features for video i in matrix form, where n is the number of joints and k is the number of low-frequency components. We write y_i ∈ R to denote the ground-truth quality score of the action in video i, obtained by an expert human judge. We then train a linear support vector regression (L-SVR) to predict y_i given features Φ_i over a training set. In our experiments, we use libsvm. Optimization is fast, taking less than a second on typically sized problems. We perform cross-validation to estimate hyperparameters.
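A minimal sketch of this pipeline, with hypothetical shapes and hyperparameters, toy random data in place of real pose tracks, and scikit-learn's LinearSVR standing in for libsvm:

```python
import numpy as np
from scipy.fft import dct
from sklearn.svm import LinearSVR

def pose_features(trajectories, k=10):
    """trajectories: (n_joints, T) array of a joint coordinate over time.
    Returns the k lowest-frequency DCT coefficients per joint, flattened."""
    coeffs = dct(trajectories, axis=1, norm="ortho")  # DCT along time
    return coeffs[:, :k].ravel()                      # keep low frequencies

rng = np.random.default_rng(0)
videos = [rng.standard_normal((13, 150)) for _ in range(40)]  # toy pose tracks
y = rng.uniform(20, 100, size=40)                             # toy judge scores

X = np.stack([pose_features(v) for v in videos])              # (40, 13 * 10)
model = LinearSVR(C=1.0, epsilon=0.1, max_iter=10000).fit(X, y)
pred = model.predict(X)
```

In practice, C and epsilon would be chosen by the cross-validation mentioned above, and the trajectories would come from a pose estimator rather than random noise.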
Domain Knowledge: We note that a comprehensive model for quality assessment might use domain experts to annotate fine-tuned knowledge of the action's quality (e.g., "the leg must be straight"). However, relying on domain experts is expensive and difficult to scale to a large number of actions. By posing quality assessment as a machine learning problem with minimal interaction from an expert, we can scale more efficiently. In our system, we only require a single real number per video corresponding to the quality score.
Prototypical Example: A seemingly simple method to assess quality is to check the observed video against a ground-truth video with perfect execution, and then measure the difference. However, in practice, many actions can have multiple ideal executions (e.g., a perfect overhand serve might be just as good as a perfect underhand serve). Instead, our model can handle multi-modal score distributions.
6 Pirsiavash, Vondrick, Torralba
3.3 Feedback Proposals

As a performer executes an action, in addition to assessing the quality, we also wish to provide feedback on how the performer can improve his action. Since our regression model operates over pose-based features, we can determine how the performer should move to maximize the score.
We accomplish this by differentiating the scoring function with respect to joint location. We calculate the gradient of the score with respect to the location of each joint, ∂S/∂p_j(t), where S is the scoring function and p_j(t) is the location of joint j at time t. By finding the maximum gradient, we can determine where the performer must move to most improve the score.¹
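One hypothetical sketch of this computation: for a linear model S = ⟨w, Φ⟩ whose features are k DCT coefficients per joint, the gradient with respect to a joint's trajectory is the inverse DCT of that joint's (zero-padded) weight vector, since the orthonormal inverse transform is the transpose of the forward one. Shapes and weights below are illustrative, not the authors' trained model:

```python
import numpy as np
from scipy.fft import idct

n_joints, T, k = 13, 150, 10
rng = np.random.default_rng(0)
w = rng.standard_normal((n_joints, k))    # stand-in for learned L-SVR weights

# dS/dp_j(t): pad each joint's k weights to length T, inverse-DCT along time
w_full = np.zeros((n_joints, T))
w_full[:, :k] = w
grad = idct(w_full, axis=1, norm="ortho")  # (n_joints, T) gradient field

# the joint/time with the largest-magnitude gradient is the feedback proposal
j_best, t_best = np.unravel_index(np.abs(grad).argmax(), grad.shape)
```

The sign of grad[j_best, t_best] then says which direction that joint should move to raise the score.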
3.4 Video Highlights

In addition to finding the joint that will result in the largest score improvement, we also wish to measure the impact a segment of the video has on the quality score. Such a measure could be useful in summarizing the segments of actions that contribute to high or low scores.
We define a segment's impact as how much the quality score would change if the segment were removed. In order to remove a segment, we compute the most likely feature vector had we not observed the missing segment. The key observation is that since we only use the low-frequency components in our feature vector, there are more equations than unknowns when estimating the DCT coefficients. Consequently, removing a segment corresponds to simply removing some equations.
Let B = A⁺ be the inverse cosine transform, where A⁺ is the pseudo-inverse of A. Then, the DCT equation can be written as Q^(j) = B⁺ q^(j). If the data from a segment is missing, we simply drop the corresponding rows and solve the remaining over-determined system in the least-squares sense.

¹ We do not differentiate with respect to the head location because it is used for normalization.
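The steps above can be sketched numerically (illustrative sizes; a 150-frame trajectory described by 10 low-frequency coefficients): dropping frames removes rows of the tall inverse-DCT basis, and the remaining over-determined system still recovers the coefficients by least squares.

```python
import numpy as np
from scipy.fft import idct

T, k = 150, 10
# T x k inverse-DCT basis: column c is the c-th low-frequency basis signal
B = idct(np.eye(T), axis=0, norm="ortho")[:, :k]

rng = np.random.default_rng(0)
Q_true = rng.standard_normal(k)     # "true" low-frequency coefficients
p = B @ Q_true                      # full trajectory they generate

keep = np.r_[0:60, 90:150]          # remove the segment of frames 60..89
Q_hat, *_ = np.linalg.lstsq(B[keep], p[keep], rcond=None)
```

With no noise, the 120 remaining equations pin down the 10 unknowns exactly; the reconstructed trajectory B @ Q_hat is the "most likely" completion used to measure a segment's impact.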
Assessing the Quality of Actions 7
Fig. 3: Interpolating Segments: This schematic shows how the displacement vector changes when a segment of the video is removed in order to compute impact. The dashed curve is the original displacement, and the solid curve is the most likely displacement given observations with a missing segment.
4 Experiments

In this section, we evaluate both our quality assessment method and feedback system for quality improvement with quantitative experiments. Since quality assessment has not yet been extensively studied in the computer vision community, we first introduce a new video dataset for action quality assessment.
4.1 Action Quality Dataset

There are two primary hurdles in building a large dataset for action quality assessment. Firstly, the score annotations are subjective, and require an expert.
Unfortunately, hiring an expert to annotate hundreds of videos is expensive.
Secondly, in some applications such as health care, there are privacy and legal issues involved in collecting videos from patients. In order to establish a baseline dataset for further research, we desire freely available videos.
We introduce an Olympics video dataset for action quality assessment. Sports footage has the advantage that it can be obtained freely, and the expert judges' scores are frequently released publicly. We collected videos from YouTube for two categories of sports, diving and figure skating, from recent Olympics and other worldwide championships. The videos are long, with multiple instances of actions performed by multiple people. We annotated the videos with the start and end frame for each instance, and we extracted the judge's score. The dataset will be publicly available.
Fig. 4: Diving Dataset: Some of the best dives from our diving dataset. Each column corresponds to one video. There is a large variation in the top-scoring actions. Hence, providing feedback is not as easy as pushing the action towards a canonical "good" performance.
Fig. 5: Figure Skating Dataset: Sample frames from our ﬁgure skating dataset.
Notice the large variations of routines that the performers attempt. This makes automatic pose estimation challenging.
Diving: Fig. 4 shows a few examples from our diving dataset. Our diving dataset consists of 159 videos. The videos are slow-motion footage from television broadcasting channels, so the effective frame rate is 60 frames per second. Each video is about 150 frames, and the entire dataset consists of 25,000 frames. The ground-truth judge scores vary between 20 (worst) and 100 (best). In our experiments, we use 100 instances for training and the rest for testing. We repeated every experiment 200 times with different random splits and averaged the results. In addition to the Olympic judges' scores, we also consulted with the MIT varsity diving coach, who annotated which joints a diver should adjust to improve each dive. We use this data to evaluate our feedback system for the quality improvement algorithm.
Figure Skating: Fig. 5 shows some frames from our figure skating dataset. This dataset contains 150 videos captured at 24 frames per second. Each video is almost 4,200 frames, and the entire dataset is 630,000 frames. The judges' scores range between 0 (worst) and 100 (best). We use 100 instances for training and the rest for testing. As before, we repeated every experiment 200 times with different random splits and averaged the results. We note that our figure skating dataset tends to be more challenging for pose estimation since it is at a lower frame rate, and has more variation in human pose and clothing (e.g., skirts).
4.2 Quality Assessment

We evaluate our quality assessment on both the figure skating and diving datasets.
In order to compare our results against the ground truth, we use the rank correlation of the scores we predict against the scores the Olympic judges awarded.
Tab. 1 and Tab. 2 show the mean performance over random train/test splits of our datasets. Our results suggest that pose-based features are competitive, and even obtain the best performance on the diving dataset. In addition, our results indicate that features learned to recognize actions can be used to assess the quality of actions too. We show some of the best and worst videos as predicted by our model in Fig. 6.
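The evaluation protocol described above can be sketched as follows, on synthetic data and with a ridge-regression stand-in for the learned model (the dataset sizes mirror the diving setup; everything else is illustrative):

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.standard_normal((159, 130))                 # toy feature matrix
y = X[:, 0] * 10 + 60 + rng.standard_normal(159)    # synthetic judge scores

corrs = []
for _ in range(20):                                 # 200 repeats in the paper
    idx = rng.permutation(len(y))
    tr, te = idx[:100], idx[100:]                   # 100 train, rest test
    model = Ridge(alpha=1.0).fit(X[tr], y[tr])
    rho, _ = spearmanr(model.predict(X[te]), y[te]) # rank correlation
    corrs.append(rho)

mean_rho = float(np.mean(corrs))                    # reported metric
```

Rank correlation is used rather than, say, mean squared error because judges' absolute scales vary; only the ordering of performances needs to agree.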
We compare our quality assessment against several baselines. Firstly, we compare to both space-time interest points (STIP) and pose-based features with the Discrete Fourier Transform (DFT) instead of the DCT (similar to ). Both of these features performed worse. Secondly, we also compare to ridge regression with all feature sets. Our results show that support vector regression often obtains significantly better performance.
We also asked non-expert human annotators to predict the quality of each diver in the diving dataset. Interestingly, even after we instructed the subjects to read the Wikipedia page on diving, non-expert annotators were only able to achieve a rank correlation of 19%, which is half the performance of support vector regression with pose features. We believe this difference is evidence that our algorithm is starting to learn which human poses constitute good dives. We note, however, that our method is far from matching Olympic judges, since they are able to predict the median judge's score with a rank correlation of 96%, suggesting that there is still significant room for improvement.²

² Olympic diving competitions have two scores: the technical difficulty and the score. The final quality of the action is then the product of these two quantities. Judges are told the technical difficulty a priori, which gives them a slight competitive edge over our algorithms. We did not model the technical difficulty in the interest of building a general system.

Fig. 6: Examples of Diving Scores: We show the two best and worst videos sorted by the predicted score. Each column is one video with ground truth and predicted score written below. Notice that in the last-place video, the diver lacked straight legs in the beginning and did not have a tight folding pose. These two pitfalls are part of common diving advice given by coaches, and our model has learned this independently.

4.3 Limitations

While our system is able to predict the quality of actions with some success, it has many limitations. One of the major bottlenecks is the pose estimation. Fig. 2 shows a few examples of the successes and failures of the pose estimation. Pose estimation in our datasets is very challenging since the performers contort their bodies into many unusual configurations with significant variation in appearance. The frequent occlusion by clothing in figure skating noticeably harms the pose estimation performance. When the pose estimation is poor, the quality score is strongly affected, suggesting that advances in pose estimation, or using depth sensors for pose, could improve our system. Future work in action quality can also be made robust against these types of failures by accounting for the uncertainty in the pose estimation.

Our system is designed to work for one human performer only, and does not model coordination between multiple people, which is often important for many types of sports and activities. We believe that future work in explicitly modeling team activities and interactions can significantly advance action quality assessment. Moreover, we do not model objects used during actions (such as sports balls or tools), and we do not consider physical outcomes (such as splashes in diving), which may be important features for some activities. Finally, while our representation captures the movements of human joint locations, we do not explicitly model their synchronization (e.g., keeping legs together) or repetitions (e.g., waving hands back and forth). We suspect a stronger quality assessment model will factor in these visual elements.

Fig. 7: Diving Feedback Proposals: We show feedback for some of the divers. The red vectors instruct the divers to move their body in the direction of the arrow. In general, the feedback instructs divers to tuck their legs more and straighten their body before entering the pool.
4.4 Feedback for Improvement
In addition to quality assessment, we evaluate the feedback vectors that our method provides. Fig. 7 and Fig. 8 show a qualitative sample of the feedback that our algorithm suggests. In general, the feedback is reasonable, often making modifications to the extremities of the performer.