In order to quantitatively evaluate our feedback method, we needed to acquire ground truth annotations. We consulted with the MIT diving team coach, who watched a subset of the videos in our dataset (27 in total) and provided suggestions on how to improve each dive. The diving coach gave us specific feedback (such as "move left foot down") as well as high-level feedback (e.g., "legs should be straight here" or "tuck arms more"). We translated each piece of feedback from the coach into one of three classes, indicating whether the diver should adjust the upper body, adjust the lower body, or maintain the same pose in each frame. Due to the subjective nature of the task, the diving coach was not able to provide more detailed feedback annotations. Hence, the feedback is coarsely mapped into these three classes.
We then evaluate our feedback as a detection problem. We consider a feedback proposal from our algorithm correct if it suggests moving a body part within one second of the coach making the same suggestion. We use the magnitude of the feedback gradient as the importance of each feedback proposal.
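This detection-style evaluation can be sketched as follows. The function names, data layout, and greedy one-to-one matching below are our own illustrative assumptions, not the paper's released code: proposals are ranked by gradient magnitude, a proposal counts as a true positive if an unmatched coach annotation for the same body-part class lies within one second of it, and average precision is computed over the ranked list.

```python
def evaluate_feedback(proposals, annotations, fps=30.0, window_s=1.0):
    """Sketch of feedback evaluation as detection (hypothetical interface).

    proposals:   list of (frame, part, magnitude) from the algorithm
    annotations: list of (frame, part) from the coach
    Returns average precision (AP) of the ranked proposals.
    """
    window = window_s * fps
    matched = set()
    results = []  # (magnitude, is_correct), in rank order
    # Rank proposals by feedback-gradient magnitude (the "importance").
    for frame, part, mag in sorted(proposals, key=lambda p: -p[2]):
        hit = None
        for i, (a_frame, a_part) in enumerate(annotations):
            # Correct if an unmatched annotation of the same class
            # lies within +/- one second of the proposal.
            if i not in matched and a_part == part and abs(a_frame - frame) <= window:
                hit = i
                break
        if hit is not None:
            matched.add(hit)
        results.append((mag, hit is not None))
    # Average precision over the ranked list.
    tp, ap = 0, 0.0
    for rank, (_, correct) in enumerate(results, start=1):
        if correct:
            tp += 1
            ap += tp / rank
    return ap / max(len(annotations), 1)
```

A proposal that fires on the wrong second or the wrong half of the body simply counts as a false positive at its rank.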
12 Pirsiavash, Vondrick, Torralba

Fig. 8: Figure Skating Feedback Proposals: We show feedback for several figure skaters, where the red vectors are the suggested corrections.
Fig. 9: Feedback Limitations: The feedback we generate is not perfect. If the figure skater or diver were to rely completely on the feedback above, they might fall over. Our model does not factor in physical laws, motivating work in support inference [37, 38].
We use a leave-one-out approach in which we predict feedback on a video held out from training. Our feedback proposals obtain 53.18% AP overall for diving, compared to a 27% AP chance level. We compute chance by randomly generating feedback that uniformly chooses between the upper body and the lower body.
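The chance baseline can be simulated directly. This is a minimal sketch under our own assumptions (the function name and the reduction of each annotation to just its body-part class are illustrative): at each annotated moment we guess uniformly between "upper" and "lower", so each guess matches the coach's class about half the time.

```python
import random

def chance_accuracy(annotation_parts, trials=10000, seed=0):
    """Simulate the chance baseline: uniform guesses over body-part classes.

    annotation_parts: list of ground-truth classes, "upper" or "lower".
    Returns the fraction of random guesses that match the annotation.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        for part in annotation_parts:
            if rng.choice(["upper", "lower"]) == part:
                hits += 1
    return hits / (trials * len(annotation_parts))
```

Ranking these uniform guesses with arbitrary scores and scoring them as detections yields the reported chance-level AP.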
Since our action quality assessment model is not aware of physical laws, its feedback suggestions can be physically implausible. Fig. 9 shows a few cases where, if the performer followed our feedback exactly, they might fall over. Our method's lack of a physical model motivates work in support inference [37, 38].
Interestingly, by averaging the feedback across all divers in our dataset, we can find the most common feedback produced by our model. Fig. 10 shows the magnitude of feedback for each frame and each joint, averaged over all divers. For visualization purposes, we warp all videos to have the same length. Most of the feedback suggests correcting the feet and hands, and the most important frames turn out to be the initial jump off the diving board, the zenith of the dive, and the moment right before the diver enters the water.
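The warp-and-average step can be sketched as follows. The array shapes and linear time-warping are our own assumptions (the paper does not specify the warp): each video yields a (frames x joints) array of feedback magnitudes, every video is linearly resampled onto a common timeline of T frames, and the resampled arrays are averaged over videos to produce the kind of heat map shown in Fig. 10.

```python
import numpy as np

def average_feedback(per_video_magnitudes, T=100):
    """Warp each video's (frames x joints) feedback magnitudes to a common
    length T via linear interpolation, then average over videos."""
    warped = []
    for mags in per_video_magnitudes:
        mags = np.asarray(mags, dtype=float)     # (frames, joints)
        src = np.linspace(0.0, 1.0, mags.shape[0])
        dst = np.linspace(0.0, 1.0, T)
        # Interpolate each joint's magnitude curve onto the common timeline.
        warped.append(np.stack([np.interp(dst, src, mags[:, j])
                                for j in range(mags.shape[1])], axis=1))
    return np.mean(warped, axis=0)               # (T, joints)
```

Row and column means of the returned array give the marginals over frames and joints drawn along the edges of Fig. 10.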
Assessing the Quality of Actions 13
Fig. 10: Visualizing Common Feedback: We visualize the average feedback magnitude across the entire diving dataset for each joint and frame. Red means high feedback and blue means low feedback. The top and right edges show marginals over frames and joints, respectively. R and L stand for right and left, and U and D stand for upper and lower body, respectively. Feet are the most common area for feedback on Olympic divers, and the beginning and end of the dive are the most important time points.
4.5 Highlighting Impact

We qualitatively analyze the video highlights produced by finding the segments that contributed the most to the final quality score. We believe this measure can be useful for video summarization, since it reveals which clips of a long video are the most important for the action's quality. We computed impact on a routine from the figure skating dataset in Fig. 11. Notice that when the impact is near zero, the figure skater is in a standard, upright position or in between maneuvers. The points of maximum impact correspond to the skater's jumps and twists, which contribute positively to the score if the skater performs them correctly, and negatively otherwise.
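Highlight selection can be sketched in a few lines. The interface below is hypothetical (the paper does not publish this code): we assume the quality score is linear in per-segment features, so a segment's impact is the dot product of its feature vector with the learned weights, and highlights are the top-k segments by absolute impact.

```python
import numpy as np

def top_highlights(segment_features, weights, k=3):
    """Rank segments by the absolute contribution ("impact") each makes to a
    linear quality score, and return the top-k as (index, impact) pairs."""
    impacts = [float(np.dot(f, weights)) for f in segment_features]
    ranked = sorted(range(len(impacts)),
                    key=lambda i: abs(impacts[i]), reverse=True)
    return [(i, impacts[i]) for i in ranked[:k]]
```

Signed impacts distinguish segments that raised the score (well-executed maneuvers) from those that lowered it (mistakes), while near-zero impacts correspond to the in-between frames a summary can safely skip.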
4.6 Discussion

If quality assessment is a subjective task, can a machine still obtain reasonable results? Remarkably, independent Olympic judges agree with each other 96% of the time, which suggests that there is some underlying structure in the data. One hypothesis to explain this correlation is that the judges are following a complex system of rules to gauge the score. If so, then the job of a machine quality assessment system is to extract these rules. While the approach in this paper attempts to learn these rules, we are still a long way from high performance on this task.
Fig. 11: Video Highlights: By calculating the impact each frame has on the score of the video, we can summarize long videos with the segments that have the largest impact on the quality score. Notice how, above, when the impact is close to zero, the skater is usually in an upright standard position, and when the impact is large, the skater is performing a maneuver.

5 Conclusions

Assessing the quality of actions is an important problem with many real-world applications in health care, sports, and search. To enable these applications, we have introduced a general learning-based framework to automatically assess an
action's quality from videos, as well as to provide feedback on how the performer can improve. We evaluated our system on a dataset of Olympic divers and figure skaters, and we showed that our approach is significantly better at assessing an action's quality than a non-expert human. Although the quality of an action is a subjective measure, the scores of independent Olympic judges are strongly correlated.
This implies that there are well-defined underlying rules that a computer vision system should be able to learn from data. Our hope is that this paper will motivate more work in this relatively unexplored area.
Acknowledgments: We thank Zoya Bylinskii and Sudeep Pillai for comments and the MIT diving team for their helpful feedback. Funding was provided by an NSF GRFP to CV, and by a Google research award and ONR MURI N000141010933 to AT.

References
1. Gordon, A.S.: Automated video assessment of human performance. In: AI-ED. (1995)
2. Jug, M., Perš, J., Dežman, B., Kovačič, S.: Trajectory based assessment of coordinated human activity. Springer (2003)
3. Perše, M., Kristan, M., Perš, J., Kovačič, S.: Automatic evaluation of organized basketball activity using Bayesian networks. Citeseer (2007)
4. Pirsiavash, H., Ramanan, D.: Detecting activities of daily living in first-person camera views. In: CVPR. (2012)
5. Ke, Y., Tang, X., Jing, F.: The design of high-level features for photo quality assessment. In: CVPR. (2006)
6. Gygli, M., Grabner, H., Riemenschneider, H., Nater, F., Van Gool, L.: The interestingness of images. In: ICCV. (2013)
7. Datta, R., Joshi, D., Li, J., Wang, J.Z.: Studying aesthetics in photographic images using a computational approach. In: ECCV. (2006)
8. Dhar, S., Ordonez, V., Berg, T.L.: High level describable attributes for predicting aesthetics and interestingness. In: CVPR. (2011)
9. Gupta, A., Kembhavi, A., Davis, L.S.: Observing human-object interactions: Using spatial and functional compatibility for recognition. PAMI (2009)
10. Yao, B., Fei-Fei, L.: Action recognition with exemplar based 2.5D graph matching. In: ECCV. (2012)
11. Yang, W., Wang, Y., Mori, G.: Recognizing human actions from still images with latent poses. In: CVPR. (2010)
12. Maji, S., Bourdev, L., Malik, J.: Action recognition from a distributed representation of pose and appearance. In: CVPR. (2011)
13. Delaitre, V., Sivic, J., Laptev, I., et al.: Learning person-object interactions for action recognition in still images. In: NIPS. (2011)
14. Laptev, I., Perez, P.: Retrieving actions in movies. In: ICCV. (2007)
15. Sadanand, S., Corso, J.J.: Action bank: A high-level representation of activity in video. In: CVPR. (2012)
16. Rodriguez, M., Ahmed, J., Shah, M.: Action MACH: a spatio-temporal maximum average correlation height filter for action recognition. In: CVPR. (2008)
17. Efros, A., Berg, A., Mori, G., Malik, J.: Recognizing action at a distance. In: ICCV. (2003)
18. Shechtman, E., Irani, M.: Space-time behavior based correlation. PAMI (2007)
19. Poppe, R.: A survey on vision-based human action recognition. Image and Vision Computing 28(6) (2010) 976–990
20. Aggarwal, J.K., Ryoo, M.S.: Human activity analysis: A review. ACM Computing Surveys (2011)
21. Wang, H., Ullah, M.M., Klaser, A., Laptev, I., Schmid, C.: Evaluation of local spatio-temporal features for action recognition. In: BMVC. (2009)
22. Niebles, J., Chen, C., Fei-Fei, L.: Modeling temporal structure of decomposable motion segments for activity classification. In: ECCV. (2010)
23. Laptev, I.: On space-time interest points. IJCV (2005)
24. Le, Q.V., Zou, W.Y., Yeung, S.Y., Ng, A.Y.: Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In: CVPR. (2011)
25. Wang, J., Liu, Z., Wu, Y., Yuan, J.: Mining actionlet ensemble for action recognition with depth cameras. In: CVPR. (2012)
26. Ekin, A., Tekalp, A.M., Mehrotra, R.: Automatic soccer video analysis and summarization. IEEE Transactions on Image Processing (2003)
27. Khosla, A., Hamid, R., Lin, C.J., Sundaresan, N.: Large-scale video summarization using web-image priors. In: CVPR. (2013)
28. Gong, Y., Liu, X.: Video summarization using singular value decomposition. In: CVPR. (2000)
29. Rav-Acha, A., Pritch, Y., Peleg, S.: Making a long video short: Dynamic video synopsis. In: CVPR. (2006)
30. Ngo, C.W., Ma, Y.F., Zhang, H.J.: Video summarization and scene detection by graph modeling. IEEE Transactions on Circuits and Systems for Video Technology (2005)
31. Jiang, R.M., Sadka, A.H., Crookes, D.: Hierarchical video summarization in reference subspace. IEEE Transactions on Consumer Electronics (2009)
32. Marszalek, M., Laptev, I., Schmid, C.: Actions in context. In: CVPR. (2009)
33. Yang, Y., Ramanan, D.: Articulated pose estimation with flexible mixtures-of-parts. In: CVPR. (2011)
34. Park, D., Ramanan, D.: N-best maximal decoders for part models. In: ICCV. (2011)
35. Drucker, H., Burges, C.J., Kaufman, L., Smola, A., Vapnik, V.: Support vector regression machines. NIPS (1997)
36. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST) (2011)
37. Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor segmentation and support inference from rgbd images. In: ECCV. (2012)
38. Zheng, B., Zhao, Y., Yu, J.C., Ikeuchi, K., Zhu, S.C.: Detecting potential falling objects by inferring human action and natural disturbance. In: ICRA (to appear). (2014)