A lightweight approach to two-person interaction classification in sparse image sequences

Włodzimierz Kasprzak; Van Khanh Do; Paweł Piwowarski

DOI: http://dx.doi.org/10.15439/2022F195

Citation: Proceedings of the 17th Conference on Computer Science and Intelligence Systems, M. Ganzha, L. Maciaszek, M. Paprzycki, D. Ślęzak (eds). ACSIS, Vol. 30, pages 181–190 (2022)

Abstract. A lightweight neural network-based approach to two-person interaction classification in sparse image sequences, based on predetection of human skeletons in video frames, is proposed. The idea is to use an ensemble of “weak” pose classifiers, where every classifier is trained on a different time-phase of the same set of actions. Thus, unlike in typical ensemble classifiers, the expertise of the “weak” classifiers is distributed over time and not over the feature domain. Every classifier is trained independently to classify time-indexed snapshots of a visual action, while the overall classification result is a weighted combination of their results. The training data need no extra labeling effort, as the individual frames are automatically tagged with time indices. The use of pose classifiers for video classification is key to achieving a lightweight solution, as it limits the motion-based feature space in the deep encoding stage. Another important element is the exploration of the semantics of the skeleton data, which turns the input data into reliable and powerful feature vectors. In other words, we avoid spending ANN resources on learning feature-related information that can already be extracted analytically from the skeleton data. An algorithm for merging-elimination and normalization of skeleton joints is developed. Our method is trained and tested on the interaction subset of the well-known NTU-RGB-D dataset \cite{NTU-RGBD}, \cite{NTU-RGBDplus}, but only 2D skeleton data are used, as our ultimate goal is the analysis of video clips (and sequences of video frame-based images) available on the Internet. Our test results show performance comparable to some of the best reported STM- and CNN-based classifiers for this dataset \cite{NTU\_results}. We conclude that by reducing the noise of skeleton data and by proper sampling of a video clip, a successful lightweight approach to visual interaction recognition can be achieved.
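
The time-distributed ensemble described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `assign_time_phase` and `combine_phase_scores`, the equal-width split of a clip into phases, and the toy scores and weights are all assumptions made for illustration. Each per-phase classifier is assumed to return a vector of class probabilities for a frame sampled from its phase.

```python
def assign_time_phase(frame_index, num_frames, num_phases):
    """Map a frame position to a time-phase index in [0, num_phases).

    Assumption: phases are equal-width slices of the clip, so the time
    index follows from the frame position alone (no manual labeling).
    """
    phase = int(frame_index * num_phases / num_frames)
    return min(phase, num_phases - 1)


def combine_phase_scores(phase_scores, weights):
    """Weighted combination of per-phase class-probability vectors.

    phase_scores: one probability vector per time-phase classifier.
    weights: one non-negative weight per classifier (hypothetical values;
             they are normalized here so that they sum to 1).
    """
    total = sum(weights)
    num_classes = len(phase_scores[0])
    return [
        sum(w / total * scores[c] for w, scores in zip(weights, phase_scores))
        for c in range(num_classes)
    ]


# Toy example: 3 time-phase classifiers, 3 interaction classes.
scores = [
    [0.6, 0.3, 0.1],  # classifier for the early phase
    [0.2, 0.7, 0.1],  # classifier for the middle phase
    [0.1, 0.8, 0.1],  # classifier for the late phase
]
weights = [1.0, 1.0, 2.0]  # illustrative weights, not taken from the paper

combined = combine_phase_scores(scores, weights)
predicted_class = max(range(len(combined)), key=lambda c: combined[c])
```

In this toy setting the late-phase classifier carries the most weight, so its vote dominates the combined score. How the weights are actually chosen is not stated in the abstract.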

References

M. Liu and J. Yuan, “Recognizing Human Actions as the Evolution of Pose Estimation Maps,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp. 1159-1168, http://dx.doi.org/10.1109/CVPR.2018.00127.
E. Cippitelli, E. Gambi, S. Spinsante, and F. Florez-Revuelta, “Evaluation of a skeleton-based method for human activity recognition on a large-scale RGB-D dataset,” in 2nd IET International Conference on Technologies for Active and Assisted Living (TechAAL 2016), London, UK, 24-25 October 2016, pp. 1-6, http://dx.doi.org/10.1049/ic.2016.0063.
S. Zhang, Z. Wei, J. Nie, L. Huang, S. Wang, and Z. Li, “A Review on Human Activity Recognition Using Vision-Based Method,” Journal of Healthcare Engineering, Hindawi, vol. 2017, Article ID 3090343, 31 pages, 2017, http://dx.doi.org/10.1155/2017/3090343, https://www.hindawi.com/journals/jhe/2017/3090343/
A. Wilkowski, W. Kasprzak and M. Stefanczyk, “Object detection in the police surveillance scenario,” in Proceedings of the 2019 Federated Conference on Computer Science and Information Systems, ACSIS, vol. 18, 2019, pp. 363-372, http://dx.doi.org/10.15439/2019F291.
A. Stergiou and R. Poppe, “Analyzing human-human interactions: A survey,” Computer Vision and Image Understanding, Elsevier, vol. 188, 2019, p. 102799, http://dx.doi.org/10.1016/j.cviu.2019.102799, https://www.sciencedirect.com/science/article/pii/S1077314219301158
A. Bevilacqua, K. MacDonald, A. Rangarej, V. Widjaya, B. Caulfield, and T. Kechadi, “Human Activity Recognition with Convolutional Neural Networks,” in Machine Learning and Knowledge Discovery in Databases, ECML PKDD 2018, Lecture Notes in Computer Science, vol. 11053, Springer, Cham, Switzerland, 2019, pp. 541-552, http://dx.doi.org/10.1007/978-3-030-10997-4_33.
N. A. Mac and N. H. Son, “Rotation Invariance in Graph Convolutional Networks,” in Proceedings of the 16th Conference on Computer Science and Intelligence Systems, ACSIS, vol. 25, 2021, pp. 81–90, http://dx.doi.org/10.15439/2021F140.
Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, and Y. Sheikh, “OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 1, pp. 172-186, Jan. 2021, http://dx.doi.org/10.1109/TPAMI.2019.2929257.
A. Toshev and C. Szegedy, “DeepPose: Human Pose Estimation via Deep Neural Networks,” in 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 2014, pp. 1653-1660, http://dx.doi.org/10.1109/CVPR.2014.214.
E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, and B. Schiele, “Deepercut: a deeper, stronger, and faster multi-person pose estimation model,” in Computer Vision – ECCV 2016, Lecture Notes in Computer Science, vol. 9910, Springer, Cham, Switzerland, 2016, pp. 34-50, https://doi.org/10.1007/978-3-319-46466-4_3.
H.-D. Duan, J. Wang, K. Chen and D. Lin, “PYSKL: Towards Good Practices for Skeleton Action Recognition,” https://arxiv.org/abs/2205.09443v1[cs.CV], 15 May 2022, https://arxiv.org/abs/2205.09443v1 (accessed on 15.07.2022).
[Online], “Papers with code. Action recognition in videos,” https://paperswithcode.com/task/action-recognition-in-videos, (accessed on 15.07.2022).
R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, “Adaptive mixtures of local experts,” Neural Computation, vol. 3, no. 1, pp. 79–87, March 1991, http://dx.doi.org/10.1162/neco.1991.3.1.79.
A. Shahroudy, J. Liu, T.-T. Ng, and G. Wang, “NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis,” https://arxiv.org/abs/1604.02808[cs.CV], 2016, https://arxiv.org/abs/1604.02808 (accessed on 15.07.2022).
H. Meng, M. Freeman, N. Pears, and C. Bailey, “Real-time human action recognition on an embedded, reconfigurable video processing architecture,” J. Real-Time Image Proc., vol. 3, no. 3, pp. 163–176, 2008, http://dx.doi.org/10.1007/s11554-008-0073-1.
K.G. Manosha Chathuramali and R. Rodrigo, “Faster human activity recognition with SVM,” International Conference on Advances in ICT for Emerging Regions (ICTer2012), Colombo, Sri Lanka, 12-15 December 2012, IEEE, 2012, pp. 197-203, http://dx.doi.org/10.1109/icter.2012.6421415.
X. Yan and Y. Luo, “Recognizing human actions using a new descriptor based on spatial–temporal interest points and weighted-output classifier,” Neurocomputing, Elsevier, vol. 87, pp. 51–61, 15 June 2012, http://dx.doi.org/10.1016/j.neucom.2012.02.002.
R. Vemulapalli, F. Arrate, and R. Chellappa, “Human Action Recognition by Representing 3D Skeletons as Points in a Lie Group,” in 2014 IEEE Conference on Computer Vision and Pattern Recognition, 23-28 June 2014, Columbus, OH, USA, IEEE, 2014, pp. 588-595, doi: 10.1109/cvpr.2014.82.
J. Liu, A. Shahroudy, D. Xu, and G. Wang, “Spatio-Temporal LSTM with Trust Gates for 3D Human Action Recognition,” in Computer Vision – ECCV 2016, Lecture Notes in Computer Science, vol. 9907, Springer, Cham, Switzerland, 2016, pp. 816–833, http://dx.doi.org/10.1007/978-3-319-46487-9_50.
C. Li, Q. Zhong, D. Xie, and S. Pu, “Skeleton-based Action Recognition with Convolutional Neural Networks,” 2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), 10-14 July 2017, Hong Kong, pp. 597-600, http://dx.doi.org/10.1109/ICMEW.2017.8026285.
D. Liang, G. Fan, G. Lin, W. Chen, X. Pan, and H. Zhu, “Three-Stream Convolutional Neural Network With Multi-Task and Ensemble Learning for 3D Action Recognition,” 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 16-17 June 2019, Long Beach, CA, USA, IEEE, pp. 934-940, http://dx.doi.org/10.1109/cvprw.2019.00123.
S. Yan, Y. Xiong, and D. Lin, “Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition,” https://arxiv.org/abs/1801.07455 [cs.CV], 2018, https://arxiv.org/abs/1801.07455, (accessed on 15.07.2022).
M. Li, S. Chen, X. Chen, Y. Zhang, Y. Wang, and Q. Tian, “Actional-Structural Graph Convolutional Networks for Skeleton-Based Action Recognition,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15-20 June 2019, pp. 3590-3598, http://dx.doi.org/10.1109/CVPR.2019.00371.
L. Shi, Y. Zhang, J. Cheng and H.-Q. Lu, “Two-Stream Adaptive Graph Convolutional Networks for Skeleton-Based Action Recognition,” https://arxiv.org/abs/1805.07694v3 [cs.CV], 10 July 2019, http://dx.doi.org/10.48550/ARXIV.1805.07694, https://arxiv.org/abs/1805.07694v3, (accessed on 15.07.2022).
L. Shi, Y. Zhang, J. Cheng, and H.-Q. Lu, “Skeleton-based action recognition with multi-stream adaptive graph convolutional networks,” IEEE Transactions on Image Processing, vol. 29, pp. 9532-9545, October 2020, http://dx.doi.org/10.1109/TIP.2020.3028207.
M. Perez, J. Liu, and A.C. Kot, “Interaction Relational Network for Mutual Action Recognition,” https://arxiv.org/abs/1910.04963 [cs.CV], 2019, https://arxiv.org/abs/1910.04963 (accessed on 15.07.2022).
L-P. Zhu, B. Wan, C.-Y. Li, G. Tian, Y. Hou and K. Yuan, “Dyadic relational graph convolutional networks for skeleton-based human interaction recognition,” Pattern Recognition, Elsevier, vol. 115, 2021, p. 107920, http://dx.doi.org/10.1016/j.patcog.2021.107920.
[Online], “openpose,” CMU-Perceptual-Computing-Lab, 2021, https://github.com/CMU-Perceptual-Computing-Lab/openpose/ , (accessed on 15.07.2022).
[Online], “Keras: the Python deep learning API,” https://keras.io/ , (accessed on 15.07.2022).
[Online], “Keras Tuner,” https://keras-team.github.io/keras-tuner/ , (accessed on 15.07.2022).
T. Yu and H. Zhu, “Hyper-Parameter Optimization: A Review of Algorithms and Applications,” https://arxiv.org/abs/2003.05689 [cs.LG], 12 Mar 2020, https://arxiv.org/abs/2003.05689 , (accessed on 15.07.2022).
J. Liu, A. Shahroudy, G. Wang, L.-Y. Duan, and A. C. Kot, “Skeleton-Based Online Action Prediction Using Scale Selection Network,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 42, no. 6, pp. 1453–1467, 1 June 2020, http://dx.doi.org/10.1109/T-PAMI.2019.2898954.
J. Liu, G. Wang, P. Hu, L.-Y. Duan, and A. C. Kot, “Global Context-Aware Attention LSTM Networks for 3D Action Recognition,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, HI, USA, 21-26 July 2017, pp. 3671-3680, http://dx.doi.org/10.1109/CVPR.2017.391.
J. Liu, G. Wang, L.-Y. Duan, K. Abdiyeva, and A. C. Kot, “Skeleton-Based Human Action Recognition with Global Context-Aware Attention LSTM Networks,” IEEE Transactions on Image Processing (TIP), vol. 27, no. 4, pp. 1586-1599, April 2018, http://dx.doi.org/10.1109/TIP.2017.2785279.