An efficient activity recognition for homecare robots from multi-modal communication dataset

(1) * Mohamad Yani (Tokyo Metropolitan University, Japan)
(2) Yamada Nao (Tokyo Metropolitan University, Japan)
(3) Chyan Zheng Siow (Tokyo Metropolitan University, Japan)
(4) Kubota Naoyuki (Tokyo Metropolitan University, Japan)
*corresponding author


Human environments are designed and managed by humans for humans. Thus, adding robots that interact with humans and perform specific tasks appropriately is an essential topic in robotics research. In recent decades, object recognition, human skeleton recognition, and face recognition frameworks have been implemented to support robot tasks. However, recognizing activities and interactions between humans and surrounding objects remains an open and more challenging problem. Therefore, this study proposed a graph neural network (GNN) approach to directly recognize human activity at home using vision and speech teaching data. The focus was on classifying three activities, namely eating, working, and reading, all conducted in the same environment. Experiments, observations, and analyses showed that this problem is quite challenging to solve using only traditional convolutional neural networks (CNNs) and video datasets. In the proposed method, activity classification was learned from the 3D positions of detected objects in relation to the human position. Human utterances were then used to label the activity for the collected human and object 3D positions. The experiment, involving data collection and learning, was demonstrated through human-robot communication. The proposed method had the shortest training time, 100.346 seconds with 6000 positions from the dataset, and recognized the three activities more accurately than the deep layer aggregation (DLA) and X3D networks with video datasets.
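
As a rough illustration of the data flow described in the abstract, the sketch below turns one observation (a human position, detected object positions, and a spoken utterance) into a labeled graph sample. It is not the authors' implementation: the star-shaped graph, the keyword-based labeling, and the names `label_from_utterance`, `build_star_graph`, and `make_sample` are all assumptions made for this example, and the GNN classifier itself is omitted.

```python
import math
import re

# The three activity classes named in the abstract.
ACTIVITIES = ("eating", "working", "reading")


def label_from_utterance(utterance):
    """Return the first activity word found in the utterance, or None.

    Hypothetical keyword matching; the paper's speech pipeline may differ.
    """
    words = re.findall(r"[a-z]+", utterance.lower())
    for activity in ACTIVITIES:
        if activity in words:
            return activity
    return None


def build_star_graph(human_pos, objects):
    """Build a star graph: node 0 is the human, nodes 1..n are objects.

    `objects` is a list of (name, (x, y, z)) tuples expressed in the same
    coordinate frame as `human_pos`. Each object node stores its offset
    from the human, and each edge stores the human-object distance.
    """
    nodes = [{"name": "human", "offset": (0.0, 0.0, 0.0)}]
    edges = []
    for index, (name, pos) in enumerate(objects, start=1):
        offset = tuple(p - h for p, h in zip(pos, human_pos))
        nodes.append({"name": name, "offset": offset})
        edges.append((0, index, math.dist(pos, human_pos)))
    return {"nodes": nodes, "edges": edges}


def make_sample(human_pos, objects, utterance):
    """Pair a graph built from 3D positions with a label from speech."""
    return {
        "graph": build_star_graph(human_pos, objects),
        "label": label_from_utterance(utterance),
    }


sample = make_sample(
    (1.0, 0.0, 0.0),
    [("cup", (1.0, 0.0, 2.0)), ("laptop", (4.0, 4.0, 0.0))],
    "I am eating now.",
)
```

A sample like this could then be converted into whatever input format a graph learning library expects; that conversion is left out here because the abstract does not specify it.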


Homecare robots; Activity prediction; Graph neural network; RGB-D camera; ROS






This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

International Journal of Advances in Intelligent Informatics
ISSN 2442-6571  (print) | 2548-3161 (online)
Organized by UAD and ASCEE Computer Society
Published by Universitas Ahmad Dahlan
