Empirical study of 3D-HPE on HOI4D egocentric vision dataset based on deep learning

Van Hung Le (Tan Trao University, Viet Nam)
*corresponding author

Abstract


3D hand pose estimation (3D-HPE) is one of the tasks performed on data obtained from an egocentric vision (EV) camera, alongside hand detection, segmentation, and gesture recognition, with applications in fields such as HCI, HRI, VR, AR, healthcare, and support for visually impaired people. In these applications, hand point cloud data obtained from an EV camera is very challenging because the hand is frequently occluded along the viewing direction and by other objects. Our paper presents a comparative study of 3D right-hand pose estimation (3D-R-HPE) on the HOI4D dataset, which was collected and annotated with four cameras. This is a very challenging dataset and was published at CVPR 2022. We use CNNs (P2PR PointNet, Hand PointNet, V2V-PoseNet, and HandFoldingNet - HFNet) to fine-tune 3D-HPE models based on the point cloud data (PCD) of the hand. The resulting 3D-HPE errors are as follows: P2PR PointNet (average error, Erra, of 32.71 mm), Hand PointNet (Erra of 35.12 mm), V2V-PoseNet (Erra of 26.32 mm), and HFNet (Erra of 20.49 mm). HFNet, the most recent of these CNNs (published in 2021), achieves the best results. This estimation error is small enough that the models can be applied to automatically detect, estimate, and recognize hand poses from EV data. HFNet is also the fastest, with an average processing speed of 5.4 fps on a GPU. Detailed quantitative and qualitative results are presented that are beneficial to applications such as human-computer interaction, virtual and augmented reality, and healthcare, particularly in challenging scenarios involving occlusions and complex datasets.
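For reference, the average error (Erra) reported above is the mean Euclidean distance, in millimetres, between the estimated and ground-truth 3D hand joints. The snippet below is a minimal sketch of this metric in Python/NumPy; the function name, array shapes, and the 21-joint hand layout are illustrative assumptions, not details taken from the paper.

import numpy as np

def average_joint_error_mm(pred_joints, gt_joints):
    # pred_joints, gt_joints: (num_frames, num_joints, 3) arrays in millimetres
    # Euclidean distance between each estimated joint and its ground truth
    per_joint_dist = np.linalg.norm(pred_joints - gt_joints, axis=-1)
    # Averaging over joints and frames yields a single Erra value
    return per_joint_dist.mean()

# Illustrative usage with random stand-in data: 100 frames, 21 hand joints
pred = np.random.rand(100, 21, 3) * 50.0
gt = np.random.rand(100, 21, 3) * 50.0
print(f"Erra = {average_joint_error_mm(pred, gt):.2f} mm")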

Keywords


Comparative study; 3D Hand Pose Estimation; HOI4D dataset; Egocentric Vision; Convolutional Neural Networks (CNNs)

   

DOI

https://doi.org/10.26555/ijain.v10i2.1360
      





This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

___________________________________________________________
International Journal of Advances in Intelligent Informatics
ISSN 2442-6571 (print) | 2548-3161 (online)
Organized by UAD and ASCEE Computer Society
Published by Universitas Ahmad Dahlan
W: http://ijain.org
E: info@ijain.org (paper handling issues)
   andri.pranolo.id@ieee.org (publication issues)
