2021, 3(3): 207-234

Published Date: 2021-6-20 DOI: 10.1016/j.vrih.2021.05.002

Survey on depth and RGB image-based 3D hand shape and pose estimation

Abstract

The field of vision-based three-dimensional (3D) human hand shape and pose estimation has attracted significant attention recently owing to its key role in various applications, such as natural human-computer interaction. With the availability of large-scale annotated hand datasets and the rapid development of deep neural networks (DNNs), numerous DNN-based data-driven methods have been proposed for accurate and rapid hand shape and pose estimation. Nonetheless, complicated hand articulation, depth and scale ambiguities, occlusions, and finger similarity remain challenging. In this study, we present a comprehensive survey of state-of-the-art 3D hand shape and pose estimation approaches using RGB-D cameras. Related RGB-D cameras, hand datasets, and a performance analysis are also discussed to provide a holistic view of recent achievements. We also discuss the research potential of this rapidly growing field.

Keywords

Hand survey; 3D hand pose estimation; Hand shape reconstruction; Hand-object interactions; RGB-D cameras

Cite this article

Lin HUANG, Boshen ZHANG, Zhilin GUO, Yang XIAO, Zhiguo CAO, Junsong YUAN. Survey on depth and RGB image-based 3D hand shape and pose estimation. Virtual Reality & Intelligent Hardware, 2021, 3(3): 207-234 DOI:10.1016/j.vrih.2021.05.002
