


2021,  3 (3):   207 - 234

Published Date:2021-6-20 DOI: 10.1016/j.vrih.2021.05.002


The field of vision-based human hand three-dimensional (3D) shape and pose estimation has attracted significant attention recently owing to its key role in various applications, such as natural human-computer interactions. With the availability of large-scale annotated hand datasets and the rapid developments of deep neural networks (DNNs), numerous DNN-based data-driven methods have been proposed for accurate and rapid hand shape and pose estimation. Nonetheless, complicated hand articulation, depth and scale ambiguities, occlusions, and finger similarity remain challenging. In this study, we present a comprehensive survey of state-of-the-art 3D hand shape and pose estimation approaches using RGB-D cameras. Related RGB-D cameras, hand datasets, and a performance analysis are also discussed to provide a holistic view of recent achievements. We also discuss the research potential of this rapidly growing field.


1 Introduction
Vision-based human hand three-dimensional (3D) shape and pose estimation is used to estimate the 3D hand shape surface, 3D hand joint locations, and important hand states based solely on visual observations without the need for additional sensing devices such as markers or gloves. As a long-standing problem in computer vision, its primary role in human behavior analysis and understanding has led to many practical applications.
The hand is one of the most mechanically and anatomically complex parts of the human body. A reconstructed 3D hand shape surface and pose enables touchless gesture-based human-computer interaction (HCI) and can thus serve as an intuitive and immersive interface for virtual reality (VR), augmented reality (AR), interactive games, and computer-aided design (CAD). It can also benefit recognition tasks, such as action recognition and sign language recognition; other potential applications include the imitation-based learning of robotic skills and robot grasping. Despite extensive studies in the literature over the years, it remains a challenging problem owing to complex articulation, large variations in orientation, serious depth and scale ambiguities, severe occlusions, and a lack of characteristic local features.
The earliest methods relied mainly on model-driven approaches that find the best hand shape surface and pose parameters by fitting a 3D articulated hand model[1-6] to the input visual data. Although this approach has been shown to be effective, it requires complex personalized model calibration, is sensitive to initialization, and is susceptible to getting stuck in local minima.
With the availability of a large number of annotated hand datasets captured by commodity depth cameras, such as Microsoft Kinect and Intel RealSense, data-driven approaches have become more advantageous and have led to significant advances in depth-based 3D hand pose estimation. Early work in this area utilized search-based methods, which do not perform well in high-dimensional spaces. The subsequent random forest-based techniques and their variants are limited by the use of handcrafted features. With the advancements in deep learning algorithms in recent years, we can capitalize on the capability of deep learning models to learn sensible hand priors and features in an end-to-end manner given input depth data, which has shown promising results.
Recently, an increasing number of studies have focused on the development of more affordable and convenient RGB-based solutions. Although RGB-based 3D hand analysis is a more challenging task and an inherently ill-posed inverse problem owing to the increased depth and scale ambiguities, previous work has obtained promising results for 3D pose estimation. Recently, for more potential practical applications, a fast-growing body of research literature has focused on dense 3D hand surface reconstruction in addition to solely working on sparse 3D hand joint localization.
Despite the impressive development of 3D hand shape and pose estimation tasks, most existing studies have focused on hand-only scenarios. The field of 3D hand reconstruction under hand-object interaction (HOI) scenarios has received far less attention. However, as we use our hands frequently in daily activities, interaction with surrounding objects has always been an essential part of human daily behavior, especially through hand sensing and manipulating everyday objects. Therefore, the ability to recover 3D hands interacting with objects is also of practical importance for interpreting and imitating human behavior, leading to many applications, including VR/AR, surveillance, tangible computing, grasping, teleoperation, and robotic learning. In recent years, we have also observed increased interest and research in 3D hand analysis under HOI scenarios.
2 Challenge analysis
Despite the significant success achieved in recent years, articulated 3D hand shape and pose estimation remains a largely unsolved problem, and it is being extensively researched by both industry and academia. The issues and ambiguities of this 3D estimation task stem mainly from the following factors.
Articulation. To accurately estimate 3D hand poses, it is important to model the articulation as well as the kinematic structure of the human hand. The human hand consists of five fingers, with each finger forming a kinematic chain of hand joints that each have one or two degrees of freedom (DoFs), as shown in Figure 1. Although the hand is a highly articulated object with many DoFs, it is also inherently structured. The dependencies among joints within the same finger and among joints of different fingers should also be captured for pose estimation. There are an additional six DoFs for the global orientation and position of the hand palm. These six DoFs can be ignored when we focus on the 3D hand pose itself, which corresponds to the local finger motions instead of the global motion.
Orientation. As mentioned earlier, we have six DoFs of the hand palm representing the rotational and translational global hand motions. In addition to the hand rotating itself, it can also be captured in many ways with respect to the camera from a wide range of views. The large variation in the global orientation of the hand should be modeled precisely for accurate 3D pose estimation.
Occlusion. Owing to the 3D-to-two-dimensional (2D) projection, inferring a 3D structure from a monocular single 2D observation is an ill-posed inverse problem, where multiple plausible 3D solutions can be found. An important difficulty is posed by serious occlusion. In the 2D domain, some parts of the observed hand are usually occluded by their own parts or other objects during manipulation, and are thus invisible. Moreover, as mentioned in previous literature[8,9], because human hands are highly articulated and inherently structured, the learning model should be able to extract and integrate features across different scales from the input to generate precise pose results. However, the inability to see certain hand parts causes a loss of useful information and huge ambiguities in precise hand pose estimation, leading to a one-to-many mapping. According to previous works[10-12], the occlusion issue is a major contributor to poor machine performance.
Self-Similarity. Unlike human bodies with clothes, eyes, and mouths in faces, human hands have an almost uniform appearance across the five fingers. This lack of characteristic local features is likely to increase the difficulties with distinguishing different hand parts, including joints and fingers, which may lead to implausible hand pose results.
Depth and scale ambiguities. As an ill-posed inverse problem, it is difficult to estimate the absolute 3D hand pose on a global scale in a camera coordinate system because of the depth and scale ambiguities, especially with a single RGB frame as input data. Specifically, the depth can refer to the depth of the hand surface vertices or pose joints relative to the root joint, as well as to the absolute hand root (palm) location along the depth axis. The scale represents the global scale of the observed hand. Simply relying on learning models to resolve both ambiguities will not usually work, or may lead to serious overfitting to the environment and subjects present in the training data. Previous works either estimate a root-relative and scale-invariant hand representation[15,16] under strong assumptions/priors, such as a given 3D root location and global hand scale, or recover the absolute 3D hand pose up to a scaling factor[17,18]. The problem becomes more challenging when it is combined with cropped hand regions and other issues such as occlusion and self-similarity.
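A root-relative, scale-invariant representation of the kind mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the function name, the choice of joint 0 as the root, and the use of the first bone as the reference length are hypothetical and are not taken from the cited works.

```python
import math

def root_relative_normalized(joints, root=0, ref_bone=(0, 1)):
    """Express 3D joints relative to a root joint and divide by a reference bone length.

    joints:   list of (x, y, z) tuples in camera coordinates.
    root:     index of the root joint (assumption: joint 0).
    ref_bone: pair of joint indices whose distance defines the hand scale (assumption).
    """
    rx, ry, rz = joints[root]
    scale = math.dist(joints[ref_bone[0]], joints[ref_bone[1]])
    if scale == 0:
        raise ValueError("reference bone has zero length")
    return [((x - rx) / scale, (y - ry) / scale, (z - rz) / scale)
            for x, y, z in joints]
```

Two hands that differ only in global position and size map to the same normalized pose, which is why this representation sidesteps the ambiguities but cannot by itself recover the absolute 3D location.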
Noise. In contrast to RGB cameras that passively collect (mostly) collimated beams, depth cameras either actively emit light that bounces back from objects or use stereo vision to compare images of the same object from different angles. However, actively emitted light and stereo vision decay with distance. As a result, depth cameras suffer from a significant amount of noise, which increases as the object moves farther away. Consequently, the point cloud constructed from the depth image will be very inaccurate, and regions on the point cloud will often be missing, notably increasing training and application difficulties.
Clutter. For captured RGB images, human hands can often be blended into their environment, making them difficult to detect.
Limited Datasets. Annotated datasets are critical for deep learning-based methods. However, as discussed later in this survey, each public dataset has its own limitations, such as the scale and variability. There are few datasets that cover all aspects of the 3D hand pose estimation.
3 RGB-D cameras
As stated in a previous work[11], there is more than one way to capture information from which the hand pose can be extracted. For example, cameras can capture the partial (occluded) status of the hand surface to infer the true positions of the joints. Data gloves using bend sensors or haptic sensors collect static or in-motion data on angles or pressure on each area of the hand to predict the relative hand pose. Movement or inertia sensors record the hand's motion within a certain period of time, and use prior knowledge to infer the hand's gesture.
Of the methods used to collect information on hand gestures, cameras are likely to have the widest range of possible applications. In 2020, the global population was 7.8 billion people[19], 6.95 billion of whom possess mobile phones[20], which are likely to have cameras. Furthermore, there are 3.5 billion people using smart-phones[21], which usually have access to both a camera and an architecture that can run computer-vision-related software. As a result, cameras are considered the most affordable and promising method for capturing hand pose information with meaningful applications.
In contrast to RGB cameras that produce images with a 3-channel RGB value at each pixel, depth cameras generate images with a z-axis value (depth) at each pixel position to create a map of varying ranges. There are two popular depth camera designs, one which uses the flight time of light to directly measure distance, while the other uses triangulation.
3.1 Time-of-flight (TOF) cameras
The most straightforward method of creating a range map is to directly measure the range by emitting a wave (sonic, radio, or light) and measuring the time taken for the emitted wave to bounce off the object and return, as is the case with radars and sonars. Time-of-flight depth cameras also use this principle: an array of emitters sends light toward the scene, and a board of sensors (the receiver) placed behind the optics of the depth camera receives the light that is bounced back in order to calculate the distance[22].
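The time-of-flight principle reduces to a single relation: the wave travels to the object and back, so the distance is half the round-trip time multiplied by the propagation speed. A minimal sketch for light follows; the function name is an assumption made for illustration, and real cameras typically measure phase shifts of modulated light rather than timing a single pulse.

```python
SPEED_OF_LIGHT = 299_792_458.0  # metres per second, in vacuum

def tof_distance(round_trip_seconds):
    """Distance to the reflecting surface, given the round-trip travel time of light."""
    if round_trip_seconds < 0:
        raise ValueError("round-trip time must be non-negative")
    # The light covers the camera-to-object distance twice, hence the division by 2.
    return SPEED_OF_LIGHT * round_trip_seconds / 2.0
```

Because a timing error of one nanosecond already corresponds to roughly 15 cm of range, the relation also shows why time-of-flight sensing demands very precise timing electronics.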
3.2 Triangulation (stereo) cameras
Range information can also be extracted with triangulation-based depth cameras based on the disparities between observations from different viewpoints, which is how humans sense depth with two eyes. The type of depth camera that most closely follows this principle is the stereo vision camera, which usually consists of two cameras and compares the visual dissimilarities of the same object in the two images, either actively with a light emitter or passively. Another type of triangulation depth camera is the structured light camera. This type of depth sensor has only one camera, but it actively projects a light pattern onto the scene using a laser projector and infers depth by triangulating between the projected pattern as seen by the camera and the original pattern.
3.3 Intel RealSense and Microsoft Kinect
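For a rectified stereo pair, the triangulation described above is commonly written as Z = f * B / d, where f is the focal length in pixels, B is the baseline between the two cameras, and d is the disparity in pixels. The sketch below uses this standard pinhole relation; the function name and the example numbers are illustrative assumptions rather than values from any specific camera.

```python
def stereo_depth(focal_px, baseline_m, disparity_px):
    """Depth in metres of a point seen by a rectified stereo pair.

    focal_px:     focal length expressed in pixels.
    baseline_m:   distance between the two camera centres in metres.
    disparity_px: horizontal shift of the point between the two images, in pixels.
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_px * baseline_m / disparity_px

# Effect of a one-pixel disparity change, near versus far (illustrative numbers).
near_step = stereo_depth(600.0, 0.05, 59.0) - stereo_depth(600.0, 0.05, 60.0)
far_step = stereo_depth(600.0, 0.05, 5.0) - stereo_depth(600.0, 0.05, 6.0)
```

Since depth is inversely proportional to disparity, the same one-pixel matching error produces a much larger depth error for distant points (`far_step`) than for close ones (`near_step`).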
One of the most widely used depth camera series for hand pose estimation is Intel RealSense. Using either active or passive IR stereo sensors, this line of depth camera products boasts high close-range accuracy, which is ideal for capturing precise hand gestures[23]. Another well-known depth camera family that contributed to the widespread affordability and availability of depth sensing is the Microsoft Kinect series, as shown in Figure 4, which first used a structured light stereo sensor and later switched to the time-of-flight method. Originally made and priced for the gaming market rather than a professional setting, Kinect offers great value as it comes with depth sensors, RGB cameras, and microphones[24].
3.4 Depth camera limitations and noise
Unlike RGB cameras, which can have an almost unlimited imaging range as long as visibility allows, common consumer-grade depth cameras, such as Microsoft Kinect 2.0, have a depth range of 0.5 to 4.5 m[26]. Such a narrow imaging range makes it difficult to construct training datasets and to build real-world applications. More recently, some depth cameras have been designed with a depth range of up to 18 m[27]; however, such cameras can be more expensive or somewhat difficult to integrate into existing systems, and their depth measurements are often extremely noisy.
As mentioned in Section 2, noise is another technical constraint of depth cameras. Because these cameras rely on actively emitted light or on stereo vision, both of which degrade with distance, the noise increases as the object moves farther away. Consequently, the point cloud constructed from the depth image can be very inaccurate and often has missing regions, notably increasing training and application difficulties.
In contrast to occlusion and pose variations, technical limitations of depth cameras can often be improved with better or newer technology. However, such improvements often come at a cost, either in equipment investment or time.
4 State-of-the-art depth image-based approaches
Generally, state-of-the-art 3D hand pose estimation approaches using depth images can be categorized into two groups, namely generative (model-based) and discriminative (learning-based) paradigms. The generative approach requires a 3D hand model built according to prior knowledge of the anatomy of the hand. The model is optimized iteratively to fit the input hand shape during the hand pose estimation procedure. Discriminative methods generally run in a data-driven manner. That is, the relationship between the input depth image and 3D hand joint positions is directly built via regression or detection. This paradigm was recently facilitated by deep learning technology owing to its strong pattern-fitting capacity. Deep learning-based approaches are the primary focus of this survey. However, they still suffer from a data-hungry problem with a potential overfitting risk.
4.1 Generative approaches
In the generative paradigm, a 3D hand model is built to satisfy the hand morphology constraints in the form of a skeleton model[1,2], sphere model[3,4], triangulated mesh model[5], cylindrical model[6], etc. The parameters within these models are usually initialized using the pose from the previous frame. A nonconvex energy function is then defined to measure the discrepancy between the live human hand and the chosen hand model. The live hand and the hand model are often characterized using hand-crafted visual features from the perspectives of depth value, edge, silhouette, shading, and optical flow[5,28-32]. During the pose optimization procedure, the energy function is iteratively minimized. Particle swarm optimization (PSO)[33], iterative closest point (ICP)[34], and other nonlinear optimization algorithms[35] are often used to solve this problem. The main defects of generative approaches are threefold. First, the performance is sensitive to model initialization. Second, the online running efficiency of most generative approaches is not sufficiently high for practical application scenarios. Third, the generalization ability cannot be guaranteed owing to the limitations of the handcrafted 3D hand model.
Recently, there has been more research focus on deep learning-based discriminative methods. Under this framework, neither careful 3D hand model initialization nor online iterative optimization is required. In addition, these methods achieve state-of-the-art performance on several 3D hand pose estimation datasets, showing strong potential for practical applications in terms of both effectiveness and efficiency. Next, we introduce and discuss in detail deep learning-based discriminative methods.
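The iterative energy minimization described above can be illustrated with a bare-bones particle swarm optimizer. This is a generic sketch, not the procedure of any cited tracker: the energy here is a toy quadratic standing in for the real model-to-observation discrepancy, and all names, bounds, and coefficients (`pso_minimize`, `toy_energy`, `TARGET_POSE`, the inertia and attraction weights) are assumptions chosen for illustration.

```python
import random

TARGET_POSE = [0.3, -0.8, 1.1]  # hypothetical "true" joint angles in radians

def toy_energy(pose):
    """Stand-in for the discrepancy between the hand model and the observed hand."""
    return sum((p - t) ** 2 for p, t in zip(pose, TARGET_POSE))

def pso_minimize(energy, dim, bounds, n_particles=30, n_iters=200, seed=0,
                 inertia=0.7, c_personal=1.5, c_global=1.5):
    """Minimize `energy` over a box with a basic global-best particle swarm."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]            # best position seen by each particle
    best_val = [energy(p) for p in pos]
    g = min(range(n_particles), key=best_val.__getitem__)
    g_pos, g_val = best_pos[g][:], best_val[g]  # best position seen by the swarm
    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity mixes momentum, pull to personal best, pull to global best.
                vel[i][d] = (inertia * vel[i][d]
                             + c_personal * r1 * (best_pos[i][d] - pos[i][d])
                             + c_global * r2 * (g_pos[d] - pos[i][d]))
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = energy(pos[i])
            if val < best_val[i]:
                best_val[i], best_pos[i] = val, pos[i][:]
                if val < g_val:
                    g_val, g_pos = val, pos[i][:]
    return g_pos, g_val
```

Calling `pso_minimize(toy_energy, dim=3, bounds=(-1.5, 1.5))` returns the best pose found and its energy. In a real tracker the energy is nonconvex and expensive to evaluate, which is one source of the initialization sensitivity and runtime cost noted in the text.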
4.2 Discriminative approaches: Regression-based
In this paradigm, the 3D hand pose estimation task is formulated as a nonlinear regression problem, where the mapping from input depth data to output joint coordinates is directly learned. End-to-end learning with deep neural networks is generally conducted to directly predict continuous joint coordinates[36-42], linking the procedures of feature extraction and regression.
When conducting regression operations on depth images directly, 2D CNNs are often used. They have already achieved great success for RGB-based 2D human pose estimation tasks[43-47]. Because the depth image has the same dimensionality as the RGB counterpart, a 2D CNN can also be applied to it. The baseline approach involves conducting global regression in a one-stage manner[36,37]. However, this tends to lose fine visual clues, which may weaken the performance. One solution is to execute local refinement via multi-stage cascaded regression[36,37,40], in addition to the initial global regression results. Another method is to extract multiple local region features characterized by individual fully connected (FC) layers, and to fuse the local features via concatenation as the ensemble feature for global regression[38]. This can be regarded as a feature-level local ensemble method. Essentially, the fine visual clues can be better maintained by the local regions, and local feature concatenation aims to capture global context information. Recently, A2J[48] was proposed to conduct decision-level local ensemble learning as a one-stage method. In particular, each local region first predicts the 3D positions of the hand joints individually by estimating their position offsets relative to pre-defined anchor points. Then, the prediction results from multiple local regions are aggregated by linearly weighted voting. The entire procedure can be trained in an end-to-end manner. Owing to the introduction of the ensemble learning mechanism, the generalization ability can be enhanced significantly, especially for unseen cases.
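The anchor-based weighted voting described above can be sketched in a few lines. This is a simplified illustration rather than the A2J implementation: the softmax normalization of the anchor responses, the function name, and the 2D example are assumptions, and in a real network the offsets and responses would be produced by convolutional branches.

```python
import math

def aggregate_anchor_votes(anchors, offsets, responses):
    """Combine per-anchor joint predictions into one joint position.

    anchors:   list of anchor coordinates, e.g. (x, y) tuples.
    offsets:   predicted offset from each anchor to the joint (same shape as anchors).
    responses: one raw score per anchor; higher means the anchor is more informative.
    """
    peak = max(responses)
    exps = [math.exp(r - peak) for r in responses]   # numerically stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    dims = len(anchors[0])
    # Each anchor votes for (anchor + offset); votes are averaged with the weights.
    return tuple(
        sum(w * (a[k] + o[k]) for w, a, o in zip(weights, anchors, offsets))
        for k in range(dims)
    )
```

When every anchor votes for the same location, the aggregate equals that location regardless of the weights; when the votes disagree, the weights decide how much each local prediction contributes.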
Meanwhile, to better capture the hand's 3D characteristics, the depth image can also be transformed into a point set or voxel form. Thus, 3D hand pose estimation via regression can be facilitated using 3D deep learning methods. Representative examples include 3D CNNs[49-51] and point-set networks[52,53]. Under the 3D CNN framework, the depth image is first voxelized into a volumetric representation (e.g., occupancy grid model[54] or projective D-TSDF volume[41]). Then, the 3D visual characteristics of the joints are captured using 3D convolution. In spite of the performance enhancement, a 3D CNN generally has a much larger model size than its 2D counterparts. As a result, it is relatively more difficult to train and suffers from a higher overfitting risk. On the other hand, recently emerged point-set networks (e.g., PointNet[52] or PointNet++[53]) are much more lightweight than 3D CNNs and even 2D CNNs, but still have a strong 3D pattern representation capacity. With a point-set network, the depth image is transformed into a point cloud in advance as the input. Some preprocessing procedures (e.g., point sampling and k-nearest-neighbor search) are required in these approaches[55,56]. Under both the 3D CNN and point-set network frameworks, the regression operation is executed using fully connected layers. In addition to the point set or voxel form, another way to involve richer 3D descriptive clues is to conduct multi-view projection on depth images[41]. A multi-view 2D CNN model was then proposed to fuse the multi-view information.
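The conversion of a depth image into a point cloud mentioned above is usually a pinhole back-projection. The following sketch assumes known camera intrinsics (`fx`, `fy`, `cx`, `cy`) and assumes that a depth value of zero marks a missing measurement; both conventions vary between sensors and datasets.

```python
def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth image (list of rows) into a list of (x, y, z) points.

    depth:  2D list where depth[v][u] is the range along the optical axis.
    fx, fy: focal lengths in pixels; cx, cy: principal point in pixels.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue  # assumed convention: non-positive depth means no reading
            # Pinhole model: pixel offset from the principal point, scaled by depth.
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points
```

The resulting list can then be subsampled to a fixed number of points before being passed to a point-set network, which is the kind of preprocessing the text refers to.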
4.3 Discriminative approaches: Detection-based
Detection-based approaches generally conduct dense predictions on depth images, point sets, or voxel sets, to produce a dense probability map (i.e., heatmap) for each joint, including 2D heatmaps[1,8,57-59] and 3D heatmaps[51,60,61]. To this end, a few deconvolution layers are added to the backbone 2D CNN, 3D CNN, and point-set networks to estimate the heatmap for each joint. In particular, fully convolutional network (FCN)[8] (2D CNN-based), V2V[51] (3D CNN-based), and P2P[60] (point-set network-based) are the representative encoder-decoder based dense prediction networks for 3D hand pose estimation. It is worth noting that richer spatial contextual information can be maintained by the encoder-decoder network structure. Consequently, detection-based approaches are usually more accurate than regression-based approaches[62]. However, the embedded deconvolution operation is time-consuming. Another essential issue is that most encoder-decoder based methods cannot be fully end-to-end trained for joint position prediction[47].
Hybrid methods that combine the advantages of the regression-based and detection-based paradigms have also been proposed. An integral regression loss function[63] was proposed to measure the distance between the predicted heatmaps and human-labeled joint coordinates. It enables end-to-end training of the encoder-decoder network. A2J[48] predicts the 3D joint position by aggregating the offset estimation results of multiple anchor points. Setting the anchor points with different aggregation weights can be regarded as an approximate sparse heatmap representation. HandVoxNet[64] follows V2V to predict 3D joint heatmaps, and it then splits this outcome into voxelized hand shape and shape surface representations. The 3D hand shape is accurately regressed by learning the mapping of the shape surface onto the voxelized hand shape. HandMap[65] proposed leveraging dense heatmaps through intermediate supervision in a regression-based framework.
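The idea behind integral regression can be sketched as a soft-argmax: instead of taking the non-differentiable location of the heatmap maximum, the joint coordinate is computed as the expectation of the pixel coordinates under the normalized heatmap. The code below is a minimal 2D illustration; the softmax normalization and the function name are assumptions, not details taken from the cited work.

```python
import math

def soft_argmax_2d(heatmap):
    """Expected (x, y) pixel location under a softmax-normalized 2D heatmap."""
    peak = max(h for row in heatmap for h in row)
    exps = [[math.exp(h - peak) for h in row] for row in heatmap]  # stable softmax
    total = sum(sum(row) for row in exps)
    # Weighted average of column indices (x) and row indices (y).
    x = sum(u * e for row in exps for u, e in enumerate(row)) / total
    y = sum(v * e for v, row in enumerate(exps) for e in row) / total
    return x, y
```

Because the output is a smooth function of every heatmap value, a coordinate loss can be backpropagated through it, and the estimate is not restricted to integer pixel positions as a hard argmax would be.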
4.4 Discriminative approaches: Structural constraint
The human hand is a highly articulated 3D structure. Generative methods embed prior knowledge of hand anatomy into the predefined hand model. Accordingly, some kinematically implausible pose estimation results can be avoided[3-6]. However, under the data-driven discriminative framework, any predicted joint combination may appear if no prior structural constraint of the hand is imposed, especially on a relatively small-scale dataset.
To address this, research efforts have been made to embed the hand structural constraint into discriminative approaches. Some studies have focused on physical hand motion constraints[2,36,37,55,60,66-70]. For example, principal component analysis (PCA) is used to project the raw feature into a lower dimensional space for more robust 3D hand pose estimation[36,37,55,60]. PCA can compress the dimensionality of the feature while still preserving the main intrinsic information. Thus, the effect of fine noisy patterns can be suppressed to better capture the principal 3D hand structure. Meanwhile, some other efforts have focused on incorporating hand kinematics clues by learning the latent distribution of hand joints[68] or designing a forward kinematics-based layer[2,69] to ensure the geometric validity of estimated hand poses. Physical constraints on natural hand motion and deformation are also considered as regularization for defining the loss function[66]. Recently, the transformer model[71], originally developed for natural language processing, was introduced to 3D hand pose estimation in order to better exploit the structural correlations among the hand joints[67].
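A forward kinematics layer guarantees geometric validity because the network predicts joint angles, while the joint positions are computed from fixed bone lengths. The planar, single-finger sketch below illustrates the principle only; the function name and the 2D simplification are assumptions, and the cited layers operate on full 3D hand skeletons.

```python
import math

def finger_forward_kinematics(base, bone_lengths, joint_angles):
    """Joint positions of a planar finger chain from relative joint angles.

    base:         (x, y) position of the finger root.
    bone_lengths: length of each bone, from root to tip.
    joint_angles: rotation in radians at each joint, relative to the previous bone.
    """
    x, y = base
    heading = 0.0
    joints = [(x, y)]
    for length, angle in zip(bone_lengths, joint_angles):
        heading += angle                    # angles accumulate along the chain
        x += length * math.cos(heading)
        y += length * math.sin(heading)
        joints.append((x, y))
    return joints
```

Whatever angles are supplied, the distance between consecutive joints always equals the corresponding bone length, so implausible results such as stretched or shrunken fingers cannot occur.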
4.5 Discriminative approaches: Multi-stage prediction
It is challenging to obtain an accurate joint location with only one forward pass because the hand pose is essentially articulated, and the mapping function between the input depth image and joint coordinates is highly nonlinear. One way to enhance the performance is to conduct multi-stage prediction with iterative refinement. DeepPrior[36,37] consists of two estimation stages. In particular, the first stage conducts a global regression using a 2D CNN. Based on this initial result, local refinement is independently executed for each joint to improve accuracy. SRN[72] proposed a simple and effective method for 3D hand pose estimation from a 2D depth image by stacking multiple differentiable re-parameterization modules, which construct 3D heatmaps and unit vector fields from joint coordinates directly. One reason for the improved performance of multi-stage approaches compared with their single-stage counterparts is the use of intermediate supervision. This provides strong auxiliary information to guide deep neural network tuning. Meanwhile, the prediction results from the previous stage can be regarded as the output anchors for the upcoming stage. Thus, the search complexity of the output space can be substantially reduced. In other words, the challenging nonlinear 3D hand pose estimation task is divided into relatively simple sequential subtasks.
4.6 Discriminative approaches: Ensemble prediction
Nearly all the existing deep learning methods for 3D hand pose estimation benefit from the strong nonlinear fitting capacity of neural networks, but they also suffer from intrinsic overfitting risks. The model ensemble is a widely used machine-learning technique to alleviate overfitting for performance enhancement[73,74]. The ensemble learning idea has also been applied to 3D hand pose estimation. One attempt is to decompose the pose parameters into three per-pixel estimations, that is, 2D heat maps, 3D heat maps, and unit 3D directional vector fields[8]. By estimating the 2D and 3D outputs jointly during the training phase, the 2D and 3D geometric characteristics within the depth image can be simultaneously considered to improve performance. For online inference, the mean shift approach aggregates all the pixel-wise estimation evidence into holistic 3D joint coordinates in an ensemble manner. Hand branch ensemble (HBE)[75] proposed an ensemble of features through a three-branch network structure. A2J[48] aggregates the pose estimation results from densely set anchor points via soft voting for better generalization. JGR-P2O[76] decomposed the 3D hand pose into the joints' 2D image plane coordinates and depth values. The estimation model predicts these parameters in an ensemble manner, which is realized by the end-to-end trainable pixel-to-offset module. Recently, 2D directional unit vector fields and closeness heatmaps were used by AWR[77] to enhance joint offset prediction. The paradigm of predicting the joint offset instead of coordinates can actually help to simplify 3D hand pose estimation and has been widely applied[36,51,59,62-67].
Another ensemble learning stream involves conducting a model ensemble by post-processing. The ensemble strategies include training epoch ensembles[51], backbone network ensembles, and test-time augmentation[62].
4.7 Discriminative approaches: Synthetic data
Generally, all deep learning-based 3D hand pose estimation approaches suffer from data-hungry problems that weaken generalization. One feasible way of addressing this is to use synthetic data to avoid high-burden human annotation work because the 3D joint positions in synthetic data are known in advance. However, the artifacts of synthetic data make them distinguishable from real data. Consequently, deep learning models that are trained on synthetic data may not generalize well to real data. To address this, a generative adversarial network (GAN)[78] was applied to enhance the realism of synthetic data[79]. Domain adaptation between synthetic and unlabeled real data has also been conducted by training an end-to-end network[80]. MANO[81] is a recently proposed parameterized hand model for synthetic data generation, learned from approximately 1000 high-resolution 3D scans of hands in a wide variety of hand poses. It has also been successfully applied to the HANDS19 challenge[82]. Domain transfer between synthetic and real data is also achieved by learning the feature mapping of the two domains[83]. During this phase, feature learning and pose estimation are jointly considered.
4.8 Discriminative approaches: Weakly supervised, self-supervised, and semi-supervised learning
Another way to alleviate the data-hungry problem is to conduct weakly supervised or self-supervised learning. This tends to make better use of unlabeled or weakly labeled data. To this end, a depth regularizer was proposed to transfer knowledge from a fully annotated synthetic dataset to a weakly labeled real-world dataset[84]. In particular, depth maps are generated from the predicted 3D pose, and serve as weak supervision for 3D pose regression. Meanwhile, self-supervised learning is conducted to exploit the complementary benefits of unsupervised model-fitting and discriminative approaches by designing differentiable model-fitting terms[85]. SO-HandNet[86] constructed a weakly supervised pose estimator that uses a hand feature encoder to extract multi-level features from the hand point cloud and then fuses them to regress the 3D hand pose. CrossingNets[87] modeled the statistical relationships between 3D hand poses and corresponding depth maps using two deep generative models with a shared latent space. It runs in a semi-supervised learning manner and benefits from unlabeled data.
5 Depth datasets
In deep learning pipelines, datasets are as important as the model itself. Models leaning towards a broader task usually require datasets that are as comprehensive as possible to cover more real-world possibilities, while more specialized models can sometimes achieve better results if the training data are also specialized to minimize interference. Consequently, numerous datasets are constructed for hand pose estimation, some of which are captured from real-world subjects with depth cameras, while others are synthesized. In this section, we list depth datasets ordered by year in Table 1, introduce several popular depth datasets, and discuss some solutions to improve data and datasets.
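Depth datasets
| Dataset | Joints | No. Subjects | No. Frames | RGB | Year |
| --- | --- | --- | --- | --- | --- |
| ContactPose[88] | 21 | 50 | 2.9 Million | Yes | 2020 |
| SynHand5M[91] | Mesh | N/A | 5 Million | No | 2018 |
| BigHand2.2M[92] | 21 | 10 | 2.2 Million | No | 2017 |
| Hands in Action[98] | N/A | N/A | N/A | No | 2014 |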
5.1 NYU hand pose dataset
One of the most famous depth-based HPE datasets is the NYU Hand Pose Dataset (NYU). A total of 72,757 depth frames were captured from a single person for the training dataset, whereas 8,252 were captured from two people for the test dataset. Three Kinects were used to capture each frame, one from the front and two on the sides[1].
5.2 ICVL dataset
The ICVL dataset was captured using a Creative Interactive Gesture Camera on 10 subjects with different hand sizes, and it was then manually refined on the basis of "Dynamics based 3D skeletal hand tracking." As a result, the original ICVL dataset includes 180,000 annotated training frames and 1000 testing frames, as well as more than 330,000 frames combined after the update[99].
5.3 MSRA15 dataset
MSRA15, or the MSRA hand gesture database, captures 17 different gestures on the right hands of nine test subjects using a Creative Interactive Gesture Camera. With roughly 500 frames saved for every gesture, the dataset contains 76,375 frames in total, with 21 joints from a third-person perspective[97].
5.4 BigHand2.2M dataset
Used by the well-known 2017 Hands in the Million Challenge, BigHand2.2M is different from traditional synthetic or manually annotated datasets in its usage of a tracking system with multiple six-dimensional (6D) magnetic sensors attached to the fingers in conjunction with a depth camera to automatically capture annotated frames. Without the need to annotate by hand, the BigHand2.2M dataset contains 2.2 million depth frames, and is one step closer to covering the full range of human hand articulation[92].
5.5 Rendered handpose dataset (RHD)
Datasets that are built using real data often suffer from insufficient data variations and inaccurate annotations. To complement these real datasets, the rendered handpose dataset (RHD) uses 3D models rendered using Blender to create accurate synthetic data in an attempt to cover rare poses with better accuracy. As a result, the RHD includes 20 character models and 39 actions, and it offers 41,258 training frames as well as 2,728 testing frames[15].
5.6 Dataset improvement solutions
Several metrics should be considered when qualitatively evaluating a depth dataset for hand pose estimation. The first metric is the number of frames that are included in the depth dataset; in general, the more unique frames are used in training, the better the training result. The same can be said for the training process of most deep learning models.
The second metric is the variety of hand poses. The human hand is a high DoF articulation model that can perform almost countless poses, some of which are more frequently used than others. Rare hand poses that are hardly seen and captured in the training data cannot be learned well by the model; hence, a lack of hand pose variety in training will lead to poor performance on test data or real-world applications. In addition, human hands can also interact with various kinds of objects and appear in different backgrounds in the training dataset, so it is also important to have a variety of objects or backgrounds, if they are present.
Another metric is the quality or accuracy of data annotation. Many datasets and models consider the human hand as an articulation model of fingers and palm; in essence, groups of bones. In contrast, when the dataset is not synthetic, RGB or depth images captured for training see only the human skin and not the bones. This leads to the problem of annotation inaccuracy because a human or machine annotator (with data gloves) does not have ground truth bone positions, and can only provide an estimation. Moreover, different human data annotators can have different estimations, and it is challenging to establish a standard.
The final metric that we discuss is synthetic versus real datasets. As mentioned above, real depth datasets inherently have a limited number of data frames, contain few rare hand poses, and are prone to inaccurate annotations. In comparison, synthetic datasets that are created in simulated environments can have many more data frames as well as hand poses, and can present annotations under the same standards that are closer to the ground truth. However, synthetic datasets may differ from real data, resulting in poor generalization capability for the model.
In response to the above metrics, several data augmentation and dataset improvement solutions have been used in practice.
Acquisition of more real data. One straightforward way to improve any dataset is to collect more data. One of the most important characteristics of a dataset is the number of samples in the training and test sets. In addition, because real data are usually captured using depth cameras in video format or streams of depth images, it is not difficult to simply amass raw depth data frames on the human hand. However, it may be difficult to acquire many high-quality frames that are suitable for hand-pose estimation training.
The first difficulty is the requirement for variety and diversity. The human hand itself varies significantly from person to person with respect to its size, length, width, build of fingers and palm, muscles, etc., and finding enough test subjects that reasonably cover all possible differences can be difficult. In addition, some tasks require object interaction or background noise cancelation; hence, numerous types of objects, poses, and backgrounds need to be used when building datasets for these tasks. As a result of this difficulty, even though some existing depth datasets contain up to several hundreds of thousands or even millions of frames, such as ICVL[99] and BigHand2.2M[92], the most well-known real datasets on HPE using depth images are captured on fewer than a dozen unique test subjects, with the exception of ASTAR datasets using 30 volunteers[101].
The second difficulty associated with collecting more real data is the annotation process. In many computer vision datasets, data annotations are completed by human data annotators, so the construction of large-scale datasets requires significant time and capital investments. However, an advantage that is unique to collecting hand data is the availability of data gloves and magnetic sensors, which can collect the positions of joints or fingers. To this end, several existing depth datasets, such as ASTAR[101], HandNet[97], FHAD[14], and BigHand2.2M[92], use this technology to generate or assist the process of making annotations. However, assisted methods such as data gloves and magnetic sensors also have disadvantages, such as the high cost of equipment, inaccurate annotations, and camera obstruction. Because of the two difficulties listed above, the annotated large-scale datasets are relatively limited.
Synthetic data. Synthetic data is a potential solution to the difficulty of acquiring an annotated real dataset with high quality. In contrast to real depth HPE data that are captured using depth cameras and test subjects, synthetic data are usually captured in a simulated environment with 3D hand models and virtual depth cameras. Synthetic datasets have the advantage of almost unlimited data capture and annotation capability, with diverse sets of virtual test subjects, objects, and backgrounds. However, virtual 3D models cannot yet fully match the change in shape during hand articulation, and synthetic depth images are still different from real inputs.
Several depth-exclusive HPE datasets use synthetic data. SynHand5M[91] is a synthetic dataset that focuses on hand shape and pose reconstruction, and it provides depth hand data that are abundantly diverse in shapes and poses, more so than some existing real datasets. Using generated hand models and virtual cameras, SynHand5M[91] can provide 4.5 million training frames as well as 0.5 million testing frames. The MSRC[95] dataset is another synthetic dataset that uses a different approach, where the prior distribution of hand poses is examined and used to prevent the generation of impossible poses and to ensure that common poses have the best coverage. Using hand mesh models and computer renderers, MSRC can offer 100k annotated depth images for training.
RGB-D datasets that use synthetic data to capture both RGB and depth frames are more common. Ego3DHands[89], SynthHands[90], and RHD[15] all use realistic 3D models and computer renderers to capture RGB images and depth maps, and they use the model to extract ground truth annotations.
Parallel to 3D model renderings, model-based generative methods employ adversarial networks to generate high-fidelity synthetic depth images. One example is SimGAN, which uses "Simulated+Unsupervised (S+U) learning" to fit a model that uses unlabeled real data to improve the realism of synthetic data relative to real data. Using real hand depth maps from the NYU[1] hand pose dataset, SimGAN[79] successfully added realistic noise to synthetic frames to better imitate imperfect real frames that are captured by depth cameras.
Data augmentation and repositioning. One of the most commonly used solutions to limited raw data and annotations is data augmentation. Popular computer vision frameworks have hundreds of millions of parameters, which in turn require datasets with more than one million images to train. Suffering from the constraints discussed above, real depth datasets for hand pose estimation often do not have such privileges, and the need for data augmentation is therefore especially important.
Traditionally, data augmentation employs several augmentation techniques, such as rotation, scaling, flipping, cropping, translation, multi-view projection, and random erase. More recently, model-based augmentation methods have emerged, such as HandAugment[82], which takes a depth image input and uses a neural network to augment the image to crop out unwanted regions, such as the forearm, to improve hand pose estimation accuracy.
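As a minimal sketch of the traditional geometric augmentations listed above, the following example flips a depth image and rotates its 2D joint annotations. The function names, the list-of-rows image layout, and the toy values are assumptions made for illustration; they are not taken from any of the cited methods.

```python
import math

def flip_horizontal(depth, joints_uv):
    """Mirror a depth image (list of rows) and its 2D joint annotations."""
    width = len(depth[0])
    flipped = [list(reversed(row)) for row in depth]
    # A joint in column u moves to column (width - 1 - u).
    flipped_joints = [(width - 1 - u, v) for u, v in joints_uv]
    return flipped, flipped_joints

def rotate_joints(joints_uv, angle_deg, center):
    """Rotate 2D joints about `center` in the (u, v) plane.

    The depth image itself must be warped with the same rotation
    so that the pixels and the annotations stay aligned.
    """
    a = math.radians(angle_deg)
    cu, cv = center
    rotated = []
    for u, v in joints_uv:
        du, dv = u - cu, v - cv
        rotated.append((cu + du * math.cos(a) - dv * math.sin(a),
                        cv + du * math.sin(a) + dv * math.cos(a)))
    return rotated

# Hypothetical 2x3 depth image (values in meters) with two annotated joints.
depth = [[0.5, 0.6, 0.7],
         [0.8, 0.9, 1.0]]
joints = [(0, 1), (2, 0)]
flipped_depth, flipped_joints = flip_horizontal(depth, joints)
rotated_joints = rotate_joints([(1.0, 0.0)], 90, (0.0, 0.0))
```

Note that a horizontal flip turns a right hand into a left hand, so any handedness label has to be swapped along with the pixels and joints.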
Another possible approach is to repurpose data by converting other datasets into a depth dataset for hand pose estimation. HUMBI[103] is one such multi-view mesh human body dataset that has 772 vastly different human data samples. Using 107 spatially calibrated cameras, the dataset captures high-accuracy human pose representations in mesh geometry that can be modified for hand pose estimation using depth data.
6 State-of-the-art RGB image-based approaches
The development of depth-based methods has led to rapid development in 3D hand analysis. However, depth sensors are usually expensive and limited by energy consumption, and they capture high-quality depth images mainly in indoor environments because they perform poorly in unconstrained outdoor environments. We have therefore also witnessed an increasing number of studies that focus on much more affordable and convenient RGB-based solutions for 3D hand shape or pose estimation. Although RGB-based 3D hand analysis is a much more challenging task and an inherently ill-posed problem owing to the increased 2D-to-3D depth and scale ambiguities, existing works have shown promising results in many aspects for 3D hand reconstruction. Similar to the development of depth image-based methods, starting from generative approaches, the research focus has shifted towards data-driven approaches or a combination of data-driven approaches with a model-driven approach.
6.1 Generative approaches
Similar to the generative approaches introduced for depth-based methods in Section 4.1, generative RGB-based approaches fit a 3D generative hand model to input observations with iterative optimization in order to reconstruct the 3D hand shape or pose. However, here the 3D hand must be reconstructed from a single monocular RGB frame, which cannot be directly used as the fitting target. Thus, multiple attempts follow a similar paradigm of fitting a generative hand model to 2D or 3D joint locations predicted from input RGB frames for 3D hand reconstruction[93,95,104-109].
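The fitting paradigm can be illustrated with a deliberately small sketch. Here the "hand model" is a rigid three-joint template whose only free parameter is a 3D translation, the camera is an assumed pinhole model with unit focal length, and the optimizer is plain gradient descent with finite-difference gradients. All names, the template, and the hyperparameters are hypothetical; the cited methods optimize far richer pose and shape parameters with additional priors.

```python
def project(point):
    """Pinhole projection in normalized image coordinates (focal length 1)."""
    x, y, z = point
    return (x / z, y / z)

def energy(t, template, observed_2d):
    """Sum of squared 2D distances between projected model joints and detections."""
    total = 0.0
    for (x, y, z), (u_obs, v_obs) in zip(template, observed_2d):
        u, v = project((x + t[0], y + t[1], z + t[2]))
        total += (u - u_obs) ** 2 + (v - v_obs) ** 2
    return total

def fit_translation(template, observed_2d, t_init, lr=0.02, steps=3000, eps=1e-6):
    """Iteratively minimize the reprojection energy over the translation t."""
    t = list(t_init)
    for _ in range(steps):
        grad = []
        for k in range(3):
            plus, minus = t[:], t[:]
            plus[k] += eps
            minus[k] -= eps
            # Central-difference approximation of dE/dt_k.
            grad.append((energy(plus, template, observed_2d)
                         - energy(minus, template, observed_2d)) / (2 * eps))
        t = [tk - lr * gk for tk, gk in zip(t, grad)]
    return t

# Hypothetical template (meters) and a ground-truth translation used to
# simulate the 2D joint detections a network would normally provide.
template = [(0.0, 0.0, 0.0), (0.05, 0.0, 0.0), (0.0, 0.08, 0.0)]
true_t = (0.02, -0.01, 0.5)
observed = [project((x + true_t[0], y + true_t[1], z + true_t[2]))
            for x, y, z in template]

t_init = (0.0, 0.0, 0.4)
initial_energy = energy(t_init, template, observed)
t_fit = fit_translation(template, observed, t_init)
final_energy = energy(t_fit, template, observed)
```

In this toy setup the in-plane translation converges within a few steps, while the depth component converges much more slowly because a change in depth only alters the apparent scale of the template. This is a small-scale view of the depth and scale ambiguity discussed throughout this section.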
6.2 Discriminative approaches: 2D-to-3D lifting
To overcome serious 2D-to-3D ambiguities and to model complex articulation for 3D hand pose estimation given monocular RGB images, conventional methods adopt a 2D-to-3D lifting paradigm using estimated 2D modalities, such as 2D pose and heatmap for direct 3D hand pose prediction.
Following this pipeline, Zimmermann and Brox[15] introduced the very first learning-based multi-stage framework, which includes hand segmentation, 2D heat-map estimation, and a 2D-to-3D lifting step for root-relative and scale-invariant 3D hand pose estimation. To further improve the anatomical plausibility of the estimated hand pose, Mueller et al. extended this pipeline by adding a kinematic 3D hand model, aiming to fit the estimated 2D and 3D hand pose predictions to enforce biomechanical feasibility[105]. Several geometrical constraints were also adopted to further impose human hand structural constraints to alleviate the 3D hand parsing ambiguity. Instead of directly regressing 3D hand poses from an estimated 2D heatmap that tends to lose detailed spatial information, Iqbal et al. introduced a novel intermediate 2.5D pose representation that can be easily estimated from RGB frames[17]. This representation retains the 2D spatial resolution and thus can fully exploit the local evidence for more precise joint localization, leading to a large boost in accuracy.
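To make the 2.5D idea concrete, the sketch below lifts joints given as pixel coordinates plus root-relative depth to 3D camera coordinates with the standard pinhole back-projection. It assumes that the camera intrinsics and the absolute root depth are known; recovering the root depth is a separate step that is not shown here. The function name and all numeric values are hypothetical.

```python
def backproject_2p5d(joints_2p5d, root_depth, fx, fy, cx, cy):
    """Lift (u, v, z_rel) joints to 3D camera coordinates.

    u, v are pixel coordinates, z_rel is depth relative to the root joint,
    and root_depth is the absolute depth of the root (assumed known here).
    """
    joints_3d = []
    for u, v, z_rel in joints_2p5d:
        z = root_depth + z_rel
        # Pinhole model: u = fx * X / Z + cx, v = fy * Y / Z + cy.
        joints_3d.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return joints_3d

# Hypothetical intrinsics and two joints: the root and one fingertip.
fx = fy = 500.0
cx, cy = 320.0, 240.0
joints_2p5d = [(320.0, 240.0, 0.0), (420.0, 240.0, 0.05)]
joints_3d = backproject_2p5d(joints_2p5d, root_depth=0.5, fx=fx, fy=fy, cx=cx, cy=cy)
```

Because the 2D part of the representation stays in image space, it can be predicted with heatmaps at full spatial resolution, and only one relative depth value per joint has to be regressed.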
6.3 Discriminative approaches: Cross-Modality
Previous studies have mainly relied on pairs of input RGB images with ground truth hand poses for training, despite the availability of multiple corresponding data sources such as depth maps, point clouds, and segmentation masks. Owing to the rapid growth of multimodal datasets, information across corresponding modalities can be utilized for enhanced learning. Thus, to fully exploit and relate information across various hand modalities in existing datasets, several studies have proposed learning a cross-modal statistical hand model that is represented by a unified latent space using deep generative models.
Spurr et al. aimed to learn a single shared latent space across multiple hand modalities based on the proposed cross-modal training scheme[110]. They enforce corresponding hand modality embeddings to lie in the same position in the shared latent space by alternating between decoding into different modalities. Various hand modalities, such as 3D pose, RGB images, and depth images are utilized for implementation. However, directly learning a shared latent space tends to cause slow convergence and compromise on performance. Thus, Yang et al. proposed the use of one latent space for each input hand modality, and then aligned their associated latent space with a joint space[111]. To improve the results, they proposed two ways of aligning these latent spaces, via KL divergence loss and the product of Gaussian experts. Theodoridis et al. further introduced a novel cross-modal variational alignment strategy to associate the probability distributions of latent spaces corresponding to different modalities for further improvements[112].
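The two alignment tools mentioned above have simple closed forms when the latent distributions are diagonal Gaussians, which is the usual assumption in VAE-style models. The sketch below shows both; it is a generic illustration, not the exact objective of any cited work, and the modality names and values are hypothetical.

```python
import math

def product_of_gaussians(means, variances):
    """Fuse diagonal Gaussian experts: precisions add, means are precision-weighted."""
    dims = len(means[0])
    fused_mean, fused_var = [], []
    for d in range(dims):
        precision = sum(1.0 / var[d] for var in variances)
        weighted = sum(mu[d] / var[d] for mu, var in zip(means, variances))
        fused_var.append(1.0 / precision)
        fused_mean.append(weighted / precision)
    return fused_mean, fused_var

def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """KL(q || p) between two diagonal Gaussians, summed over dimensions."""
    return sum(0.5 * (math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0)
               for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p))

# Hypothetical 2D latent posteriors from an RGB encoder and a depth encoder.
rgb_mu, rgb_var = [0.0, 1.0], [1.0, 1.0]
depth_mu, depth_var = [2.0, 1.0], [1.0, 1.0]

joint_mu, joint_var = product_of_gaussians([rgb_mu, depth_mu], [rgb_var, depth_var])
kl_rgb_depth = kl_diag_gaussians(rgb_mu, rgb_var, depth_mu, depth_var)
```

A KL term pulls the per-modality posteriors towards each other, whereas the product of experts yields a single fused posterior that is sharper than either input (here the variance drops from 1.0 to 0.5).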
Because of the nature of cross-modal training, this line of work also provides a straightforward way of realizing a semi-supervision or weak-supervision learning paradigm. Moreover, the use of deep generative models such as the variational auto-encoder (VAE) [113] encourages a smoother and more consistent latent space, and thus can be used for data generation.
6.4 Discriminative approaches: Disentanglement
To better analyze various factors of variation ranging from the background to camera viewpoint for better 3D hand pose estimation and hand image synthesis, some studies have also focused on disentangled representation learning given monocular RGB frames. Yang et al. adopted VAE as a basic structure to learn one unified latent space that is composed of sub-spaces calculated by factors, and a training strategy to fuse different latent spaces into one disentangled latent space[114]. The proposed disentangled VAE (dVAE) can disentangle hand poses and other factors, such as the image background and viewpoint.
6.5 Discriminative approaches: Model-based 3D hand reconstruction
Recently, for more practical applications, a fast-growing body of research literature has focused on dense 3D hand mesh reconstruction in addition to sparse 3D hand joint localizations because hand shape surfaces offer a richer representation and enable one to directly reason about occlusions and contact points with the surrounding objects or the environment. The 3D hand pose can still be obtained using the reconstructed hand surface via linear regression. Existing approaches can be divided into two types, namely the model-based and model-free approaches.
Model-based approaches have attracted significant attention owing to parametric 3D hand models, such as the recently proposed MANO hand model of Romero et al.[81]. As a differentiable 3D hand deformation model, MANO parameterizes a triangulated 3D hand mesh with a set of shape and pose parameters. A number of recent methods have integrated MANO into end-to-end deep learning algorithms in order to extract hand shape and pose factors from input RGB frames[70,115-120]. The key idea is to leverage such a well-constructed hand model as a strong prior knowledge to resolve ambiguities and to obtain geometrically plausible 3D hand shape surfaces and poses from monocular RGB frames.
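The linear regression from surface to pose amounts to one matrix product: each joint is a fixed weighted sum of mesh vertices. The sketch below uses a toy four-vertex "mesh" and made-up regressor weights purely for illustration; in practice the regressor is learned or supplied together with the hand model, and the mesh has hundreds of vertices.

```python
def regress_joints(regressor, vertices):
    """Compute joints as J = R @ V, where R is (num_joints x num_vertices)."""
    joints = []
    for weights in regressor:
        joints.append(tuple(
            sum(w * vertex[axis] for w, vertex in zip(weights, vertices))
            for axis in range(3)))
    return joints

# Hypothetical toy mesh (meters) and regressor; each row of weights sums to 1,
# so every joint is a convex combination of the vertices around it.
vertices = [(0.0, 0.0, 0.0), (0.02, 0.0, 0.0),
            (0.0, 0.04, 0.0), (0.02, 0.04, 0.0)]
regressor = [[0.5, 0.5, 0.0, 0.0],
             [0.0, 0.0, 0.5, 0.5]]
joints = regress_joints(regressor, vertices)
```

Because the mapping is linear and fixed, it is differentiable, so joint-level supervision can be back-propagated to the predicted mesh.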
Multiple attempts follow a similar pipeline for 3D hand shape reconstruction, which uses a relatively simple regression framework to directly estimate MANO parameters from input images[70,116,119,120]. To achieve much finer reconstruction results, some attempts have adopted the classic coarse-to-fine iterative regression strategy for MANO parameter estimation from input images[117,120]. Differentiable re-projection modules or neural renderers are also employed to re-project estimated 3D results into a 2D image domain for richer supervision, which can mitigate the need for 3D mesh data[117,118,120]. Zhou et al. proposed to fully utilize available data modalities as well as non-image MoCap data to estimate accurate 3D joint locations and corresponding rotation representations[121]. The 3D hand surface can then be estimated by fitting MANO to the predicted joint locations or joint rotations in an axis-angle representation.
6.6 Discriminative approaches: Model-free 3D hand reconstruction
The MANO model can be limited when representing complex hand shape variations owing to the use of linear bases for hand reconstruction. As another line of work for 3D hand surface reconstruction, model-free approaches directly estimate the mesh vertex coordinates instead of regressing the MANO parameters to generate more flexible and precise hand meshes.
Because recent geometric deep learning algorithms have shown remarkable performance in 3D analyses, and hand meshes are essentially graph-structured data, several works have proposed similar encoder-decoder schemes based on graph convolutional neural networks (graph CNNs), aiming to fully exploit the topological relationship among mesh vertices. As the first model-free approach, Ge et al. proposed the application of a spectral-based graph CNN with a pre-defined upsampling operation and graph topology of a triangular hand mesh for estimating the 3D vertex coordinates in the hand mesh, given embedded image features[16]. Following a similar paradigm, Kulon et al. adopted a recently developed spatial graph CNN and a spiral convolutional operator as a mesh decoder[122]. However, the embedded 1D image features used as the input to the graph CNN tend to lose important 2D spatial information. To better preserve the spatial relationship between pixels while modeling the uncertainty of prediction, Moon et al. proposed a novel lixel-based 1D heatmap representation for hand surface generation, which defines quantized cells in each dimension for each mesh vertex coordinate in 3D space[123]. The designed image-to-lixel prediction model estimates the per-lixel likelihood on 1D heatmaps for joint and vertex localization.
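A 1D heatmap has to be converted into a continuous coordinate at some point. One common, differentiable way to do this is a soft-argmax, that is, the expected cell index under the softmax of the heatmap scores. The sketch below assumes this readout for illustration; the function name and the scores are hypothetical, and the text above does not specify which readout the cited model uses.

```python
import math

def soft_argmax_1d(scores):
    """Return the expected cell index and the per-cell likelihoods."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # subtract the max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    coordinate = sum(index * p for index, p in enumerate(probs))
    return coordinate, probs

# Hypothetical scores over four quantized cells along the x axis of one vertex.
x_scores = [0.0, 4.0, 4.0, 0.0]
x_coord, x_probs = soft_argmax_1d(x_scores)
```

Running the same readout on three 1D heatmaps (one per axis) gives a full 3D coordinate for each vertex, and the spread of the likelihoods provides a rough indication of prediction uncertainty.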
6.7 Discriminative approaches: Weakly-supervised and semi-supervised learning
The fast growth of 3D hand shape and pose estimation has been attributed to the increased number of real-world datasets with complete 3D annotations, but it is always tedious, time-consuming, and expensive to acquire accurate 3D annotations for either a hand shape surface or hand pose. The sole reliance on large-scale synthetic datasets tends to fail on real-world datasets owing to the domain gap.
To alleviate the burden of 3D annotations for 3D hand pose estimation, Cai et al. proposed a weakly supervised adaptation method by bridging the gap between a fully annotated synthetic dataset and a weakly labeled real-world dataset using easily accessible depth maps[84]. With the help of a proposed depth regularizer, estimated 3D hand poses can be converted into corresponding depth maps that serve as weak supervision and can compensate for the scarcity of 3D annotations for real-world data. Another line of work that can be easily turned into a weakly supervised or semi-supervised learning paradigm is the cross-modal framework, as mentioned in Section 6.3[110-112]. Both the cross-modal training pipeline and alignment-based scheme can leverage data with weak labels to serve as complementary information for aiding the training process. However, proxy objectives cannot guarantee the resolution of 2D-to-3D depth and scale ambiguity and may generate physically implausible 3D hand poses. To maximize the use of weakly labeled real-world data for performance improvements, Spurr et al. used a set of biomechanically inspired constraints that require the estimated 3D pose to lie within the range of anatomically plausible 3D hand configurations[18].
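One plausible constraint of this biomechanical kind is an interval penalty on bone lengths, which is zero whenever every bone lies within an allowed range and grows linearly outside it. The sketch below is an assumption-laden illustration: the skeleton, the length limits, and the function name are hypothetical, and the actual constraint set of the cited work is not reproduced here.

```python
import math

def bone_length_penalty(joints_3d, bones, limits):
    """Sum of violations of per-bone [min, max] length intervals (meters)."""
    penalty = 0.0
    for (parent, child), (shortest, longest) in zip(bones, limits):
        length = math.dist(joints_3d[parent], joints_3d[child])
        penalty += max(0.0, shortest - length) + max(0.0, length - longest)
    return penalty

# Hypothetical three-joint chain (wrist, knuckle, middle finger joint).
bones = [(0, 1), (1, 2)]
limits = [(0.07, 0.11), (0.03, 0.05)]  # made-up plausible ranges

plausible_pose = [(0.0, 0.0, 0.0), (0.0, 0.09, 0.0), (0.0, 0.13, 0.0)]
stretched_pose = [(0.0, 0.0, 0.0), (0.0, 0.09, 0.0), (0.0, 0.20, 0.0)]

plausible_penalty = bone_length_penalty(plausible_pose, bones, limits)
stretched_penalty = bone_length_penalty(stretched_pose, bones, limits)
```

Such a penalty needs no 3D labels, so it can be applied to predictions on images that only carry 2D annotations.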
With respect to 3D hand shape surface reconstruction, there is a higher demand for designing reasonable proxy objectives because of the difficulty with manually annotating the ground truth dense hand mesh vertices. Several works follow similar pipelines using differentiable re-projection modules or neural renderers to provide proxy supervision on the 2D image domain because rich cues about 3D annotations are contained in these 2D labels[16,102,120,122]. Kulon et al. introduced a fast annotation method that can be used as a source for weak supervision[122]. They rely on fitting a given deformable hand model to 2D hand joint detections of input images to generate 3D hand shape surfaces as pseudo-ground truth. Thus, without annotating images manually, the model can be trained using direct 3D hand mesh reconstruction loss.
6.8 Discriminative approaches: Sequential modeling and tracking
Although 3D hand shape and pose estimation from input RGB images has attracted significant attention, most of the existing work has focused on frame-by-frame estimation based on independent static images. In contrast, several works consider not only the visual appearance of each input image but also the temporal features of the hand in motion.
To better handle the common 2D-to-3D ambiguities and model the hand articulation for hand pose estimation, Cai et al. proposed building a spatiotemporal graph and using spectral-based GCNN to effectively incorporate spatial dependencies and temporal consistencies given 2D joint detections[124]. Following this pipeline, SeqHAND[119] first adopted the convolution-LSTM model to learn spatiotemporal information from synthetic sequential images. Then, during domain fine-tuning from synthetic data to real data, the recurrent layer is detached to preserve the spatio-temporal features learned from sequential synthetic hand images. Fan et al. instead proposed leveraging temporal information to improve the computation efficiency[125]. Based on temporal coherence, they utilized a Gaussian kernel gate module to dynamically switch between a light model and a heavy network to reduce the computational cost. Han et al. introduced the first hand-tracking system that can efficiently track hands across different environments and users[126]. They first used the proposed detection-by-tracking system to extrapolate the current hand pose from previous tracked poses to generate a hand-detection bounding box. Then, cropped hand images are fed into a keypoint estimation model along with the extrapolated hand pose to provide a tracking history. These strategies improve temporal smoothness across frames and spatial consistency across multiple views, leading to an impressive performance.
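The detection-by-tracking step can be sketched as follows: the pose for the current frame is extrapolated from the two previous tracked poses, and a padded box around the extrapolated joints serves as the hand crop. The constant-velocity motion model, the use of 2D pixel coordinates, the margin, and the function names are all assumptions made for this illustration and are not details of the cited system.

```python
def extrapolate_pose(pose_two_back, pose_one_back):
    """Constant-velocity prediction: p_t = p_{t-1} + (p_{t-1} - p_{t-2})."""
    return [(2 * u1 - u2, 2 * v1 - v2)
            for (u2, v2), (u1, v1) in zip(pose_two_back, pose_one_back)]

def bounding_box(joints_uv, margin):
    """Axis-aligned box (u_min, v_min, u_max, v_max) padded by `margin` pixels."""
    us = [u for u, _ in joints_uv]
    vs = [v for _, v in joints_uv]
    return (min(us) - margin, min(vs) - margin,
            max(us) + margin, max(vs) + margin)

# Hypothetical tracked 2D joints (pixels) in frames t-2 and t-1.
pose_t2 = [(100, 100), (120, 140)]
pose_t1 = [(104, 102), (124, 142)]

predicted_pose = extrapolate_pose(pose_t2, pose_t1)
hand_box = bounding_box(predicted_pose, margin=10)
```

Predicting the box from the tracking history avoids running a separate hand detector on every frame, which is one reason such systems can be efficient.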
7 State-of-the-art RGB image-based hand-object interaction modeling
Although it is very important, recovering the 3D hand surface or pose during interaction with an object is quite difficult. In addition to the common issues faced with articulated pose estimation, including complex pose variations, depth and scale ambiguities, self-occlusions, and background clutter, the complex HOI scenarios introduce two new challenges: severe occlusion between the hand and manipulated object and the physical constraints between them. Thus, to solve this problem, we should focus on how to understand the complex interaction between the hand and the manipulated object.
7.1 Generative approaches
Early methods that were developed to track hands that interact with objects focus on fitting a given 3D generative hand model to RGB-D images[94,108,127]. In addition to minimizing the proposed alignment energy function, Sridhar et al.[94] and Tsoli et al.[108] also adopted regularizers to ensure the plausibility of hand-object interactions.
Multicamera systems have also been explored[3,128]. However, model-based hand trackers tend to suffer from model drift, which limits their potential applications.
7.2 Discriminative approaches: Joint estimation
As hands manipulate objects, the interactions could impose strong correlations and constraints between the hand and object. For example, different object poses and shapes usually constrain the type of hand grasp, while the hand pose can also provide hints on the object pose and shape information. A hand grasping a rigid object requires contact that occurs at the surface without interpenetration. Thus, jointly accounting for the presence of the hand and object should help to solve serious issues such as articulation, depth and scale ambiguities, and occlusions, thus leading to more precise and feasible 3D hand reconstruction.
Romero et al. first proposed the joint estimation of hands and objects by querying a large synthetic database of rendered hands manipulating objects to retrieve the best configurations that match the visual evidence[129]. Choi et al. directly shared captured hand-centered and object-centered information between two networks to learn a better representation for 3D hand pose estimation[130]. Tekin et al. proposed a unified 3D YOLO-based single-shot framework for jointly regressing a 3D hand-object pose, object recognition, and activity classification in a single feed-forward pass[131]. Learning such a joint representation for simultaneously estimating 3D articulated and rigid poses while determining the action that the subject is performing requires an understanding of the complex interactions between the hand and object. Moreover, useful knowledge across these relevant tasks can be shared to help improve performance for one another. However, learning a shared feature space across related tasks cannot fully exploit the fine-grained correlations between the hand and object, potentially leading to physically invalid results. Hasson et al. instead focused on the plausibility of hand-object physical contact in an effort to model a valid interaction scenario[116]. Based on a bisected framework for the joint reconstruction of 3D hand and object meshes, they further adopted a novel contact constraint as regularization, including a repulsion loss that penalizes penetration and an attraction loss that encourages contact between the hand and object. While it is important to regularize the physical contact for valid interaction, it is also essential to adaptively model the inherent correlations between the hand and object because they are highly correlated. Doosti et al. and Huang et al. explicitly modeled the dependencies among hand joints and object bounding box corners using graph convolution and the attention mechanism, respectively[132,133]. Both mechanisms can adaptively exploit inter-joint relationships to generate kinematically feasible hand-object pose results.
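The repulsion and attraction terms described above can be illustrated with a heavily simplified sketch in which the object is a sphere, so that the signed distance to its surface has a closed form. The sphere stand-in, the attraction threshold, and the function name are assumptions for illustration only; the cited method operates on reconstructed hand and object meshes and is not reproduced here.

```python
import math

def contact_losses(hand_vertices, center, radius, attract_threshold):
    """Return (repulsion, attraction) for hand vertices near a spherical object.

    Repulsion sums penetration depths of vertices inside the object.
    Attraction sums distances of vertices that are outside the object
    but closer to its surface than `attract_threshold`.
    """
    repulsion = 0.0
    attraction = 0.0
    for vertex in hand_vertices:
        signed_distance = math.dist(vertex, center) - radius
        if signed_distance < 0.0:
            repulsion += -signed_distance
        elif signed_distance < attract_threshold:
            attraction += signed_distance
    return repulsion, attraction

# Hypothetical object (3 cm sphere at the origin) and three hand vertices:
# one penetrating, one hovering just above the surface, one far away.
hand_vertices = [(0.02, 0.0, 0.0), (0.035, 0.0, 0.0), (0.10, 0.0, 0.0)]
repulsion, attraction = contact_losses(
    hand_vertices, center=(0.0, 0.0, 0.0), radius=0.03, attract_threshold=0.01)
```

The two terms pull in opposite directions, so minimizing both drives nearby hand vertices onto the object surface: close enough to touch it, but not inside it.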
7.3 Discriminative approaches: Weakly supervised
Recent research efforts for 3D hand shape and pose estimation under HOI scenarios have focused on fully supervised approaches that require a large number of fully annotated training images. However, the variations of existing real HOI datasets are limited because of the constrained acquisition environments, and the acquisition of high-quality 3D mesh or pose annotations for HOI is also expensive, tedious, and error-prone. The use of current large-scale synthetic datasets cannot ensure satisfactory fidelity and realism for generalization.
To mitigate the need for constructing large fully annotated HOI datasets, Hasson et al. proposed leveraging the temporal consistency between neighboring frames in HOI videos given sparsely annotated RGB HOI videos for the joint reconstruction of hands and objects[116]. Based on the 3D displacement between neighboring hand-object vertices, they differentiably render the optical flow and define a warping function to warp one image to the next. A photometric loss based on visual consistency between successive warped images can then be applied as self-supervision to effectively propagate information from the annotated frames to unlabeled ones. To further reduce the number of 3D annotations for HOI scenarios, Baek et al. presented a novel domain adaptation framework to leverage the existing fully labeled hand-only datasets that can be easily accessible[115]. The designed framework adapts the hand-object domain to the corresponding de-occluded hand-only domain via a mesh renderer for filling occluded hand pixels and GAN for further hand alignment.
8 RGB datasets
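RGB datasets
| Dataset | Joint/Mesh | No. Subjects | No. Frames | Object | Depth | Year |
| --- | --- | --- | --- | --- | --- | --- |
| InterHand[134] | 21 | 27 | 2.6 Million | No | No | 2020 |
| ContactPose[88] | 21/Mesh | 50 | 2.9 Million | Yes | Yes | 2020 |
| CMU Panoptic HandDB[138] | 21 | N/A | 14,817 | No | No | 2017 |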
8.1 STB dataset
The STB dataset features the sequences of a single subject's left hand and is obtained under six real-world indoor backgrounds with different illumination conditions. Moreover, it contains 18k images with the ground truth of 21 3D hand joint locations and corresponding depth maps.
8.2 FHAD dataset
This is a recently published hand video collection, and it provides a variety of hand-object interactions, including 1175 videos with 45 types of activities and 6 subjects from an egocentric viewpoint. Visible magnetic sensors are strapped on the hands, which may help to automatically annotate 3D joints. A subset of this dataset contains the ground-truth 3D mesh and 6D object pose for four objects (juice bottle, liquid soap, milk, and salt). In total, 10 different action categories were involved in this subset.
8.3 ObMan dataset
ObMan is a large-scale dataset of synthetically generated hand-object interaction images with 141k training samples, 6.4k validation samples, and 6.2k testing samples. Each image is generated by rendering a 3D hand mesh grasping an object model drawn from eight everyday object categories of ShapeNet[140].
8.4 HO-3D dataset
HO-3D is a recently published hand-object interaction dataset containing sequences of hands interacting with objects, captured from a third-person point of view. It is the first markerless hand-object dataset of color images, with 77k annotated frames and corresponding depth maps across 65 sequences, 10 persons, and 10 objects. The training set contains 66k frames with 3D pose annotations for both hands and objects. In the testing set of 11k frames, the hands are annotated only with the 3D location of the wrist.
8.5 FreiHAND dataset
FreiHAND is a recently released dataset containing 130k training samples and approximately 4k testing samples, with MANO hand shape and pose parameters serving as 3D hand mesh annotations. Because annotations are not available for the testing samples, evaluation is performed by anonymously submitting predictions to an online server.
9 Conclusions
In summary, numerous methods have already improved hand shape and pose estimation performance in controlled environments. In some cases, current state-of-the-art methods achieve estimation errors low enough (< 10 mm) to enable practical applications. However, as discussed in Section 2, challenges such as articulation, orientation, occlusion, self-similarity, depth and scale ambiguities, noise, clutter, and limited datasets persist and require more attention. The problem is far from solved in terms of both effectiveness and efficiency.
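Error figures such as the one quoted above are commonly reported as the mean Euclidean distance between predicted and ground-truth 3D joint locations. The following minimal sketch shows that computation under the assumption that joints are given as (x, y, z) tuples in millimetres; the function name `mean_joint_error` and the toy values are hypothetical.

```python
import math


def mean_joint_error(pred, gt):
    """Mean Euclidean distance between corresponding 3D joints (same units as input)."""
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and of equal length")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


# Toy example with three joints, coordinates in millimetres.
pred_joints = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (0.0, 3.0, 4.0)]
gt_joints = [(0.0, 0.0, 0.0), (10.0, 0.0, 6.0), (0.0, 0.0, 0.0)]

# Per-joint errors are 0, 6 and 5 mm, so the mean is 11/3 mm.
error_mm = mean_joint_error(pred_joints, gt_joints)
```

Benchmarks differ in details such as root alignment or scale normalization before the distance is computed, so values are only comparable within one evaluation protocol.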
In the near future, data-efficient methods such as weakly supervised and domain-transfer approaches are promising directions for deep learning (DL)-based methods, for which data hunger and poor generalization ability are common issues. Previous works have aimed to mitigate this problem using large-scale annotated datasets[92]. Weakly supervised methods that take only raw depth images as input also show promising performance, even when compared with recently proposed supervised methods[85].
Another line of work attempts to model the inherent articulation of joints by incorporating hand geometry constraints into loss functions[2,66-69] and utilizing novel network structures such as transformers[67]. This is a promising way to strengthen the inference ability of these approaches and thus achieve more precise predictions under self-occlusion and clutter.
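One simple form of geometry constraint is a bone-length term added to the usual joint-position loss. The sketch below is illustrative only and does not reproduce the formulation of any cited work: the kinematic chain `BONES`, the weight `lam`, and all function names are assumptions.

```python
import math

# Hypothetical kinematic chain of one finger: (parent, child) joint indices.
BONES = [(0, 1), (1, 2), (2, 3)]


def bone_lengths(joints):
    """Length of every bone in BONES for a list of (x, y, z) joints."""
    return [math.dist(joints[p], joints[c]) for p, c in BONES]


def joint_loss(pred, gt):
    """Mean squared Euclidean distance between corresponding joints."""
    return sum(math.dist(p, g) ** 2 for p, g in zip(pred, gt)) / len(pred)


def bone_length_loss(pred, gt):
    """Mean squared difference between predicted and reference bone lengths."""
    return sum((a - b) ** 2
               for a, b in zip(bone_lengths(pred), bone_lengths(gt))) / len(BONES)


def total_loss(pred, gt, lam=0.1):
    """Joint-position loss plus a weighted geometry (bone-length) penalty."""
    return joint_loss(pred, gt) + lam * bone_length_loss(pred, gt)


gt = [(0.0, 0.0, 0.0), (0.0, 30.0, 0.0), (0.0, 50.0, 0.0), (0.0, 65.0, 0.0)]
# Rigidly shifted prediction: every joint is 5 mm off, bone lengths are intact.
shifted = [(x + 5.0, y, z) for x, y, z in gt]
# Prediction with an implausibly stretched last bone (25 mm instead of 15 mm).
stretched = gt[:3] + [(0.0, 75.0, 0.0)]

loss_shifted = total_loss(shifted, gt)
loss_stretched = total_loss(stretched, gt)
```

Both toy predictions have the same joint-position loss, but only the stretched one violates the hand geometry, so the combined loss ranks it as worse. This is the kind of structural prior that a plain per-joint loss cannot express.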
More research is needed on the most fundamental issues of this 3D pose estimation task, such as occlusion and depth and scale ambiguities. In particular, it remains to be determined how to make an algorithm aware of occlusion and of the global hand scale.
Moreover, multi-modality methods are likely to achieve more accurate hand shape and pose estimations because depth and RGB inputs are mutually reinforcing[14,15,93,100]. This may enable wider application scenarios.
In addition, interest is growing in more complex and realistic cases, such as hand interactions with rigid or non-rigid objects. Understanding and modeling interactions between hands and objects is vital for analyzing and imitating human behavior, which could lead to many applications in human-computer interactions, virtual reality, augmented reality, robotics, etc.
Beyond working on monocular single depth or RGB frames, we foresee a trend toward exploiting multiview geometry or sequential features to resolve various 2D-to-3D ambiguities and to enforce temporal smoothness, as current single-frame pipelines tend to generate jittery results.
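As a simple illustration of enforcing temporal smoothness on per-frame predictions, the sketch below filters a sequence of predicted poses with an exponential moving average. This is a generic post-processing step, not a method from the surveyed papers; the smoothing factor `alpha` and the function names are assumptions.

```python
import math


def smooth_sequence(frames, alpha=0.5):
    """Exponential moving average over a sequence of poses.

    frames: list of poses, each pose a list of (x, y, z) joints.
    alpha:  weight of the current frame; smaller values smooth more.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = []
    for pose in frames:
        if not smoothed:
            current = [tuple(joint) for joint in pose]
        else:
            previous = smoothed[-1]
            current = [
                tuple(alpha * c + (1.0 - alpha) * p
                      for c, p in zip(joint, prev_joint))
                for joint, prev_joint in zip(pose, previous)
            ]
        smoothed.append(current)
    return smoothed


def jitter(frames):
    """Mean frame-to-frame displacement of all joints."""
    steps = [math.dist(a, b)
             for prev, cur in zip(frames, frames[1:])
             for a, b in zip(prev, cur)]
    return sum(steps) / len(steps)


# One joint whose per-frame estimate oscillates by 2 mm.
raw = [[(0.0, 0.0, 0.0)], [(2.0, 0.0, 0.0)],
       [(0.0, 0.0, 0.0)], [(2.0, 0.0, 0.0)]]
filtered = smooth_sequence(raw, alpha=0.5)
```

The filter reduces frame-to-frame displacement at the cost of lag behind fast motion, which is one reason learned sequential models are of interest for this problem.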



Tompson J, Stein M, LeCun Y, Perlin K. Real-time continuous pose recovery of human hands using convolutional networks. ACM Transactions on Graphics, 2014, 33(5): 169 DOI:10.1145/2629500


Zhou X, Wan Q, Zhang W, Xue X, Wei Y. Model-based deep hand pose estimation. In: Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence. New York, USA, AAAI Press, 2016, 2421–2427


Oikonomidis I, Kyriazis N, Argyros A A. Full DOF tracking of a hand interacting with an object by modeling occlusions and physical constraints. In: 2011 International Conference on Computer Vision. Barcelona, Spain, IEEE, 2011, 2088–2095 DOI:10.1109/iccv.2011.6126483


Qian C, Sun X, Wei Y C, Tang X O, Sun J. Realtime and robust hand tracking from depth. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, OH, USA, IEEE, 2014, 1106–1113 DOI:10.1109/cvpr.2014.145


de la Gorce M, Fleet D J, Paragios N. Model-based 3D hand pose estimation from monocular video. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011, 33(9): 1793–1805 DOI:10.1109/tpami.2011.33


Berlin, Heidelberg, Springer Berlin Heidelberg, 2011, 744–757 DOI:10.1007/978-3-642-19318-7_58


Xu C, Govindarajan L N, Zhang Y, Cheng L. Lie-X: depth image based articulated object pose estimation, tracking, and action recognition on lie groups. International Journal of Computer Vision, 2017, 123(3): 454–478 DOI:10.1007/s11263-017-0998-6


Wan C D, Probst T, Gool L V, Yao A. Dense 3D regression for hand pose estimation. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA, IEEE, 2018, 5147–5156 DOI:10.1109/cvpr.2018.00540


Newell A, Yang K Y, Deng J. Stacked hourglass networks for human pose estimation. In: Computer Vision–ECCV 2016. Cham: Springer International Publishing, 2016, 483–499. DOI:10.1007/978-3-319-46484-8_29


Barsoum E. Articulated hand pose estimation review. 2016


Chen W Y, Yu C C, Tu C Y, Lyu Z H, Tang J, Ou S Q, Fu Y, Xue Z D. A survey on hand pose estimation with wearable sensors and computer-vision-based methods. Sensors (Basel, Switzerland), 2020, 20(4): E1074 DOI:10.3390/s20041074


Ye Q, Kim T K. Occlusion-aware hand pose estimation using hierarchical mixture density network. In: Computer Vision–ECCV 2018. Cham: Springer International Publishing, 2018, 817–834. DOI:10.1007/978-3-030-01249-6_49


Zhang J W, Jiao J B, Chen M L, Qu L Q, Xu X B, Yang Q X. A hand pose tracking benchmark from stereo matching. In: 2017 IEEE International Conference on Image Processing (ICIP). Beijing, China, IEEE, 2017, 982–986 DOI:10.1109/icip.2017.8296428


Garcia-Hernando G, Yuan S X, Baek S, Kim T K. First-person hand action benchmark with RGB-D videos and 3D hand pose annotations. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA, IEEE, 2018, 409–419 DOI:10.1109/cvpr.2018.00050


Zimmermann C, Brox T. Learning to estimate 3D hand pose from single RGB images. 2017 IEEE International Conference on Computer Vision (ICCV), 2017, 4913–4921 DOI:10.1109/iccv.2017.525


Ge L H, Ren Z, Li Y C, Xue Z H, Wang Y Y, Cai J F, Yuan J S. 3D hand shape and pose estimation from a single RGB image. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, CA, USA, IEEE, 2019, 10825–10834 DOI:10.1109/cvpr.2019.01109


Iqbal U, Molchanov P, Breuel T, Gall J, Kautz J. Hand pose estimation via latent 2.5D heatmap regression. In: Computer Vision–ECCV 2018. Cham: Springer International Publishing, 2018, 125–143 DOI:10.1007/978-3-030-01252-6_8


Spurr A, Iqbal U, Molchanov P, Hilliges O, Kautz J. Weakly supervised 3D hand pose estimation via biomechanical constraints. In: Computer Vision–ECCV 2020. Cham: Springer International Publishing, 2020, 211–228 DOI:10.1007/978-3-030-58520-4_13


World Population Prospects-Population Division. United Nations. 2020


O'Dea. Forecast Number of Mobile Users Worldwide 2020-2024. 2020


O'Dea. Smartphone Users 2020. 2020


Giancola S, Valenti M, Sala R M. A survey on 3D cameras: metrological comparison of time-of-flight, structured-light and active stereoscopy technologies. Cham: Springer International Publishing, 2018 DOI:10.1007/978-3-319-91761-0


Intel® RealSense™ Technology. 2020


Kinect for Windows. Kinect-Windows App Development. 2020


ROS.org. Xbox 360 Kinect Internals. 2021


3D Camera Survey-ROS-Industrial. 2020


MYNT EYE Standard. MYNT EYE, www.mynteye.com/products/mynt-eye-stereo-camera.


Lu S, Metaxas D, Samaras D, Oliensis J. Using multiple cues for hand tracking and model refinement. In: 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Proceedings. Madison, WI, USA, IEEE, 2003, II–443 DOI:10.1109/cvpr.2003.1211501


Bray M, Koller-Meier E, van Gool L. Smart particle filtering for high-dimensional tracking. Computer Vision and Image Understanding, 2007, 106(1): 116–129 DOI:10.1016/j.cviu.2005.09.013


Dundee, British Machine Vision Association, 2011 DOI:10.5244/c.25.101


Tkach A, Tagliasacchi A, Remelli E, Pauly M, Fitzgibbon A. Online generative model personalization for hand tracking. ACM Transactions on Graphics, 2017, 36(6): 1–11 DOI:10.1145/3130800.3130830


Delamarre Q, Faugeras O. 3D articulated models and multiview tracking with physical forces. Computer Vision and Image Understanding, 2001, 81(3): 328–357 DOI:10.1006/cviu.2000.0892


Poli R, Kennedy J, Blackwell T. Particle swarm optimization. Swarm Intelligence, 2007, 1(1): 33–57 DOI:10.1007/s11721-007-0002-0


Tagliasacchi A, Schröder M, Tkach A, Bouaziz S, Botsch M, Pauly M. Robust articulated-ICP for real-time hand tracking. Computer Graphics Forum, 2015, 34(5): 101–114 DOI:10.1111/cgf.12700


Taylor J, Bordeaux L, Cashman T, Corish B, Keskin C, Sharp T, Soto E, Sweeney D, Valentin J, Luff B, Topalian A, Wood E, Khamis S, Kohli P, Izadi S, Banks R, Fitzgibbon A, Shotton J. Efficient and precise interactive hand tracking through joint, continuous optimization of pose and correspondences. ACM Transactions on Graphics, 2016, 35(4): 1–12 DOI:10.1145/2897824.2925965


Oberweger M, Wohlhart P, Lepetit V. Hands deep in deep learning for hand pose estimation. In: Proceedings of 20th Computer Vision Winter Workshop (CVWW). 2015, 21–30


Oberweger M, Lepetit V. DeepPrior++: improving fast and accurate 3D hand pose estimation. In: 2017 IEEE International Conference on Computer Vision Workshops (ICCVW). Venice, Italy, IEEE, 2017, 585–594 DOI:10.1109/iccvw.2017.75


Guo H, Wang G, Chen X, Zhang C, Qiao F, Yang H. Region ensemble network: improving convolutional network for hand pose estimation. 2017


Guo H K, Wang G J, Chen X H, Zhang C R. Towards good practices for deep 3D hand pose.


Chen X H, Wang G J, Guo H K, Zhang C R. Pose guided structured region ensemble network for cascaded hand pose estimation. Neurocomputing, 2020, 395: 138–149 DOI:10.1016/j.neucom.2018.06.097


Ge L H, Liang H, Yuan J S, Thalmann D. Robust 3D hand pose estimation in single depth images: from single-view CNN to multi-view CNNs. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, NV, USA, IEEE, 2016, 3593–3601 DOI:10.1109/cvpr.2016.391


Haque A, Peng B Y, Luo Z L, Alahi A, Yeung S, Li F F. Towards viewpoint invariant 3D human pose estimation. In: Computer Vision–ECCV 2016. Cham: Springer International Publishing, 2016, 160–177 DOI:10.1007/978-3-319-46448-0_10


Toshev A, Szegedy C. DeepPose: human pose estimation via deep neural networks. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, OH, USA, IEEE, 2014, 1653–1660 DOI:10.1109/cvpr.2014.214


Cao Z, Simon T, Wei S H, Sheikh Y. Realtime multi-person 2D pose estimation using part affinity fields. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, HI, USA, IEEE, 2017, 1302–1310 DOI:10.1109/cvpr.2017.143


Wei S H, Ramakrishna V, Kanade T, Sheikh Y. Convolutional pose machines. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, NV, USA, IEEE, 2016, 4724–4732 DOI:10.1109/cvpr.2016.511


Xiao B, Wu H P, Wei Y C. Simple baselines for human pose estimation and tracking. In: Computer Vision–ECCV 2018. Cham: Springer International Publishing, 2018, 472–487 DOI:10.1007/978-3-030-01231-1_29


Xiong F, Zhang B S, Xiao Y, Cao Z G, Yu T D, Zhou J T, Yuan J S. A2J: anchor-to-joint regression network for 3D articulated pose estimation from a single depth image. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV). Seoul, Korea (South), IEEE, 2019, 793–802 DOI:10.1109/iccv.2019.00088


Ge L H, Liang H, Yuan J S, Thalmann D. 3D convolutional neural networks for efficient and robust hand pose estimation from single depth images. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, HI, USA, IEEE, 2017, 5679–5688 DOI:10.1109/cvpr.2017.602


Deng X M, Yang S, Zhang Y D, Tan P, Wang H. Hand3D: hand pose estimation using 3D neural network. 2017


Chang J Y, Moon G, Lee K M. V2V-PoseNet: voxel-to-voxel prediction network for accurate 3D hand and human pose estimation from a single depth map. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA, IEEE, 2018, 5079–5088 DOI:10.1109/cvpr.2018.00533


Charles R Q, Hao S, Mo K C, Guibas L J. PointNet: deep learning on point sets for 3D classification and segmentation. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, HI, USA, IEEE, 2017, 77–85 DOI:10.1109/cvpr.2017.16


Qi C R, Yi L, Su H, Guibas L J. PointNet++: deep hierarchical feature learning on point sets in a metric space. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, California, USA, Curran Associates Inc, 2017, 5105–5114


Maturana D, Scherer S. VoxNet: a 3D Convolutional Neural Network for real-time object recognition. In: 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). Hamburg, Germany, IEEE, 2015, 922–928 DOI:10.1109/iros.2015.7353481


Ge L H, Cai Y J, Weng J W, Yuan J S. Hand PointNet: 3D hand pose estimation using point sets. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA, IEEE, 2018, 8417–8426 DOI:10.1109/cvpr.2018.00878


Li S L, Lee D. Point-to-pose voting based hand pose estimation using residual permutation equivariant layer. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, CA, USA, IEEE, 2019, 11919–11928 DOI:10.1109/cvpr.2019.01220


Moon G, Chang J Y, Suh Y, Lee K M. Holistic planimetric prediction to local volumetric prediction for 3D human pose estimation. 2017


Wang K Z, Zhai S F, Cheng H, Liang X D, Lin L. Human pose estimation from depth images via inference embedded multi-task learning. In: Proceedings of the 24th ACM International Conference on Multimedia. Amsterdam, the Netherlands, New York, NY, USA, ACM, 2016 DOI:10.1145/2964284.2964322


Wang K Z, Lin L, Ren C J, Zhang W, Sun W X. Convolutional memory blocks for depth data representation learning. In: Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence. Stockholm, Sweden, California, International Joint Conferences on Artificial Intelligence Organization, 2018 DOI:10.24963/ijcai.2018/387


Ge L H, Ren Z, Yuan J S. Point-to-point regression PointNet for 3D hand pose estimation. In: Computer Vision–ECCV 2018. Cham: Springer International Publishing, 2018, 489–505 DOI:10.1007/978-3-030-01261-8_29


Pavlakos G, Zhou X W, Derpanis K G, Daniilidis K. Coarse-to-fine volumetric prediction for single-image 3D human pose. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, HI, USA, IEEE, 2017, 1263–1272 DOI:10.1109/cvpr.2017.139


Armagan A, Garcia-Hernando G, Baek S, Hampali S, Rad M, Zhang Z, Xie S, Chen M X, Zhang B, Xiong F. Measuring generalisation to unseen viewpoints, articulations, shapes and objects for 3D hand pose estimation under hand-object interaction. 2020


Cham, Springer International Publishing, 2018, 536–553


Malik J, Abdelaziz I, Elhayek A, Shimada S, Ali S A, Golyanik V, Theobalt C, Stricker D. HandVoxNet: deep voxel-based network for 3D hand shape and pose estimation from a single depth map. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, WA, USA, IEEE, 2020, 7111–7120 DOI:10.1109/cvpr42600.2020.00714


Wu X K, Finnegan D, O'Neill E, Yang Y L. HandMap: robust hand pose estimation via intermediate dense guidance map supervision. In: Computer Vision–ECCV 2018. Cham: Springer International Publishing, 2018, 246–262 DOI:10.1007/978-3-030-01270-0_15


Madadi M, Escalera S, Baro X, Gonzalez J. End-to-end global to local CNN learning for hand pose recovery in depth data. 2017


Huang L, Tan J C, Liu J, Yuan J S. Hand-transformer: non-autoregressive structured modeling for 3D hand pose estimation. In: Computer Vision–ECCV 2020. Cham: Springer International Publishing, 2020, 17–33 DOI:10.1007/978-3-030-58595-2_2


Lille, France, IEEE, 2019 DOI:10.1109/fg.2019.8756559


Zhou X Y, Sun X, Zhang W, Liang S, Wei Y C. Deep kinematic pose regression. In: Lecture Notes in Computer Science. Cham: Springer International Publishing, 2016, 186–201 DOI:10.1007/978-3-319-49409-8_17


Hasson Y, Varol G, Tzionas D, Kalevatykh I, Black M J, Laptev I, Schmid C. Learning joint reconstruction of hands and manipulated objects. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2019, 11799–11808 DOI:10.1109/CVPR.2019.01208


Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser Ł, Polosukhin I. Attention is all you need. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, California, USA, Curran Associates Inc., 2017, 6000–6010


Ren P F, Sun H F, Qi Q. SRN: stacked regression network for real-time 3D hand pose estimation. In: The British Machine Vision Conference. 2019


Higuchi T, Yao X, Liu Y. Evolutionary ensembles with negative correlation learning. IEEE Transactions on Evolutionary Computation, 2000, 4(4): 380–387 DOI:10.1109/4235.887237


Zhang L, Shi Z, Cheng M M, Liu Y, Bian J W, Zhou J T, Zheng G, Zeng Z. Nonlinear regression via deep negative correlation learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021, 43(3): 982–998 DOI:10.1109/TPAMI.2019.2943860


Zhou Y, Lu J, Du K, Lin X, Sun Y, Ma X. HBE: hand branch ensemble network for real-time 3D hand pose estimation. In: Computer Vision–ECCV. Cham, Springer International Publishing, 2018, 521–536


Fang L, Liu X, Liu L, Xu H, Kang W. JGR-P2O: joint graph reasoning based pixel-to-offset prediction network for 3D hand pose estimation from a single depth image. In: Computer Vision–ECCV. Cham, Springer International Publishing, 2020, 120–137


Huang W, Ren P, Wang J, Qi Q, Sun H. AWR: Adaptive weighting regression for 3D hand pose estimation. In: Proceedings of the AAAI Conference on Artificial Intelligence. 2020, 11061–11068


Montreal, Canada, MIT Press, 2014, 2672–2680


Shrivastava A, Pfister T, Tuzel O, Susskind J, Webb R. Learning from simulated and unsupervised images through adversarial training. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017


Dibra E, Wolf T, Oztireli C, Gross M. How to refine 3D hand pose estimation from unlabelled depth data? In: 2017 International Conference on 3D Vision (3DV). 2017, 135–144 DOI:10.1109/3DV.2017.00025


Romero J, Tzionas D, Black M J. Embodied hands: modeling and capturing hands and bodies together. ACM Transactions on Graphics (ToG), 2017


Zhang Z H, Xie S P, Chen M X, Zhu H C. HandAugment: a simple data augmentation for HANDS19 challenge task 1: depth-based 3D hand pose estimation. 2020


Rad M, Oberweger M, Lepetit V. Feature mapping for learning fast and accurate 3D pose inference from synthetic images. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018


Cai Y J, Ge L H, Cai J F, Yuan J S. Weakly-supervised 3D hand pose estimation from monocular RGB images. In: Computer Vision–ECCV 2018. Cham: Springer International Publishing, 2018, 678–694 DOI:10.1007/978-3-030-01231-1_41


Wan C D, Probst T, Van Gool L, Yao A. Self-supervised 3D hand pose estimation through training by fitting. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, CA, USA, IEEE, 2019, 10845–10854 DOI:10.1109/cvpr.2019.01111


Chen Y J, Tu Z G, Ge L H, Zhang D J, Chen R Z, Yuan J S. SO-HandNet: self-organizing network for 3D hand pose estimation with semi-supervised learning. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV). Seoul, Korea (South), IEEE, 2019, 6960–6969 DOI:10.1109/iccv.2019.00706


Wan C D, Probst T, Van Gool L, Yao A. Crossing nets: combining GANs and VAEs with a shared latent space for hand pose estimation. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, HI, USA, IEEE, 2017, 1196–1205 DOI:10.1109/cvpr.2017.132


Brahmbhatt S, Tang C C, Twigg C D, Kemp C C, Hays J. ContactPose: A dataset of grasps with object contact and hand pose. In: Computer Vision–ECCV 2020. Cham: Springer International Publishing, 2020, 361–378 DOI:10.1007/978-3-030-58601-0_22


Lin F Q, Wilhelm C, Martinez T. Two-hand global 3D pose estimation using monocular RGB. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2020


Malik J, Elhayek A, Nunnari F, Stricker D. Simple and effective deep hand shape and pose regression from a single depth image. Computers & Graphics, 2019, 85: 85–91 DOI:10.1016/j.cag.2019.10.002


Malik J, Elhayek A, Nunnari F, Varanasi K, Tamaddon K, Heloir A, Stricker D. DeepHPS: end-to-end estimation of 3D hand pose and shape by learning from synthetic depth. In: 2018 International Conference on 3D Vision (3DV). Verona, Italy, IEEE, 2018, 110–119 DOI:10.1109/3dv.2018.00023


Yuan S X, Ye Q, Stenger B, Jain S, Kim T K. BigHand2.2M benchmark: hand pose dataset and state of the art analysis. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, HI, USA, IEEE, 2017, 2605–2613 DOI:10.1109/cvpr.2017.279


Mueller F, Mehta D, Sotnychenko O, Sridhar S, Casas D, Theobalt C. Real-time hand tracking under occlusion from an egocentric RGB-D sensor. In: 2017 IEEE International Conference on Computer Vision Workshops (ICCVW). Venice, Italy, IEEE, 2017, 1284–1293 DOI:10.1109/iccvw.2017.82


Sridhar S, Mueller F, Zollhöfer M, Casas D, Oulasvirta A, Theobalt C. Real-time joint tracking of a hand manipulating an object from RGB-D input. In: Computer Vision–ECCV 2016. Cham: Springer International Publishing, 2016, 294–310 DOI:10.1007/978-3-319-46475-6_19


Sharp T, Keskin C, Robertson D, Taylor J, Shotton J, Kim D, Rhemann C, Leichter I, Vinnikov A, Wei Y C, Freedman D, Kohli P, Krupka E, Fitzgibbon A, Izadi S. Accurate, robust, and flexible real-time hand tracking. In: Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems. Seoul, Republic of Korea, New York, NY, USA, ACM, 2015 DOI:10.1145/2702123.2702179


Ge L H, Liang H, Yuan J S, Thalmann D. Robust 3D hand pose estimation in single depth images: from single-view CNN to multi-view CNNs. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, NV, USA, IEEE, 2016, 3593–3601 DOI:10.1109/cvpr.2016.391


Swansea, British Machine Vision Association, 2015 DOI:10.5244/c.29.33


Tzionas D, Ballan L, Srikantha A, Aponte P, Pollefeys M, Gall J. Capturing hands in action using discriminative salient points and physics simulation. International Journal of Computer Vision, 2016, 118(2): 172–193 DOI:10.1007/s11263-016-0895-4


Tang D, Chang H J, Tejani A, Kim T. Latent regression forest: structured estimation of 3D articulated hand posture. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition. 2014, 3786–3793 DOI:10.1109/CVPR.2014.490


Rogez G, Supancic III J S, Khademi M, Montiel J, Ramanan D. 3D hand pose detection in egocentric RGB-D images. In: Proc. European Conference on Computer Vision Workshops (ECCVW). 2014


Xu C, Cheng L. Efficient hand pose estimation from a single depth image. In: Proc. International Conference on Computer Vision (ICCV). 2013


Sridhar S, Oulasvirta A, Theobalt C. Interactive markerless articulated hand motion tracking using RGB and Depth data. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2013


Yu Z X, Shin Yoon J, Lee I K, Venkatesh P, Park J, Yu J H, Park H S. HUMBI: a large multiview dataset of human body expressions. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, WA, USA, IEEE, 2020, 2987–2997 DOI:10.1109/cvpr42600.2020.00306


Joo H, Simon T, Sheikh Y. Total capture: a 3D deformation model for tracking faces, hands, and bodies. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA, IEEE, 2018, 8320–8329 DOI:10.1109/cvpr.2018.00868


Mueller F, Bernard F, Sotnychenko O, Mehta D, Sridhar S, Casas D, Theobalt C. GANerated hands for real-time 3D hand tracking from monocular RGB. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA, IEEE, 2018, 49–59 DOI:10.1109/cvpr.2018.00013


Panteleris P, Oikonomidis I, Argyros A. Using a single RGB frame for real time 3D hand pose estimation in the wild. In: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). Lake Tahoe, NV, USA, IEEE, 2018, 436–445 DOI:10.1109/wacv.2018.00054


Xiang D L, Joo H, Sheikh Y. Monocular total capture: posing face, body, and hands in the wild. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, CA, USA, IEEE, 2019, 10957–10966 DOI:10.1109/cvpr.2019.01122


Tsoli A, Argyros A A. Joint 3D tracking of a deformable object in interaction with a hand. In: Computer Vision–ECCV 2018. Cham: Springer International Publishing, 2018, 504–520 DOI:10.1007/978-3-030-01264-9_30


Pavlakos G, Choutas V, Ghorbani N, Bolkart T, Osman A A, Tzionas D, Black M J. Expressive body capture: 3D hands, face, and body from a single image. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, CA, USA, IEEE, 2019, 10967–10977 DOI:10.1109/cvpr.2019.01123


Spurr A, Song J, Park S, Hilliges O. Cross-modal deep variational hand pose estimation. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA, IEEE, 2018, 89–98 DOI:10.1109/cvpr.2018.00017


Yang L L, Li S L, Lee D, Yao A. Aligning latent spaces for 3D hand pose estimation. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV). Seoul, Korea (South), IEEE, 2019, 2335–2343 DOI:10.1109/iccv.2019.00242


Theodoridis T, Chatzis T, Solachidis V, Dimitropoulos K, Daras P. Cross-modal variational alignment of latent spaces. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). Seattle, WA, USA, IEEE, 2020, 4127–4136 DOI:10.1109/cvprw50498.2020.00488


Kingma D P, Welling M. Auto-encoding variational Bayes. 2013


Yang L L, Yao A. Disentangling latent hands for image synthesis and pose estimation. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, CA, USA, IEEE, 2019, 9869–9878 DOI:10.1109/cvpr.2019.01011


Baek S, Kim K I, Kim T K. Weakly-supervised domain adaptation via GAN and mesh model for estimating 3D hand poses interacting objects. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, WA, USA, IEEE, 2020, 6120–6130 DOI:10.1109/cvpr42600.2020.00616


Hasson Y, Tekin B, Bogo F, Laptev I, Pollefeys M, Schmid C. Leveraging photometric consistency over time for sparsely supervised hand-object reconstruction. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, WA, USA, IEEE, 2020, 568–577 DOI:10.1109/cvpr42600.2020.00065


Baek S, Kim K I, Kim T K. Pushing the envelope for RGB-based dense 3D hand pose estimation via neural rendering. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, CA, USA, IEEE, 2019, 1067–1076 DOI:10.1109/cvpr.2019.00116


Boukhayma A, de Bem R, Torr P H S. 3D hand shape and pose from images in the wild. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, CA, USA, IEEE, 2019, 10835–10844 DOI:10.1109/cvpr.2019.01110


Yang J, Chang H J, Lee S, Kwak N. SeqHAND: RGB-sequence-based 3D hand pose and shape estimation. In: Computer Vision–ECCV 2020. Cham: Springer International Publishing, 2020, 122–139 DOI:10.1007/978-3-030-58610-2_8


Zhang X, Li Q, Mo H, Zhang W B, Zheng W. End-to-end hand mesh recovery from a monocular RGB image. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV). Seoul, Korea (South), IEEE, 2019, 2354–2364 DOI:10.1109/iccv.2019.00244


Zhou Y, Habermann M, Xu W, Habibie I, Xu F. Monocular realtime hand shape and motion capture using multi-modal data. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2020


Kulon D, Güler R A, Kokkinos I, Bronstein M M, Zafeiriou S. Weakly-supervised mesh-convolutional hand reconstruction in the wild. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, WA, USA, IEEE, 2020, 4989–4999 DOI:10.1109/cvpr42600.2020.00504


Moon G, Lee K M. I2L-MeshNet: image-to-lixel prediction network for accurate 3D human pose and mesh estimation from a single RGB image. In: Computer Vision–ECCV 2020. Cham: Springer International Publishing, 2020, 752–768 DOI:10.1007/978-3-030-58571-6_44


Cai Y, Ge L, Liu J, Cai J, Cham T, Yuan J, Thalmann N M. Exploiting spatial-temporal relationships for 3D pose estimation via graph convolutional networks. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV). 2019, 2272-2281 DOI:10.1109/ICCV.2019.00236


Fan Z P, Liu J, Wang Y. Adaptive computationally efficient network for monocular 3D hand pose estimation. In: Proc. European Conference on Computer Vision (ECCV). 2020


Han S, Liu B, Cabezas R, Twigg C D, Wang R. MEgATrack: monochrome egocentric articulated hand-tracking for virtual reality. ACM Transactions on Graphics (TOG), 2020


Panteleris P, Kyriazis N, Argyros A A. 3D tracking of human hands in interaction with unknown objects. In: Proc. British machine vision conference (BMVC). 2015


Panteleris P, Argyros A. Back to RGB: 3D tracking of hands and hand-object interactions based on short-baseline stereo. In: Proc. IEEE International Conference on Computer Vision Workshops (ICCVW). 2017


Romero J, Kjellström H, Kragic D. Hands in action: real-time 3D reconstruction of hands in interaction with objects. In: 2010 IEEE International Conference on Robotics and Automation. Anchorage, AK, USA, IEEE, 2010, 458–463 DOI:10.1109/robot.2010.5509753


Choi C, Yoon S H, Chen C, Ramani K. Robust hand pose estimation during the interaction with an unknown object. In: 2017 IEEE International Conference on Computer Vision (ICCV). 2017, 3142–3151 DOI:10.1109/ICCV.2017.339


Tekin B, Bogo F, Pollefeys M. H+O: unified egocentric recognition of 3D hand-object poses and interactions. 2019


Doosti B, Naha S, Mirbagheri M, Crandall D J. HOPE-Net: A Graph-based model for hand-object pose estimation. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2020 DOI:10.1109/CVPR42600.2020.00664


Huang L, Tan J, Meng J, Liu J, Yuan J. HOT-Net: Non-autoregressive transformer for 3D hand-object pose estimation. In: Proceedings of the 28th ACM International Conference on Multimedia. Seattle, WA, USA. Association for Computing Machinery, 2020, 3136–3145 DOI:10.1145/3394171.3413775


Cham, Springer International Publishing, 2020, 548–564


Wang Y, Peng C, Liu Y. Mask-pose cascaded CNN for 2D hand pose estimation from single color image. IEEE Transactions on Circuits and Systems for Video Technology, 2019, 29(11): 3258–3268 DOI:10.1109/TCSVT.2018.2879980


Zimmermann C, Ceylan D, Yang J, Russell B, Argus M J, Brox T. FreiHAND: A dataset for markerless capture of hand pose and shape from single RGB images. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV). 2019, 813–822 DOI:10.1109/ICCV.2019.00090


Hampali S, Rad M, Oberweger M, Lepetit V. HOnnotate: a method for 3D annotation of hand and object poses. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2020, 3193–3203 DOI:10.1109/CVPR42600.2020.00326


Simon T, Joo H, Matthews I, Sheikh Y. Hand keypoint detection in single images using multiview bootstrapping. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017, 4645–4653 DOI:10.1109/CVPR.2017.494


Gomez-Donoso F, Orts-Escolano S, Cazorla M. Large-scale multiview 3D hand pose dataset. Image and Vision Computing, 2019, 81: 25–33 DOI:10.1016/j.imavis.2018.12.001


Chang A X, Funkhouser T, Guibas L, Hanrahan P, Yu F. ShapeNet: an information-rich 3D model repository. 2015