

2021,  3 (3):   183 - 206

Published Date:2021-6-20 DOI: 10.1016/j.vrih.2021.05.001


In recent years, gesture recognition has been widely used in the fields of intelligent driving, virtual reality, and human-computer interaction. With the development of artificial intelligence, deep learning has achieved remarkable success in computer vision. To help researchers better understand the development status of gesture recognition in video, this article provides a detailed survey of the latest developments in deep learning-based gesture recognition technology for videos. The reviewed methods are broadly categorized into three groups based on the type of neural network used for recognition: two-stream convolutional neural networks, 3D convolutional neural networks, and long short-term memory (LSTM) networks. In this review, we discuss the advantages and limitations of existing technologies, focusing on how they extract the spatiotemporal structure information in a video sequence, and consider future research directions.


1 Introduction
Human-computer interaction (HCI) technology[1] is used for the natural interaction between humans and machines. Traditional human-machine interaction is accomplished primarily through devices such as mice, keyboards, and screens. These devices generally need to be connected to a computer, and they make it easier to operate a computer manually. Nevertheless, in some specific scenarios such as virtual reality (VR), remote command, and augmented reality (AR), it is still inconvenient to use these contact-based devices. Therefore, establishing a harmonious and natural HCI environment that conforms to human communication habits is a vital research topic.
At present, some intuitive and simple ways of HCI are slowly emerging, among which the most rapidly developing one is body movement recognition[2-4]. Since the earliest days of human communication, intentions have been expressed through body movement[5]. Later, as human communication developed, language became its most important form, with body movements remaining an important aid for expressing one's meaning. Recognition of body movement can be divided into gesture recognition[6,7] and action recognition[8-10]. The principles of gesture and action recognition are similar: both recognize actions by analyzing and understanding user movement. However, gestures are mainly movements made by the hands, whereas body movements include movements performed by all parts of the body. As hands are a part of the human body, action and gesture recognition can be algorithmically linked. Nevertheless, because the structures of the hand and the human body have different characteristics, gesture and action recognition still differ from each other.
Gestures are physical movements of the fingers, hands, arms, head, face, or body. In many vital scenarios, gestures are an essential means of communication. They allow computers to understand the performer's intention and can therefore be used to issue specific commands to devices.
Gesture data can be obtained through wearable devices[11-13] or computer vision[14-16]. Special wearable devices, such as data gloves[17,18], can detect specific finger curvature information and the spatial position of the hand and arm. With computer vision, gesture recognition is realized by analyzing and modeling videos of user gestures. Both approaches have advantages and disadvantages. Wearable devices can capture hand changes more accurately and thus achieve higher recognition accuracy. However, this method requires special equipment that is not only expensive but also inconvenient for daily use. In contrast, computer vision captures the user's hand movements with a camera and no other equipment, and is therefore more convenient. Gesture recognition using computer vision is not only simple and easy to use, but its recognition accuracy also continues to improve. With improvements in the computing power of hardware and in the performance of recognition algorithms, methods based on computer vision have gained increasing attention from researchers.
Gesture recognition has a wide range of applications in various HCI environments. In intelligent driving[19,20], the direct control of in-vehicle devices through gestures can improve control efficiency and reduce driving safety risks. For family entertainment[21], games controlled by gestures are gaining an increasing market share. In social welfare[22,23], sign language recognition allows people with disabilities to integrate into society. In addition, gesture recognition technology can be applied to important systems in fields such as aviation, education, and home automation[24-28].
Several survey papers have summarized the research on gesture recognition. Specifically, Escalera[29] and D'Orazio[30] reviewed the challenges and methods for gesture recognition using multimodal data. The reviews presented by Nyaga and Rautaray[31,32] focus on the three main phases of hand gesture recognition, i.e., detection, tracking, and recognition. Cheng et al. focused on 3D hand gesture recognition and thoroughly summarized the recent progress from static, trajectory, and continuous viewpoints[33]. Khan et al. covered many applications and techniques and explained the recognition system framework and its main phases[34]; however, the techniques they covered are now dated. The review presented by Devi et al.[35] is a comprehensive investigation of automated dance gesture recognition with an emphasis on static hand gesture recognition. Gao et al. reviewed the methods built on the concept of dynamic maps in human motion recognition[36]. In contrast to the above literature, which provides broad and general gesture recognition reports, we focus on isolated dynamic gesture recognition in videos, rather than continuous gestures, and summarize the recent progress in gesture recognition using different time modeling methods. This survey covers most of the emerging hand gesture recognition approaches based on deep learning.
A key contribution of this survey is the focus on the three spatiotemporal architectures of neural networks used in various deep learning methods reviewed, namely two-stream-based, 3D-based, and LSTM-based. Figure 1 illustrates the taxonomy of this review. This review is dedicated to video-gesture recognition using deep learning. In addition, the advantages and limitations of the reviewed methods are discussed from the spatiotemporal information encoding viewpoint and potential directions for future research are suggested. This study distinguishes itself from others through the following contributions.
• Latest research and most advanced deep learning-based methods developed in recent years are comprehensively covered.
• The methods covered in this review are analyzed in depth and classified according to the manner in which they model temporal and spatial information, highlighting their advantages and disadvantages.
• The challenges of gesture recognition from videos, and the limitations of existing methods are analyzed and discussed.
• Additionally, several recently released or commonly used independent dynamic gesture datasets associated with deep learning are discussed.
The remainder of the paper is organized as follows: Section 2 summarizes related datasets on isolated dynamic gestures, and Section 3 briefly reviews related work on multimodal inputs. Section 4 details the work on dynamic gesture modeling, which includes both handcrafted feature-based and deep learning-based approaches. The current challenges and specific application scenarios for dynamic gesture recognition algorithms are presented in Sections 5 and 6, respectively. Finally, we conclude the paper in Section 7.
2 Dynamic independent gesture dataset
Gesture recognition is a research hotspot in the field of HCI[37,38]. Depending on whether the gesture occurs at a certain moment or over a certain period, it can be classified as either a static gesture[39-41] or dynamic gesture[42-44]. Static gesture recognition uses a gesture image captured at a certain point in time, and its recognition result is related to the appearance of the hand in the image, such as position, contour, and texture. Dynamic gesture recognition uses an image sequence captured over a continuous period, and its recognition result is not only related to the appearance of the hand but also the temporal features describing the trajectory of the hand in the sequence. Compared with static gestures, dynamic gestures are more varied, more expressive, and more practical. For the purpose of this survey, six isolated gesture datasets that have been commonly adopted for evaluating deep learning-based methods are described. Table 1 shows the statistics of the publicly available datasets.
Table 1 Statistics of the publicly available benchmark datasets that are commonly used for the evaluation of deep learning methods. (Notation: D: Depth, S: Skeleton, UM: User mask, #: number of)
Dataset | Year | Acquisition device | Modality | #Classes | #Subjects | #Samples | #Scenes | Metric
20BN-jester dataset | 2019 | Laptop camera or webcam | RGB | 27 | 1376 | 148092 | - | Accuracy
Montalbano dataset (V2) | 2014 | Kinect v1 | RGB, D, S, UM | 20 | 27 | 13858 | - | Jaccard index
ChaLearn LAP IsoGD | 2016 | Kinect v1 | RGB, D | 249 | 21 | 47933 | - | Accuracy
DVS128 Gesture dataset | 2017 | DVS128 and webcam | RGB | 11 | 29 | 1342 | 1 | Accuracy
SKIG | 2013 | Kinect v1 | RGB, D | 10 | 6 | 2160 | 3 | Accuracy
EgoGesture dataset | 2018 | Intel RealSense | RGB, D | 83 | 50 | 24161 | 6 | Accuracy
2.1 20BN-jester dataset
The 20BN-jester dataset is a large collection of labeled video clips that show humans performing predefined hand gestures in front of a laptop camera or webcam[45], as shown in Figure 2. During data acquisition, 1376 actors recorded a set of 27 gesture classes, so the clips cover a wide variety of backgrounds and actors. The dataset is split into a training set of 118562 videos, a validation set of 14787 videos, and a test set of 14743 videos.
2.2 Montalbano dataset (v2)
The Montalbano dataset[46] covers a vocabulary of 20 different Italian cultural anthropological signs, performed by 27 different users with variations in surroundings, clothing, lighting, and gesture movement. The starting and ending frames of each gesture were annotated along with the gesture class label. The dataset consists of 7754 gestures for training, 3362 gestures for validation, and 2742 gestures for testing. The videos were recorded using Microsoft Kinect and contain RGB, depth, user mask, and skeleton/joint information for each frame, as shown in Figure 3.
2.3 ChaLearn LAP IsoGD
The IsoGD dataset is the official dataset used in the ChaLearn LAP Large-scale Isolated Gesture Recognition Challenge. It was derived from the ChaLearn Gesture Dataset (CGD), which was expanded by Wan et al.[47] through collection with the Kinect device. This dataset is currently the largest gesture-recognition dataset. The entire dataset contains 47933 videos covering 249 gesture classes, and each sample has two modalities, RGB and depth. To represent real-life gestures faithfully, the number of frames per video is not fixed, and the official data are not normalized to a common length: the shortest video has approximately nine frames, and the longest has 405 frames. The resolution of each frame is fixed. The IsoGD dataset is divided into a training set, validation set, and test set with 35878, 5784, and 6271 videos, respectively. Each video corresponds to only one category, which means that classifying a single gesture does not require handling continuous gestures. Each gesture is shot in a natural scene, and to improve the robustness of the algorithms, variations such as lighting changes, background changes, and nonstandard gestures are included. The entire dataset has 21 performers, of which 17 appear in the training set, 2 in the validation set, and 2 in the test set. The complete separation of the performers across the sets effectively prevents performer-specific characteristics from interfering with the results, thereby enabling a more reliable verification of the accuracy of gesture recognition algorithms. Figure 4 shows samples from the IsoGD dataset.
2.4 DvsGesture dataset
The DvsGesture dataset[48] comprises 1342 instances of a set of 11 hand and arm gestures, grouped into 122 trials collected from 29 subjects under three different lighting conditions. During each trial, one subject stood against a stationary background and performed all 11 gestures sequentially under the same lighting conditions. The gestures include hand waving (both arms), large straight arm rotations (both arms, clockwise and counterclockwise), forearm rolling (forward and backward), air guitar, air drums, and one other gesture invented by the subject. The three lighting conditions were combinations of natural light, fluorescent light, and LED light, which were selected to control the effect of shadows and fluorescent light flicker on the DVS128. Each gesture lasted approximately six seconds. The dataset includes timestamped DVS128 event files, RGB videos from a webcam mounted next to the DVS, and ground-truth files with gesture labels and start and stop times. Figure 5 shows sample frames from the dataset.
2.5 Sheffield Kinect gesture dataset
The Sheffield Kinect Gesture (SKIG) dataset[49] contains 2160 hand gesture sequences (1080 RGB sequences and 1080 depth sequences) collected from six subjects. All these sequences were synchronously captured using a Kinect sensor (including an RGB camera and a depth camera). As shown in Figure 6, this dataset collects 10 categories of hand gestures in total: circle (clockwise), triangle (anti-clockwise), up-down, right-left, wave, "Z," cross, come here, turn-around, and pat. In the collection process, all ten categories were performed with three hand postures: fist, index finger, and flat. To increase the diversity, the sequences were recorded using three different backgrounds (i.e., wooden board, white plain paper, and paper with characters) and two illumination conditions (i.e., intense and low light).
2.6 EgoGesture dataset
The EgoGesture dataset[50] is a multimodal large-scale dataset for egocentric hand gesture recognition. The dataset contains 2081 RGB-D videos, 24161 gesture samples, and 2953224 frames from 50 distinct subjects, which were collected in six different indoor and outdoor scenes. As shown in Figure 7, there are 83 classes of gestures. The dataset captures both the RGB and depth modalities through the Intel RealSense SR300. The dual-modality videos were recorded at a resolution of 640×480 pixels at 30 fps. The start and end frame indices of each gesture sample were also labeled manually, which provides a test bed for segmented gesture classification. There are 1239 training, 411 validation, and 431 testing videos. The numbers of gesture samples in the training, validation, and testing sets are 14416, 4768, and 4977, respectively.
Table 1 shows the statistics of the publicly available isolated gesture datasets commonly used for the evaluation of deep learning-based algorithms. The surveyed datasets cover a wide range of different types of actions, including gestures, simple actions, daily activities, human-object interactions, and human-human interactions. They also cover different acquisition devices, modalities, and views.
3 Multimodal inputs
In general, each form of information can be regarded as a modality. For example, an image or video can be categorized as RGB, depth, or optical flow data depending on the type of sensor that acquired it[51], and each of these can be referred to as a modality. Even though the different modalities express certain distinctive features (such as the outline of an object) consistently, they often differ in how they express details. As shown in Figure 8, the RGB modality is better able to represent the texture of objects. Compared to the RGB modality, the depth modality reflects the distance information of the gesture. More importantly, the depth modality is not affected by environmental factors such as light intensity and background color and remains relatively stable under changes in the environment. The optical flow modality is good at capturing the motion context of a gesture, which is very important for dynamic gesture recognition. The skeleton modality reduces the impact of the shooting angle and environment on recognition and improves the accuracy of recognition in complex environments. These modalities are complementary in terms of information characterization, so the limitations of single-modality data can be effectively compensated for by introducing other modalities. Therefore, the fusion of multimodal information can provide sufficient feature information to characterize the target comprehensively.
For example, the appearance of the "reach out" gesture in the RGB modality differs very little across the frame sequence, making it difficult to distinguish the gesture category. However, in the corresponding depth modality, the pixel values of the hand vary with the distance between the hand and the camera. Thus, it is easy to identify the gesture based on pixel value differences. Conversely, the "thumbs up" gesture in the depth modality shows only the outline of the hand, and the details are very vague in each frame, whereas the appearance of the hand is clearly visible in the RGB modality. Therefore, the recognition of the "thumbs up" gesture can be greatly improved with the RGB modality. In both the depth and RGB modalities, the hand appearance of the "applause" gesture changes little throughout the frame sequence. However, the hand movement information in the optical flow modality is more prominent, which helps to identify the gesture.
Wang et al. proposed a heterogeneous network-based framework for recognizing gestures[52]. This approach utilizes a 3D convolutional LSTM[53] to recognize dynamic gestures in video and uses convolutional networks to recognize gestures in dynamic image sequences constructed by rank pooling. Both networks use the RGB and depth modalities, and these data contain only the hand regions. The final recognition result is obtained by taking the mean value of the output of each modality. Köpüklü et al. proposed motion fused frames (MFFs), in which gestures are represented by fusing RGB images with optical flow images[54]. Some studies use the RGB and depth modalities together to describe the movement of the hand[53,55,56], whereas others are based on the fused features of the RGB, depth, and flow modalities for dynamic gesture recognition[42-44]. These studies confirm that multimodal input is an effective means to enhance feature robustness and can significantly improve the accuracy of gesture recognition.
According to the level of data abstraction, fusion methods can be divided into data-level fusion[37], feature-level fusion[57-59], and decision-level fusion[53,56,64]. Table 2 shows the statistical information of the three fusion strategies related to the publicly available isolated gesture datasets covered in this review.
Table 2 Performance statistics among different fusion methods
Fusion level | Literature | Strategy | Dataset | Accuracy/Jaccard index (%)
Data-level | [54] | Motion fused | 20BN Jester dataset V1 | 96.28
Data-level | [54] | Motion fused | ChaLearn LAP IsoGD | 57.40
Feature-level | [57] | AvgFusion | SKIG | 99.5
Feature-level | [57] | AvgFusion | ChaLearn LAP IsoGD | 58.65
Feature-level | [58] | Concatenation | ChaLearn LAP IsoGD | 64.40
Feature-level | [59] | CCA | ChaLearn LAP IsoGD | 68.14
Feature-level | [60] | Concatenation | ChaLearn LAP ConGD | 26.55
Feature-level | [61] | Concatenation | Montalbano | 88.90
Decision-level | [62] | Sparse fusion | ChaLearn LAP IsoGD | 80.96
Decision-level | [63] | ModDrop fusion | Montalbano | 85.00
Data-level fusion occurs before the data are fed to the network. This method trains only one network for multimodal inputs, significantly reducing the number of trainable parameters. However, because the input is complicated, it is difficult for the feature extractor to learn the optimal information carried by each channel simultaneously, which results in a low recognition rate. Moreover, the solution is difficult to implement when multimodal data are acquired from the source. In 2018, Köpüklü et al. applied data-level fusion to action and gesture recognition for the first time[54]. The method concatenates several optical flow frames with a single RGB image along the channel dimension and uses the result as the input to a deep network, so that the input carries both spatial and temporal information for dynamic gesture recognition. The accuracy of this method was only 57.4% on the IsoGD[65] test set.
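The channel-wise concatenation described above can be illustrated with a minimal sketch. It is not the implementation of [54]: the function name `fuse_data_level`, the nested-list image layout (channels × height × width), the two-channel flow representation, and the channel ordering are all assumptions made for illustration.

```python
def fuse_data_level(rgb_frame, flow_frames):
    """Concatenate optical flow frames with one RGB frame along the channel axis.

    rgb_frame:   list of 3 channels, each an H x W nested list (assumed layout).
    flow_frames: list of flow images, each a list of 2 channels (horizontal and
                 vertical displacement), each channel an H x W nested list.
    Returns one multi-channel input with 3 + 2 * len(flow_frames) channels.
    """
    height, width = len(rgb_frame[0]), len(rgb_frame[0][0])
    fused = list(rgb_frame)  # start with the three RGB channels
    for flow in flow_frames:
        for channel in flow:  # append the two flow components of each frame
            if len(channel) != height or len(channel[0]) != width:
                raise ValueError("flow and RGB frames must share a resolution")
            fused.append(channel)
    return fused


# Toy example: a 2 x 4 RGB frame fused with three 2 x 4 flow frames.
rgb = [[[0] * 4 for _ in range(2)] for _ in range(3)]
flows = [[[[1] * 4 for _ in range(2)] for _ in range(2)] for _ in range(3)]
print(len(fuse_data_level(rgb, flows)))  # number of channels fed to the network
```

Because the fused tensor is consumed by a single network, only the first convolutional layer needs to change (to accept more input channels), which is why this strategy adds few parameters.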
Feature-level fusion can occur anywhere between the input and output. One study[60] first extracted temporal features by inputting RGB and depth data into two-stream recurrent neural networks (2S-RNNs) and then fused the two features using an LSTM layer. In other studies[57,59], the output of each feature extractor is transformed into a vector. These vectors are concatenated into a single vector, and the fused vector is processed by an SVM classifier to obtain the predictions of dynamic gestures. Miao et al. proposed a fusion method based on canonical correlation analysis (CCA)[58]. This method fuses the spatiotemporal features generated from the RGB, depth, and flow modalities to maximize the pairwise correlation among the modality features. Finally, the fused features are processed by an SVM classifier to obtain gesture recognition results. Wu et al. fused multimodal features by a weighted sum for the final recognition[61]. Although feature-level fusion can learn the spatiotemporal sequence features of each channel independently, the number of parameters to be trained increases. Furthermore, fused features tend to lose the advantages of the information of each channel and the correlation information between channels.
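Two of the simpler feature-level strategies mentioned above, concatenation and weighted sum, can be sketched as follows. This is only an illustration under assumed names (`concat_fusion`, `weighted_sum_fusion`) and toy feature vectors; the feature extractors and the downstream classifier (for example, an SVM) are omitted.

```python
def concat_fusion(features):
    """Concatenate per-modality feature vectors into one fused vector."""
    fused = []
    for vector in features:
        fused.extend(vector)
    return fused


def weighted_sum_fusion(features, weights):
    """Fuse equal-length per-modality feature vectors by a weighted sum."""
    if len(features) != len(weights):
        raise ValueError("one weight is required per modality")
    length = len(features[0])
    if any(len(vector) != length for vector in features):
        raise ValueError("weighted sum needs feature vectors of equal length")
    return [
        sum(w * vector[i] for w, vector in zip(weights, features))
        for i in range(length)
    ]


# Hypothetical 2-D features from an RGB branch and a depth branch.
rgb_feat, depth_feat = [1.0, 2.0], [3.0, 4.0]
print(concat_fusion([rgb_feat, depth_feat]))
print(weighted_sum_fusion([rgb_feat, depth_feat], [0.5, 0.5]))
```

Concatenation preserves every modality's features but grows the input of the classifier, whereas a weighted sum keeps the dimensionality fixed at the cost of mixing the modalities.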
Compared with data- and feature-level fusion, decision-level fusion is much simpler. This approach avoids the problems associated with data- and feature-level fusion while taking advantage of the information carried by each modality without modifying the network structure. The networks for the different inputs are trained independently of each other. Specifically, for each input, the softmax function at the last layer of the network outputs the prediction confidence for each gesture class. All confidence outputs are then combined by taking the maximum or the average to obtain the fusion result, which is the final output. The problem with this type of fusion is that taking the maximum directly ignores the information provided by the non-maximal counterparts, and averaging weakens the strength of the critical information. Both processes limit the fusion effect. To solve this problem, Narayana et al. proposed training a neural network that learns the weight of each input stream for each class of dynamic gestures[62]. Neverova et al. presented a method for gesture detection and localization based on multi-scale and multimodal deep learning[63]. The key to this technique is a training strategy that exploits gradual fusion involving random dropping of separate channels (called ModDrop) for learning cross-modality correlations while preserving the uniqueness of each modality-specific representation. Wang et al. introduced a score fusion strategy in decision fusion for gesture recognition, substantially improving the recognition accuracy[66]. Three kinds of depth dynamic images are generated from a video sequence (sample), referred to respectively as dynamic depth images (DDI), dynamic depth normal images (DDNI), and dynamic depth motion normal images (DDMNI). For each kind of dynamic image, two images are generated by forward and backward computation. This yields three pairs of forward and backward dynamic images for a testing video sequence (sample), which are fed into six separately trained convolutional networks. For each forward-backward pair of dynamic images, multiply-score fusion is used: the score vectors output by the two convolutional networks of the pair are multiplied in an element-wise manner, and the resultant score vector is then normalized using the L1 norm. The three normalized score vectors are then multiplied in an element-wise manner, and the maximum score in the resultant vector is assigned as the probability of the test sequence being the recognized class. The index of the maximum score corresponds to the recognized class label.
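The decision-level rules discussed above (maximum, average, and multiply-score fusion) can be written compactly. The sketch below is a hedged illustration rather than the code of [66]: the function names and the toy score vectors are assumptions, and each score vector stands in for the softmax output of one trained network.

```python
def average_fusion(scores):
    """Average the class-score vectors of several networks."""
    return [sum(column) / len(scores) for column in zip(*scores)]


def max_fusion(scores):
    """Keep, for every class, the largest score produced by any network."""
    return [max(column) for column in zip(*scores)]


def l1_normalize(vector):
    """Scale a vector so that its absolute values sum to one."""
    total = sum(abs(x) for x in vector)
    return [x / total for x in vector] if total else list(vector)


def multiply_score_fusion(pairs):
    """Fuse (forward, backward) score-vector pairs by element-wise products.

    Each pair is multiplied element-wise and L1-normalized; the normalized
    vectors of all pairs are then multiplied element-wise. Returns the index
    of the winning class and its fused score.
    """
    fused = None
    for forward, backward in pairs:
        pair_score = l1_normalize([f * b for f, b in zip(forward, backward)])
        if fused is None:
            fused = pair_score
        else:
            fused = [a * b for a, b in zip(fused, pair_score)]
    label = max(range(len(fused)), key=fused.__getitem__)
    return label, fused[label]


# Toy two-class scores for three forward/backward pairs (e.g., DDI, DDNI, DDMNI).
pairs = [([0.6, 0.4], [0.7, 0.3]), ([0.5, 0.5], [0.8, 0.2]), ([0.9, 0.1], [0.6, 0.4])]
label, score = multiply_score_fusion(pairs)
print(label, score)
```

Unlike the maximum, the product uses the scores of every network, and unlike the average, a class must score well in all networks to keep a high fused score.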
4 Dynamic gesture recognition algorithm
Dynamic gesture recognition algorithms can be divided into two categories according to how features are acquired: machine learning algorithms based on handcrafted features and deep learning algorithms based on computer vision[67,68].
4.1 Dynamic gesture recognition algorithm based on machine learning
Dynamic gestures involve both temporal and spatial information, and classical methods include the dynamic time warping (DTW) algorithm[69,70] and the hidden Markov model (HMM)[71,72]. DTW measures the similarity of two time series of different lengths and was originally used for speech recognition. In gesture recognition, DTW is used to compare the similarity between two videos. Sometimes two videos are not aligned in the time domain despite their strong similarity, so they need to be aligned before their similarity is compared. Corradini et al. first used a DTW-based algorithm for the recognition of dynamic gestures[73]. The algorithm first pre-processes each gesture in the training set, then extracts features and normalizes them into a sequence template. The same operation is repeated on the gestures in the test set. Finally, each test sequence is matched against every template from the training set, and the recognition result is the category of the template with the smallest difference. The DTW method is a template-matching approach that does not use a statistical model framework for training, and it is difficult to fully exploit temporal information with it. Therefore, it has certain difficulties when dealing with large data volumes and complex gestures.
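The template-matching procedure described above can be sketched with the textbook DTW recurrence. This is a generic illustration, not the method of [73]: the function names, the use of Euclidean distance between per-frame feature vectors, the single template per class, and the toy 2-D hand trajectories are assumptions.

```python
import math


def dtw_distance(seq_a, seq_b):
    """Dynamic time warping cost between two sequences of feature vectors."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] is the best alignment cost of seq_a[:i] with seq_b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # seq_a advances, seq_b frame is repeated
                cost[i][j - 1],      # seq_b advances, seq_a frame is repeated
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]


def classify_dtw(sample, templates):
    """Return the label of the template with the smallest DTW cost."""
    return min(templates, key=lambda label: dtw_distance(sample, templates[label]))


# Hypothetical hand-centre trajectories used as class templates.
templates = {
    "swipe": [(0, 0), (1, 0), (2, 0)],
    "raise": [(0, 0), (0, 1), (0, 2)],
}
print(classify_dtw([(0, 0), (0.5, 0), (1, 0), (2, 0)], templates))
```

The quadratic table makes each comparison cost O(nm), and every test sequence must be compared with every template, which is one reason the approach scales poorly to large gesture vocabularies.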
HMM is a traditional model used to deal with problems based on time series or state sequences. In the dynamic gesture recognition task, each gesture category corresponds to an HMM. In training, the gesture samples are first grouped by category, and then a corresponding HMM is trained for each gesture category using the forward and backward algorithms. During testing, the forward algorithm is used to calculate the probability that each HMM produces the test sample, and the category corresponding to the HMM with the largest probability is the recognition result for the test sample. Saha et al. designed a method based on the HMM algorithm to classify 12 dynamic gestures under 60 different themes[74]. The classification accuracy achieved for each gesture type was nearly 90%. A dynamic gesture recognition algorithm based on HMM was proposed by Yang et al.[75]. First, the image was converted to the HSV color space, and the hand area was segmented using the color threshold method. To solve the problem of dividing continuous video sequences in practical applications, a state-based positioning algorithm was proposed to find the starting and ending points of gestures. After each gesture was segmented independently, the features of each gesture sequence were extracted. The features included the position, speed, size, shape, and other information of the hand. Finally, the HMM was trained using the feature vector set. The algorithm[75] relies on the hidden Markov toolset written by Kevin P. Murphy, a researcher from Google, which has been integrated into PMTK, a machine learning library based on probabilistic models. A dynamic gesture recognition algorithm based on geometric features that combines DTW and HMM has also been proposed[76]. The algorithm mainly includes three steps: image preprocessing, gesture feature extraction, and model construction based on DTW and HMM. Image preprocessing is based on a skin color model to segment the hand region. Feature extraction obtains hand information, including movement center, edge change, number of fingers, wrist angle, finger angle, area radius, and cross-sectional area. The HMM and DTW models are constructed by clustering feature sets with the K-means algorithm. Neverova et al. presented a new gesture recognition method using multimodal data, which was constructed by nonparametrically estimating multimodal densities from a training dataset[63].
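The testing step of the HMM approach described above, in which a test sample is scored by every per-class HMM with the forward algorithm and assigned to the most likely class, can be sketched as follows. The sketch assumes discrete observation symbols (for instance, indices of quantized feature vectors) and already-trained model parameters; the function names and the toy models are hypothetical, and training is omitted.

```python
def forward_likelihood(observations, start, trans, emit):
    """Probability that a discrete HMM generates the observation sequence.

    start[s]    : probability of starting in state s
    trans[p][s] : probability of moving from state p to state s
    emit[s][o]  : probability of emitting symbol o in state s
    """
    states = range(len(start))
    alpha = [start[s] * emit[s][observations[0]] for s in states]
    for symbol in observations[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in states) * emit[s][symbol]
            for s in states
        ]
    return sum(alpha)


def classify_hmm(observations, models):
    """Return the label of the HMM that assigns the highest likelihood."""
    return max(
        models, key=lambda label: forward_likelihood(observations, *models[label])
    )


# Two toy left-to-right models over the symbol alphabet {0, 1}.
start = [1.0, 0.0]
trans = [[0.5, 0.5], [0.0, 1.0]]
models = {
    "gesture_a": (start, trans, [[0.9, 0.1], [0.8, 0.2]]),  # mostly emits 0
    "gesture_b": (start, trans, [[0.1, 0.9], [0.2, 0.8]]),  # mostly emits 1
}
print(classify_hmm([0, 0, 1, 0], models))
```

For long sequences the unscaled probabilities underflow, so practical implementations work with scaled or log-domain values; that refinement is left out here for brevity.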
4.2 Dynamic gesture recognition algorithm based on deep learning
Deep learning methods are broadly categorized into three groups, depending on the time modeling method used for recognition, i.e., two-stream networks[77-80], 3D convolutional neural networks[80], and LSTM networks[36]. These were investigated from the perspective of the network architectures of convolutional neural networks (CNNs) and recurrent neural networks (RNNs).
Deep convolutional neural networks have provided a qualitative leap in the performance of many computer vision tasks. Convolutional neural networks have also been used in gesture and human behavior recognition to improve algorithm performance[81]. Researchers have explored various applications in many fields using different modal data. Xu used a CNN to recognize gestures and made it possible to identify relatively complex gestures using a single inexpensive monocular camera[82]. The combination of a skeletonization algorithm[41] and a CNN-based recognition algorithm reduced the impact of the shooting angle and environment on recognition and improved the accuracy of gesture recognition in complex environments. Pigou considered a recognition system using Microsoft Kinect, CNNs, and GPU acceleration; instead of relying on complex handcrafted features, CNNs automate the process of feature construction[83].
4.2.1 Method based on two-stream networks
As a general two-dimensional CNN can only extract spatial features, applying it alone to video data, which contain temporal information, loses a large amount of information along the time dimension. Therefore, some scholars have attempted to compensate for this shortcoming of CNNs by adding optical flow information, which is the initial idea behind the two-stream network.
The most noticeable feature of the two-stream network is that it consists of two subnetworks: a spatial network and a temporal network. The spatial network extracts the spatial information of the hand from RGB images, and the temporal network extracts the motion information of the hand from stacked optical flow images. The two types of information are fused to form spatiotemporal information for the video analysis task. The spatial and temporal networks have the same structure and can be built from classical 2D CNNs such as AlexNet, VGG, ResNet, and Inception.
The two-stream network architecture was first proposed by Simonyan and Zisserman[77] in 2014 and achieved good results in video classification tasks. Since then, deep learning has been widely used in the study of gesture recognition. As shown in Figure 9, a randomly sampled RGB frame from the video sequence is first used as the input to the spatial stream network, and the optical flow images of the next five frames are stacked as the input to the temporal stream network. The spatial and temporal features of the gestures are learned from the single-frame RGB image and the stacked optical flow images fed into the two subnetworks. The two features are then fused and used for the subsequent gesture category determination. This algorithm achieved the best results among the compared algorithms on the UCF-101[84] and HMDB-51[85] datasets. However, it could only capture the short-term content of the behavior, as only one RGB frame and a few optical flow frames from the video were selected as inputs for the two subnetworks. Long-term information plays a crucial role in understanding videos, and the loss of this information can greatly affect the recognition performance. To resolve this, the temporal segment network (TSN) was proposed by Wang et al.[78], which introduced a sparse sampling strategy to model long-range temporal information.
To obtain a more robust feature representation, TSN uses Inception v2[86] to build its subnetworks. The workflow is as follows. First, the video is uniformly divided into N segments. For each segment, a short snippet is sampled and processed as in Simonyan and Zisserman[77] to obtain the predicted probability vector for that snippet (the two subnetworks share their weights across all segments). The probability scores of all snippets are then averaged to obtain the prediction for the entire video.
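The sparse sampling and averaging steps described above can be sketched in a few lines. This is a minimal illustration, not the TSN implementation: the function names `sample_snippet_indices` and `average_consensus` are hypothetical, one frame index stands in for a whole snippet, and the per-snippet probability vectors are assumed to come from some classifier that is not shown.

```python
import random


def sample_snippet_indices(num_frames, num_segments, rng=None):
    """Pick one random frame index from each of `num_segments` equal-length segments."""
    if num_segments <= 0 or num_frames < num_segments:
        raise ValueError("need at least one frame per segment")
    rng = rng or random.Random()
    seg_len = num_frames / num_segments
    indices = []
    for k in range(num_segments):
        start = int(k * seg_len)
        end = max(start + 1, int((k + 1) * seg_len))  # exclusive upper bound
        indices.append(rng.randrange(start, end))
    return indices


def average_consensus(snippet_probs):
    """Average per-snippet class probability vectors into one video-level vector."""
    n = len(snippet_probs)
    num_classes = len(snippet_probs[0])
    return [sum(p[c] for p in snippet_probs) / n for c in range(num_classes)]


# Example: a 90-frame video split into N = 3 segments.
indices = sample_snippet_indices(num_frames=90, num_segments=3)
video_probs = average_consensus([[0.7, 0.3], [0.6, 0.4], [0.2, 0.8]])
```

Because each segment contributes exactly one snippet, the cost per video stays fixed regardless of its length, while the sampled snippets still span the whole sequence.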
To further improve the performance of the two-stream network, Feichtenhofer et al.[79] investigated how to fuse the spatial and temporal information in such networks. The experiments highlighted three points: first, fusing the two streams at a convolutional layer did not degrade the performance and also reduced the number of parameters; second, fusion at a higher convolutional layer of the network is better than fusion at a lower convolutional layer; and third, fusion on the category score can improve the performance. Score-level fusion is the approach used by Simonyan and Zisserman[77] and Wang et al.[78]. In the two-stream network, the motion information of a gesture is obtained from optical flow images. However, computing these images requires pixel-level computation on the corresponding RGB images, which is time-consuming and takes up substantial storage.
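Fusion on the category score, the simplest of the options discussed above, can be illustrated as a weighted average of the softmax outputs of the two streams. The sketch below is only an assumption about how such late fusion is typically written; the name `fuse_scores` and the `temporal_weight` parameter are hypothetical and do not come from the cited papers.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scores(spatial_scores, temporal_scores, temporal_weight=0.5):
    """Late fusion: weighted average of the two streams' class probabilities."""
    p_spatial = softmax(spatial_scores)
    p_temporal = softmax(temporal_scores)
    w = temporal_weight
    return [(1.0 - w) * a + w * b for a, b in zip(p_spatial, p_temporal)]


# Example with three gesture classes: the spatial stream prefers class 0,
# the temporal stream strongly prefers class 1.
fused = fuse_scores([2.0, 1.0, 0.1], [0.5, 3.0, 0.2])
predicted_class = max(range(len(fused)), key=fused.__getitem__)
```

Fusing at a convolutional layer instead would merge feature maps before classification, which is why it can save parameters: the layers after the fusion point are no longer duplicated.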
To solve this problem, Zhu et al. proposed training a CNN, MotionNet, to estimate the optical flow instead of computing it explicitly[80]. In this algorithm, MotionNet and the temporal network are cascaded, and several stacked RGB images are taken as input. The principles of dynamic gesture recognition and action recognition are similar but not identical: both try to understand the semantic information expressed by the human body through movement in a video.
With the successful application of two-stream networks in action recognition, many researchers have proposed dynamic gesture recognition methods based on two-stream networks. For example, Köpüklü et al. proposed feeding several optical flow frames together with a single RGB image into the temporal network of TSN for dynamic gesture recognition[54]. Based on skeletal data, Wang et al. proposed a two-stream RNN architecture that models both the temporal dynamics and the spatial configurations for skeleton-based action recognition[87].
Traditional 2D CNNs process image data with two-dimensional convolution and pooling, modeling only the two-dimensional spatial information of the data and ignoring the time-domain information. Dynamic gesture recognition, however, operates on video streams (image sequences that contain temporal information), so two-dimensional convolution and pooling alone can no longer meet the need. Consequently, to better model the temporal information of gestures in video data, a three-dimensional convolutional network is necessary.
4.2.2 Method based on 3D convolutional neural networks
3D CNNs can extract both spatial and temporal information from videos. They consist of several 3D convolutional layers, 3D pooling layers, and activation functions. The 3D convolutional and pooling layers operate on feature maps in much the same way as their 2D counterparts. The only difference is that 2D convolution and pooling operate on a single feature map with width and height dimensions, whereas 3D convolution and pooling operate simultaneously on a stack of feature maps with width, height, and time dimensions, as shown in Figure 10.
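To make the difference from 2D convolution concrete, the following sketch applies a single 3D kernel to a single-channel clip using plain nested lists. It is a deliberately naive illustration under simplifying assumptions (one channel, stride 1, no padding, and cross-correlation rather than a flipped kernel, as is conventional in deep learning); the function name `conv3d_valid` is hypothetical.

```python
def conv3d_valid(volume, kernel):
    """Slide `kernel[dt][dy][dx]` over `volume[t][y][x]` with stride 1 and no padding."""
    T, H, W = len(volume), len(volume[0]), len(volume[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                acc = 0.0
                # The extra loop over dt is what a 2D convolution lacks.
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += volume[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


# Three 2x2 frames: the scene changes between frame 0 and frame 1, then stays still.
clip = [
    [[0, 0], [0, 0]],
    [[1, 1], [1, 1]],
    [[1, 1], [1, 1]],
]
# A 2x1x1 temporal-difference kernel: responds only where consecutive frames differ.
temporal_diff = [[[-1]], [[1]]]
response = conv3d_valid(clip, temporal_diff)
```

With this kernel the output has two time steps: the first is non-zero everywhere because the frames changed, and the second is zero because they did not. A 2D kernel applied frame by frame could not produce this response.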
The most typical 3D CNN is the deep 3D convolutional network (C3D)[88]. It has a simple network structure consisting of eight convolutional layers, five pooling layers, two fully connected layers, and a softmax layer. Each convolutional layer is followed by a ReLU function that applies a nonlinear transformation. The first and second convolutional layers are each followed by a pooling layer, whereas the remaining convolutional layers are grouped in pairs, with each pair followed by a pooling layer. This structure can increase the receptive field of the neural network without adding parameters. Moreover, both fully connected layers are set with dropout = 0.3.
C3D[88] was among the first spatiotemporal feature extractors shown to be both effective and efficient, and it has been adopted by many dynamic gesture recognition algorithms. For example, Liu et al. proposed dynamic gesture recognition based on C3D[49]. Zhu et al. embedded pyramid input and pyramid fusion strategies into a C3D model[64]. In this algorithm, the pyramid input divides each gesture sequence into pyramid segments and samples each segment uniformly with temporal jitter, which preserves the multi-scale contextual information of the gesture sequence. Pyramid fusion is applied after the last fully connected layer of C3D, and a softmax layer follows the feature fusion to predict the gesture. Li et al. proposed a C3D gesture recognition method that generates multimodal fusion features by combining the advantages of the RGB and depth modalities[55]. Although C3D-based gesture recognition achieved good results, the simple network structure restricted its feature representation ability. Deepening the network is one solution, but increasing the depth indefinitely can cause the degradation problem. To solve this problem, Tran et al. introduced the concept of residuals into C3D and proposed the ResC3D network structure[89]. The core of ResC3D, the residual block, can learn an identity mapping, so the network can be deepened without degrading, and possibly even improving, the performance. Miao et al. first built a dynamic gesture recognition algorithm directly on ResC3D[58]. To make full use of the key information in the video, Li et al. introduced an attention mechanism before the feature extraction of the input sequence[59]. The purpose was to focus the network's attention on the video frames that effectively describe the gesture and on the regions where the motion occurs.
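The reason a residual block avoids degradation can be shown with a toy example. The sketch below works on a plain feature vector rather than on 3D feature maps, and `residual_block` and `zero_transform` are hypothetical names; it only illustrates the skip connection y = x + F(x), not the ResC3D architecture itself.

```python
def residual_block(x, transform):
    """Return x + F(x), where `transform` plays the role of the learned layers F."""
    fx = transform(x)
    return [a + b for a, b in zip(x, fx)]


def zero_transform(x):
    """Stand-in for layers whose weights have been driven to zero."""
    return [0.0 for _ in x]


# Stacking many blocks whose learned part is zero leaves the features unchanged,
# so extra depth cannot make the representation worse than the shallower network.
features = [0.5, -1.2, 3.0]
out = features
for _ in range(50):
    out = residual_block(out, zero_transform)
```

In a plain (non-residual) stack, the same layers would have to learn the identity mapping explicitly, which is harder to optimize than pushing the residual toward zero.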
4.2.3 Method based on recurrent neural networks
The temporal order of gestures is important for dynamic gesture recognition. To meet this need, another neural network structure, the recurrent neural network (RNN), has been adopted. An RNN is a type of neural network used to process sequence data. Compared with an ordinary feed-forward network, it can handle data whose meaning depends on position in the sequence. For example, the meaning of a word may differ depending on its context, and an RNN handles this type of problem well. For dynamic gesture recognition, an RNN can express the temporal dimension of the information by linking preceding and following frames. Therefore, this network structure has received considerable attention from researchers. LSTM[90] is a special type of RNN.
As shown in Figure 11, the input $x_t$ is received at time $t$, the value of the hidden layer is $S_t$, and the output is $O_t$. The key point is that the value of $S_t$ depends not only on $x_t$, but also on $S_{t-1}$. The final result $O_{t+1}$ of the network at time $t+1$ is the result of the input and all history at that time, which achieves the purpose of modeling a time series. Pigou et al. proposed a new end-to-end trainable neural network architecture that incorporates temporal convolutions and bidirectional recurrence[90].
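The dependence of the hidden value on both the current input and the previous hidden value can be written as a short loop. This is a scalar sketch under assumed notation: the weights `u`, `w`, and `v` and the tanh activation are common textbook choices, not details given in the text, and `rnn_forward` is a hypothetical name.

```python
import math


def rnn_forward(xs, u, w, v, s0=0.0):
    """Run a scalar recurrent network over the sequence `xs`.

    Hidden value: S_t = tanh(u * x_t + w * S_{t-1});  output: O_t = v * S_t.
    """
    s = s0
    outputs = []
    for x in xs:
        s = math.tanh(u * x + w * s)  # depends on x_t and on S_{t-1}
        outputs.append(v * s)
    return outputs


# An input at the first step still influences the outputs at later steps.
with_event = rnn_forward([1.0, 0.0, 0.0], u=1.0, w=0.5, v=1.0)
without_event = rnn_forward([0.0, 0.0, 0.0], u=1.0, w=0.5, v=1.0)
```

Comparing the two runs shows that the later outputs differ even though the later inputs are identical, which is the sense in which the output reflects the history. The influence also shrinks at every step, which anticipates the long-range limitation of plain RNNs.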
Molchanov et al.[56] proposed an R3DCNN framework with connectionist temporal classification (CTC) to perform gesture recognition. As shown in Figure 12, this approach cascades a 3D CNN with an RNN. The short-term information of the video is first learned using C3D, and the long-term temporal information of the sequence is learned using the RNN. The RNN strengthens the connection between preceding and following frames and improves the expression of temporal features for the entire video. Nevertheless, owing to problems such as exploding and vanishing gradients, RNNs tend to lose the ability to learn information over long periods.
As shown in Figure 12, a gesture video is presented in the form of short clips to a 3D CNN, which extracts local spatiotemporal features, $f_t$. These features are input to a recurrent network, which aggregates transitions across several clips. The recurrent network has a hidden state $h_{t-1}$, which is computed from the previous clips. The updated hidden state for the current clip, $h_t$, is input into a softmax layer to estimate the class-conditional probabilities, $S_t$, of the various gestures. During training, CTC is used as the cost function. An RNN can be seen as a neural network unrolled in time, whose depth is the length of the sequence. As mentioned above, the vanishing gradient phenomenon can therefore occur, this time along the time axis: the gradient at time $t$ vanishes after propagating through a few steps, so the distant past has almost no influence. Therefore, it is only in theory that "all history" works together; in practice, this effect can be maintained for only a few time steps. To address the vanishing gradient in time, the LSTM adds gates to the RNN that control what is written to and kept in a memory cell, which realizes a memory function over time and mitigates the vanishing gradient. A single LSTM unit is illustrated in Figure 13.
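The gating idea can be made concrete with a scalar LSTM step. This sketch follows the standard LSTM formulation (forget, input, and output gates plus a candidate value), which is assumed here rather than taken from the text; the function `lstm_step` and the layout of `params` are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One scalar LSTM step; `params[name]` is (input weight, recurrent weight, bias)."""
    def pre_activation(name):
        wx, wh, b = params[name]
        return wx * x + wh * h_prev + b

    f = sigmoid(pre_activation("forget"))       # how much of the old cell state to keep
    i = sigmoid(pre_activation("input"))        # how much of the candidate to write
    o = sigmoid(pre_activation("output"))       # how much of the cell state to expose
    g = math.tanh(pre_activation("candidate"))  # candidate value to write
    c = f * c_prev + i * g                      # additive cell-state update
    h = o * math.tanh(c)
    return h, c


# With the forget gate held open and the input gate held shut, the cell state
# survives many time steps almost unchanged.
params = {
    "forget": (0.0, 0.0, 10.0),
    "input": (0.0, 0.0, -10.0),
    "output": (0.0, 0.0, 10.0),
    "candidate": (0.0, 0.0, 0.0),
}
h, c = 0.0, 1.0
for _ in range(100):
    h, c = lstm_step(0.0, h, c, params)
```

The key design choice is the additive update of the cell state: when the forget gate is close to one, information and gradients pass along the cell state with little attenuation, instead of being repeatedly squashed as in a plain RNN.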
LSTM is designed to alleviate the vanishing and exploding gradient problems in RNN training. In other words, LSTM can perform better on longer sequences than ordinary RNNs. Zhu et al.[53] proposed a framework that cascades C3D with an LSTM network to extract the temporal features of dynamic gestures. As shown in Figure 14, the deep architecture is composed of five components: (a) input preprocessing, (b) 3D CNN, (c) convolutional LSTM, (d) spatial pyramid pooling and fully connected layers, and (e) multimodal fusion. The 3D CNN and the convolutional LSTM are employed to learn short-term and long-term spatiotemporal features, respectively, and spatial pyramid pooling is adopted to normalize the spatiotemporal features for the final gesture recognition. Zhang et al.[57] combined C3D with bidirectional convolutional long short-term memory networks (BLCTMNs) to learn 2D temporal feature maps. The higher-level spatiotemporal features were then learned from the 2D feature maps using 2D CNNs for the final gesture recognition.
A detailed comparison of the different methods is presented in Table 3. The table shows that no single approach achieves the best performance on all datasets.
Table 3 Performance comparison among different methods

| Literature | Strategy | Dataset | Accuracy / Jaccard index (%) |
|---|---|---|---|
| [85] | CNN | Montalbano | 78.90 |
| [92] | DNN+DCNN | Montalbano | 81.62 |
| [86] | Two-stream+RNN | Montalbano | 91.70 |
| [34] | Two-stream | 20BNJester dataset V1 | 96.28 |
| [34] | Two-stream | ChaLearn LAP IsoGD | 57.40 |
| [35] | C3D | ChaLearn LAP IsoGD | 49.20 |
| [40] | C3D+Pyramid | ChaLearn LAP IsoGD | 50.93 |
| [38] | ResC3D | ChaLearn LAP IsoGD | 67.71 |
| [90] | RNN | Montalbano | 90.6 |
| [39] | ResC3D+Attention | SKIG | 100.0 |
| [39] | ResC3D+Attention | ChaLearn LAP IsoGD | 68.14 |
| [36] | R3DCNN+RNN | SKIG | 98.60 |
| [36] | R3DCNN+RNN | Montalbano | 97.40 |
| [33] | C3D+LSTM | SKIG | 98.89 |
| [33] | C3D+LSTM | ChaLearn LAP IsoGD | 51.02 |
| [37] | C3D+LSTM | SKIG | 98.89 |
| [37] | C3D+LSTM | ChaLearn LAP IsoGD | 51.02 |
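The table reports either accuracy or the Jaccard index, without stating which rows use which metric. For readers unfamiliar with the latter, the sketch below shows how a Jaccard index is commonly computed for one gesture as the overlap between predicted and annotated frames. The frame-set formulation and the name `jaccard_index` are assumptions for illustration; the exact evaluation protocol of each dataset may differ.

```python
def jaccard_index(pred_frames, true_frames):
    """Intersection over union of predicted and ground-truth frame indices."""
    pred, true = set(pred_frames), set(true_frames)
    union = pred | true
    if not union:
        return 0.0  # convention chosen here for the empty case
    return len(pred & true) / len(union)


# Ground truth covers frames 10-19, the prediction covers frames 15-24:
# 5 frames overlap out of 15 frames in the union.
score = jaccard_index(range(15, 25), range(10, 20))
```

Unlike classification accuracy, this measure penalizes a prediction that names the right gesture but places it at the wrong time, which is why it is used for continuous (unsegmented) gesture streams.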
5 Discussion
This article summarized the research status of gesture recognition based on computer vision and analyzed related technologies from two aspects: traditional gesture recognition methods and gesture recognition methods based on deep learning. The reviewed work shows that gesture recognition research has made substantial progress. However, many methods are still at the laboratory research stage and remain far from practical application in terms of robustness and real-time performance. The main topics and challenges in the research on gesture recognition based on computer vision are as follows.
5.1 Effective use of time dimension
Static gesture recognition only needs to consider the gesture characteristics of a single image; therefore, static gesture recognition is an image classification task. Conversely, dynamic gesture recognition uses videos. In addition to considering the spatial characteristics of each video frame, it is also necessary to consider changes in gesture actions. Compared with image classification, video classification also needs to consider the time dimension information. The effective use of the time dimension has always been a challenge and focus of gesture recognition.
5.2 Interference information in the video
As in image classification, videos taken in real-life scenarios may contain various types of interference. Examples include gestures blurred by camera shake or by a sudden movement of the performer during recording, and lighting conditions that make the picture too dark or too bright. In addition, the performer's skin color, clothing, and background environment all interfere with recognition. Neural networks tend to take the easiest route: when the desired feature is difficult to learn, they fall back on a feature that is easier to learn. In experiments, videos with different actions but similar backgrounds are often classified into the same category because the gestures are partially blurred.
5.3 Small training datasets
Most deep learning methods rely on large, labeled training datasets, such as ImageNet and Kinetics-400. However, in practical scenarios, obtaining large, labeled training datasets is costly, laborious, and sometimes even impossible. Therefore, it is still challenging to effectively train deep networks using small training datasets.
5.4 Small differences between gestures
The correspondence between gestures and semantics should conform to conventional and universal principles. For some simple tasks, such as switching an electronic device on and off, only two gestures are required, and the difference between them can be very large. However, some complex tasks involve many classes; the Kinetics dataset, for example, contains 400 actions, some of which are very similar but not the same, which undoubtedly increases the difficulty of recognition by a neural network.
6 Future research directions
The discussion of the challenges faced by the available methods allows us to outline several future research directions for the development of deep learning methods for motion recognition. Although the list is not exhaustive, it points to research activities that may advance the field.
6.1 Gap with practical scenarios
Although many large-scale datasets have been collected during the last few years, there is still a large gap between datasets collected in controlled environments and real-world environments. The constrained settings of these datasets do not capture, for instance, the effect that lighting changes can have on the color of a hand. Likewise, most available datasets do not involve many occlusion cases, whereas in practical scenarios occlusion is inevitable. Recovering or finding cues from multimodal data for such recognition tasks would be an interesting research direction.
6.2 Spatiotemporal relation information
Few methods explicitly mine the temporal and spatial information in videos with a different emphasis at different levels. In the future, it will be interesting to mine these two types of information with different weights through deep learning.
6.3 Large-scale datasets
With the development of data-hungry deep learning approaches, there is a demand for large-scale datasets. Although several large datasets exist, such as the ChaLearn LAP IsoGD dataset, they focus on specific tasks, and more varied large-scale datasets are required to facilitate research in this field.
6.4 Real-time gesture recognition and prediction
Real-time gesture recognition and prediction are required for practical applications, and arguably, this is the final goal of gesture recognition systems. Unlike segmented recognition, real-time gesture recognition requires the continuous analysis of gestures. The design of effective real-time gesture recognition and prediction systems using deep learning methods has attracted considerable attention, and the emergence of this challenge heralds further exploration of practical application scenarios.
7 Conclusion
This paper summarized the current state of research on dynamic gesture recognition based on computer vision and reviewed related technologies for multimodal input and dynamic gesture modeling methods. We also provided a brief overview of the research difficulties in this field. Although dynamic gesture recognition research has made significant progress in recent years, there is still much room for improvement. The complementary use of different data modalities enhances the ability to express the details of dynamic gestures. Based on the insights drawn from the survey, several potential research directions have been described, indicating the numerous opportunities in this field despite the advances achieved to date.



Card S K, Moran T P, Newell A. The psychology of human-computer interaction. Hillsdale, New Jersey, Lawrence Erlbaum Associates, 1983


Pollick A S, de Waal F B M. Ape gestures and language evolution. PNAS, 2007, 104(19): 8184–8189 DOI:10.1073/pnas.0702624104


Chen L, Ma N, Wang P, Li J H, Wang P F, Pang G L, Shi X J. Survey of pedestrian action recognition techniques for autonomous driving. Tsinghua Science and Technology, 2020, 25(4): 458–470 DOI:10.26599/tst.2019.9010018


D'Sa A G, Prasad B G. A survey on vision based activity recognition, its applications and challenges. In: 2019 Second International Conference on Advanced Computational and Communication Paradigms (ICACCP). Gangtok, India, IEEE, 2019, 1–8 DOI:10.1109/icaccp.2019.8882896


Devi M, Saharia S, Bhattacharyya D K. Dance gesture recognition: a survey. International Journal of Computer Applications, 2015, 122(5): 19–26 DOI:10.5120/21696-4803


Wang Z J, Hou Y S, Jiang K K, Dou W W, Zhang C M, Huang Z H, Guo Y J. Hand gesture recognition based on active ultrasonic sensing of smartphone: a survey. IEEE Access, 2019, 7: 111897–111922 DOI:10.1109/access.2019.2933987


Xia Z W, Lei Q J, Yang Y, Zhang H D, He Y, Wang W J, Huang M H. Vision-based hand gesture recognition for human-robot collaboration: a survey. In: 2019 5th International Conference on Control, Automation and Robotics (ICCAR). Beijing, China, IEEE, 2019, 198–205 DOI:10.1109/iccar.2019.8813509


Martínez B M, Modolo D, Xiong Y J, Tighe J. Action recognition with spatial-temporal discriminative filter banks. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV). Seoul, Korea (South), IEEE, 2019, 5481–5490 DOI:10.1109/iccv.2019.00558


Diba A, Sharma V, Van Gool L, Stiefelhagen R. DynamoNet: dynamic action and motion network. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV). Seoul, Korea (South), IEEE, 2019, 6191–6200 DOI:10.1109/iccv.2019.00629


Feichtenhofer C, Fan H Q, Malik J, He K M. SlowFast networks for video recognition. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV). Seoul, Korea (South), IEEE, 2019, 6201–6210 DOI:10.1109/iccv.2019.00630


Bhowmick S, Talukdar A K, Sarma K K. Continuous hand gesture recognition for English alphabets. In: 2015 2nd International Conference on Signal Processing and Integrated Networks (SPIN). Noida, India, IEEE, 2015, 443–446 DOI:10.1109/spin.2015.7095264


Lu Z Y, Chen X, Li Q, Zhang X, Zhou P. A hand gesture recognition framework and wearable gesture-based interaction prototype for mobile devices. IEEE Transactions on Human-Machine Systems, 2014, 44(2): 293–299 DOI:10.1109/thms.2014.2302794


Zhang Y, Harrison C. Tomo: wearable, low-cost electrical impedance tomography for hand gesture recognition. UIST'15: Proceedings of the 28th Annual ACM Symposium on User Interface Software & Technology. 2015, 167–173 DOI:10.1145/2807442.2807480


Bobick A F, Davis J W. The recognition of human movement using temporal templates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2001, 23(3): 257–267 DOI:10.1109/34.910878


Konečný J, Hagara M. One-shot-learning gesture recognition using HOG-HOF features. In: Gesture Recognition. Cham: Springer International Publishing. 2017, 365–385 DOI:10.1007/978-3-319-57021-1_12


Donahue J, Hendricks L A, Guadarrama S, Rohrbach M, Venugopalan S, Darrell T, Saenko K. Long-term recurrent convolutional networks for visual recognition and description. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Boston, MA, USA, IEEE, 2015, 2625–2634 DOI:10.1109/cvpr.2015.7298878


Huang Y J, Yang X C, Li Y F, Zhou D L, He K S, Liu H H. Ultrasound-based sensing models for finger motion classification. IEEE Journal of Biomedical and Health Informatics, 2018, 22(5): 1395–1405 DOI:10.1109/jbhi.2017.2766249


Yang X C, Sun X L, Zhou D L, Li Y F, Liu H H. Towards wearable A-mode ultrasound sensing for real-time finger motion recognition. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 2018, 26(6): 1199–1208 DOI:10.1109/tnsre.2018.2829913


Manawadu U E, Kamezaki M, Ishikawa M, Kawano T, Sugano S. A hand gesture based driver-vehicle interface to control lateral and longitudinal motions of an autonomous vehicle. In: 2016 IEEE International Conference on Systems, Man, and Cybernetics (SMC). Budapest, Hungary, IEEE, 2016, 001785–001790 DOI:10.1109/smc.2016.7844497


Kim J, Jung H, Kang M, Chung K. 3D human-gesture interface for fighting games using motion recognition sensor. Wireless Personal Communications, 2016, 89(3): 927–940 DOI:10.1007/s11277-016-3294-9


Yuan X, Dai S, Fang Y Y. A natural immersive closed-loop interaction method for human-robot “rock-paper-scissors” game. Recent Trends in Intelligent Computing, Communication and Devices, 2020, 103–111 DOI:10.1007/978-981-13-9406-5_14


Lichtenauer J F, Hendriks E A, Reinders M J T. Sign language recognition by combining statistical DTW and independent classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008, 30(11): 2040–2046 DOI:10.1109/tpami.2008.123


Cooper H, Ong E, Pugeault N, Bowden R. Sign language recognition using sub-units. Journal of Machine Learning Research, 2012, 2205–2231


Yang D, Lim J K, Choi Y. Early childhood education by hand gesture recognition using a smartphone based robot. In: The 23rd IEEE International Symposium on Robot and Human Interactive Communication. Edinburgh, UK, IEEE, 2014, 987–992 DOI:10.1109/roman.2014.6926381


Ismail Fawaz H, Forestier G, Weber J, Petitjean F, Idoumghar L, Muller P A. Automatic alignment of surgical videos using kinematic data. In: Artificial Intelligence in Medicine. Cham: Springer International Publishing, 2019, 104–113 DOI:10.1007/978-3-030-21642-9_14


Lu X Z, Shen J, Perugini S, Yang J J. An immersive telepresence system using RGB-D sensors and head mounted display. In: 2015 IEEE International Symposium on Multimedia (ISM). Miami, FL, USA, IEEE, 2015, 453–458 DOI:10.1109/ism.2015.108


Cheng K, Ye N, Malekian R, Wang R C. In-air gesture interaction: real time hand posture recognition using passive RFID tags. IEEE Access, 2019, 7: 94460–94472 DOI:10.1109/access.2019.2928318


Trong K N, Bui H, Pham C. Recognizing hand gestures for controlling home appliances with mobile sensors. In: 2019 11th International Conference on Knowledge and Systems Engineering (KSE). Da Nang, Vietnam, IEEE, 2019, 1–7 DOI:10.1109/kse.2019.8919419


Escalera S, Athitsos V, Guyon I. Challenges in multimodal gesture recognition. Gesture recognition, 2017, 1–60


D’Orazio T, Marani R, Renò V, Cicirelli G. Recent trends in gesture recognition: how depth data has improved classical approaches. Image and Vision Computing, 2016, 52: 56–72 DOI:10.1016/j.imavis.2016.05.007


Nyaga C, Wario R. A Review of Sign Language Hand Gesture Recognition Algorithms. In: Advances in Artificial Intelligence, Software and Systems Engineering. Cham, Springer International Publishing, 2021, 207–216


Rautaray S S, Agrawal A. Vision based hand gesture recognition for human computer interaction: a survey. Artificial Intelligence Review, 2015, 43(1): 1–54 DOI:10.1007/s10462-012-9356-9


Cheng H, Yang L, Liu Z C. Survey on 3D hand gesture recognition. IEEE Transactions on Circuits and Systems for Video Technology, 2016, 26(9): 1659–1673 DOI:10.1109/tcsvt.2015.2469551


Khan R Z, Ibraheem N A. Survey on gesture recognition for hand image postures. Computer and Information Science, 2012, 5(3): 110–121 DOI:10.5539/cis.v5n3p110


Devi M, Saharia S, Bhattacharyya D K. Dance gesture recognition: a survey. International Journal of Computer Applications, 2015, 122(5): 19–26 DOI:10.5120/21696-4803


Gao Z M, Wang P C, Wang H G, Xu M L, Li W Q. A review of dynamic maps for 3D human motion recognition using ConvNets and its improvement. Neural Processing Letters, 2020, 52(2): 1501–1515 DOI:10.1007/s11063-020-10320-w


Sun J H, Ji T T, Zhang S B, Yang J K, Ji G R. Research on the hand gesture recognition based on deep learning. In: 2018 12th International Symposium on Antennas, Propagation and EM Theory (ISAPE). Hangzhou, China, IEEE, 2018, 1–4 DOI:10.1109/isape.2018.8634348


Jiang D, Li G F, Sun Y, Kong J Y, Tao B, Chen D S. Grip strength forecast and rehabilitative guidance based on adaptive neural fuzzy inference system using sEMG. Personal and Ubiquitous Computing, 2019, 1–10 DOI:10.1007/s00779-019-01268-3


Guo X, Xu W, Tang W Q, Wen C. Research on optimization of static gesture recognition based on convolution neural network. In: 2019 4th International Conference on Mechanical, Control and Computer Engineering (ICMCCE). Hohhot, China, IEEE, 2019, 398–3982 DOI:10.1109/icmcce48743.2019.00095


Sharma P, Anand R S. Depth data and fusion of feature descriptors for static gesture recognition. IET Image Processing, 2020, 14(5): 909–920 DOI:10.1049/iet-ipr.2019.0230


Jiang D, Li G F, Sun Y, Kong J Y, Tao B. Gesture recognition based on skeletonization algorithm and CNN with ASL database. Multimedia Tools and Applications, 2019, 78(21): 29953–29970 DOI:10.1007/s11042-018-6748-0


Lai K, Yanushkevich S N. CNN+RNN depth and skeleton based dynamic hand gesture recognition. In: 2018 24th International Conference on Pattern Recognition (ICPR). Beijing, China, IEEE, 2018, 3451–3456 DOI:10.1109/icpr.2018.8545718


Kajan S, Goga J, Zsíros O. Comparison of algorithms for dynamic hand gesture recognition. In: 2020 Cybernetics & Informatics (K&I). Velke Karlovice, Czech Republic, IEEE, 2020, 1–5 DOI:10.1109/ki48306.2020.9039850


Li G F, Wu H, Jiang G Z, Xu S, Liu H H. Dynamic gesture recognition in the Internet of Things. IEEE Access, 2018, 7: 23713–23724 DOI:10.1109/access.2018.2887223


Materzynska J, Berger G, Bax I, Memisevic R. The jester dataset: a large-scale video dataset of human gestures. In: 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW). Seoul, Korea (South), IEEE, 2019, 2874–2882 DOI:10.1109/iccvw.2019.00349


Escalera S, Baró X, Gonzàlez J, Bautista M A, Madadi M, Reyes M, Ponce-López V, Escalante H J, Shotton J, Guyon I. ChaLearn Looking at People Challenge 2014: Dataset and Results. In: Computer Vision-ECCV 2014 Workshops. Cham, Springer International Publishing, 2015, 45–47


Wan J, Li S Z, Zhao Y B, Zhou S, Guyon I, Escalera S. ChaLearn looking at people RGB-D isolated and continuous datasets for gesture recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). Las Vegas, NV, USA, IEEE, 2016, 761–769 DOI:10.1109/cvprw.2016.100


Amir A, Taba B, Berg D, Melano T, McKinstry J, Di Nolfo C, Nayak T, Andreopoulos A, Garreau G, Mendoza M, Kusnitz J, Debole M, Esser S, Delbruck T, Flickner M, Modha D. A low power, fully event-based gesture recognition system. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, HI, USA, IEEE, 2017, 7388–7397 DOI:10.1109/cvpr.2017.781


Liu L, Shao L. Learning discriminative representations from RGB-D video data. In: Proceedings of the Twenty-Third international joint conference on Artificial Intelligence. Beijing, China, AAAI Press, 2013, 1493–1500


Zhang Y F, Cao C Q, Cheng J, Lu H Q. EgoGesture: a new dataset and benchmark for egocentric hand gesture recognition. IEEE Transactions on Multimedia, 2018, 20(5): 1038–1050 DOI:10.1109/tmm.2018.2808769


Jiang D, Zheng Z J, Li G F, Sun Y, Kong J Y, Jiang G Z, Xiong H G, Tao B, Xu S, Yu H, Liu H H, Ju Z J. Gesture recognition based on binocular vision. Cluster Computing, 2019, 22(6): 13261–13271 DOI:10.1007/s10586-018-1844-5


Wang H G, Wang P C, Song Z J, Li W Q. Large-scale multimodal gesture recognition using heterogeneous networks. In: 2017 IEEE International Conference on Computer Vision Workshops (ICCVW). Venice, Italy, IEEE, 2017, 3129–3137 DOI:10.1109/iccvw.2017.370


Zhu G M, Zhang L, Shen P Y, Song J. Multimodal gesture recognition using 3D convolution and convolutional LSTM. IEEE Access, 2017, 5: 4517–4524 DOI:10.1109/access.2017.2684186


Köpüklü O, Köse N, Rigoll G. Motion fused frames: data level fusion strategy for hand gesture recognition. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). Salt Lake City, UT, USA, IEEE, 2018, 2184–21848 DOI:10.1109/cvprw.2018.00284


Li Y N, Miao Q G, Tian K, Fan Y Y, Xu X, Li R, Song J F. Large-scale gesture recognition with a fusion of RGB-D data based on the C3D model. In: 2016 23rd International Conference on Pattern Recognition (ICPR). Cancun, Mexico, IEEE, 2016, 25–30 DOI:10.1109/icpr.2016.7899602


Molchanov P, Yang X D, Gupta S, Kim K, Tyree S, Kautz J. Online detection and classification of dynamic hand gestures with recurrent 3D convolutional neural networks. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, NV, USA, IEEE, 2016, 4207–4215 DOI:10.1109/cvpr.2016.456


Zhang L, Zhu G M, Shen P Y, Song J, Shah S A, Bennamoun M. Learning spatiotemporal features using 3DCNN and convolutional LSTM for gesture recognition. In: 2017 IEEE International Conference on Computer Vision Workshops (ICCVW). Venice, Italy, IEEE, 2017, 3120–3128 DOI:10.1109/iccvw.2017.369


Miao Q G, Li Y N, Ouyang W L, Ma Z X, Xu X, Shi W K, Cao X C. Multimodal gesture recognition based on the ResC3D network. In: 2017 IEEE International Conference on Computer Vision Workshops (ICCVW). Venice, Italy, IEEE, 2017, 3047–3055 DOI:10.1109/iccvw.2017.360


Li Y N, Miao Q G, Qi X D, Ma Z X, Ouyang W L. A spatiotemporal attention-based ResC3D model for large-scale gesture recognition. Machine Vision and Applications, 2019, 30(5): 875–888 DOI:10.1007/s00138-018-0996-x


Chai X J, Liu Z P, Yin F, Liu Z, Chen X L. Two streams recurrent neural networks for large-scale continuous gesture recognition. In: 2016 23rd International Conference on Pattern Recognition (ICPR). Cancun, Mexico, IEEE, 2016, 31–36 DOI:10.1109/icpr.2016.7899603


Sydney, Australia, New York, ACM Press, 2013 DOI:10.1145/2522848.2532589


Narayana P, Beveridge J R, Draper B A. Gesture recognition: focus on the hands. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA, IEEE, 2018, 5235–5244 DOI:10.1109/cvpr.2018.00549


Neverova N, Wolf C, Taylor G, Nebout F. ModDrop: adaptive multi-modal gesture recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016, 38(8): 1692–1706 DOI:10.1109/tpami.2015.2461544


Zhu G M, Zhang L, Mei L, Shao J, Song J, Shen P Y. Large-scale isolated gesture recognition using pyramidal 3D convolutional networks. In: 2016 23rd International Conference on Pattern Recognition (ICPR). Cancun, Mexico, IEEE, 2016, 19–24 DOI:10.1109/icpr.2016.7899601


Wan J, Li S Z, Zhao Y B, Zhou S, Guyon I, Escalera S. ChaLearn looking at people RGB-D isolated and continuous datasets for gesture recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). Las Vegas, NV, USA, IEEE, 2016, 761–769 DOI:10.1109/cvprw.2016.100


Wang P C, Li W Q, Liu S, Gao Z M, Tang C, Ogunbona P. Large-scale isolated gesture recognition using convolutional neural networks. In: 2016 23rd International Conference on Pattern Recognition (ICPR). Cancun, Mexico, IEEE, 2016, 7–12 DOI:10.1109/icpr.2016.7899599


Zhan F. Hand gesture recognition with convolution neural networks. In: 2019 IEEE 20th International Conference on Information Reuse and Integration for Data Science (IRI). Los Angeles, CA, USA, IEEE, 2019, 295–298 DOI:10.1109/iri.2019.00054


Du T, Ren X M, Li H C. Gesture recognition method based on deep learning. In: 2018 33rd Youth Academic Annual Conference of Chinese Association of Automation (YAC). Nanjing, China, IEEE, 2018, 782–787 DOI:10.1109/yac.2018.8406477


Hong J Y, Park S H, Baek J G. Segmented dynamic time warping based signal pattern classification. In: 2019 IEEE International Conference on Computational Science and Engineering (CSE) and IEEE International Conference on Embedded and Ubiquitous Computing (EUC). New York, NY, USA, IEEE, 2019, 263–265 DOI:10.1109/cse/euc.2019.00058


Plouffe G, Cretu A M. Static and dynamic hand gesture recognition in depth data using dynamic time warping. IEEE Transactions on Instrumentation and Measurement, 2016, 65(2): 305–316 DOI:10.1109/tim.2015.2498560


Fine S, Singer Y, Tishby N. The hierarchical hidden Markov model: analysis and applications. Machine Learning, 1998, 32(1): 41–62 DOI:10.1023/a:1007469218079


Haid M, Budaker B, Geiger M, Husfeldt D, Hartmann M, Berezowski N. Inertial-based gesture recognition for artificial intelligent cockpit control using hidden Markov models. In: 2019 IEEE International Conference on Consumer Electronics (ICCE). Las Vegas, NV, USA, IEEE, 2019, 1–4 DOI:10.1109/icce.2019.8662036


Corradini A. Dynamic time warping for off-line recognition of a small gesture vocabulary. In: Proceedings IEEE ICCV Workshop on Recognition, Analysis, and Tracking of Faces and Gestures in Real-Time Systems. Vancouver, BC, Canada, IEEE, 2001, 82–89 DOI:10.1109/ratfg.2001.938914


Saha S, Lahiri R, Konar A, Banerjee B, Nagar A K. HMM-based gesture recognition system using Kinect sensor for improvised human-computer interaction. In: 2017 International Joint Conference on Neural Networks (IJCNN). Anchorage, AK, USA, IEEE, 2017, 2776–2783 DOI:10.1109/ijcnn.2017.7966198


Yang Z, Li Y, Chen W D, Zheng Y. Dynamic hand gesture recognition using hidden Markov models. In: 2012 7th International Conference on Computer Science & Education (ICCSE). Melbourne, VIC, Australia, IEEE, 2012, 360–365 DOI:10.1109/iccse.2012.6295092


Murphy K P. Machine learning: a probabilistic perspective. MIT Press, 2012


Simonyan K, Zisserman A. Two-stream convolutional networks for action recognition in videos. 2014


Wang L M, Xiong Y J, Wang Z, Qiao Y, Lin D H, Tang X O, van Gool L. Temporal segment networks: towards good practices for deep action recognition. In: Computer Vision–ECCV 2016. Cham: Springer International Publishing, 2016, 20–36 DOI:10.1007/978-3-319-46484-8_2


Feichtenhofer C, Pinz A, Zisserman A. Convolutional two-stream network fusion for video action recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, NV, USA, IEEE, 2016, 1933–1941 DOI:10.1109/cvpr.2016.213


Zhu Y, Lan Z Z, Newsam S, Hauptmann A. Hidden two-stream convolutional networks for action recognition. In: Computer Vision–ACCV 2018. Cham: Springer International Publishing, 2019, 363–378 DOI:10.1007/978-3-030-20893-6_23


Wu D, Pigou L, Kindermans P J, Le N D H, Shao L, Dambre J, Odobez J M. Deep dynamic neural networks for multimodal gesture segmentation and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016, 38(8): 1583–1597 DOI:10.1109/tpami.2016.2537340


Xu P. A real-time hand gesture recognition and human-computer interaction system. 2017


Pigou L, Dieleman S, Kindermans P J, Schrauwen B. Sign language recognition using convolutional neural networks. In: Computer Vision–ECCV 2014 Workshops. Cham: Springer International Publishing, 2015, 572–578 DOI:10.1007/978-3-319-16178-5_40


Soomro K, Zamir A R, Shah M. UCF101: a dataset of 101 human actions classes from videos in the wild. 2012


Kuehne H, Jhuang H, Garrote E, Poggio T, Serre T. HMDB: a large video database for human motion recognition. In: 2011 International Conference on Computer Vision. Barcelona, Spain, IEEE, 2011, 2556–2563 DOI:10.1109/iccv.2011.6126543


Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z. Rethinking the inception architecture for computer vision. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, NV, USA, IEEE, 2016, 2818–2826 DOI:10.1109/cvpr.2016.308


Wang H S, Wang L. Modeling temporal dynamics and spatial configurations of actions using two-stream recurrent neural networks. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, HI, USA, IEEE, 2017, 3633–3642 DOI:10.1109/cvpr.2017.387


Tran D, Bourdev L, Fergus R, Torresani L, Paluri M. Learning spatiotemporal features with 3D convolutional networks. In: 2015 IEEE International Conference on Computer Vision (ICCV). Santiago, Chile, IEEE, 2015, 4489–4497 DOI:10.1109/iccv.2015.510


Tran D, Ray J, Shou Z, Chang S F, Paluri M. ConvNet architecture search for spatiotemporal feature learning. 2017


Pigou L, van den Oord A, Dieleman S, van Herreweghe M, Dambre J. Beyond temporal pooling: recurrence and temporal convolutions for gesture recognition in video. International Journal of Computer Vision, 2018, 126(2/3/4): 430–439 DOI:10.1007/s11263-016-0957-7


Hochreiter S, Schmidhuber J. Long short-term memory. Neural Computation, 1997, 9(8): 1735–1780 DOI:10.1162/neco.1997.9.8.1735


Zhao S Y, Yang W K, Wang Y G. A new hand segmentation method based on fully convolutional network. In: 2018 Chinese Control And Decision Conference (CCDC). Shenyang, China, IEEE, 2018, 5966–5970 DOI:10.1109/ccdc.2018.8408176