

2021, 3(3): 248–260

Published Date: 2021-06-20  DOI: 10.1016/j.vrih.2021.05.004


There is a large community of deaf-mutes worldwide, and sign language is their major communication tool. Deaf-mutes need to be able to communicate with hearing people, and hearing people also need to understand sign language, which creates a great demand for sign language tuition. Although many books have been written about sign language, it is inefficient to learn sign language through reading alone, and the same can be said of watching videos. To solve this problem, we developed a smartphone-based interactive Chinese sign language teaching system that facilitates sign language learning.
The system provides the learner with several learning modes and captures the learner's actions using the front camera of the smartphone. At present, the system offers a vocabulary set of 1000 frequently used words, and learners can evaluate their sign actions by subjective or objective comparison. In word recognition mode, users can perform any word within the vocabulary, and the system returns the top three retrieved candidates, reminding learners what the sign is.
This system provides interactive learning that enables users to learn sign language efficiently. It adopts an algorithm based on point cloud recognition to evaluate a user's sign and requires about 700ms of inference time per sample, which meets real-time requirements.
This interactive learning system decreases the communication barriers between deaf-mutes and hearing people.


1 Introduction
The World Health Organization (WHO) reported that there are approximately 466 million people worldwide with disabling hearing loss. This is over 5% of the world's population, and 34 million of these people are children. The Second China National Sample Survey on Disability reported that the number of people with hearing disabilities in China was about 20.04 million, accounting for 1.53% of the total population at that time[1]. In this large community, sign language is a major communication tool that expresses specific meanings through the movement, position, and orientation of the hands, the body, and facial expressions. Sign language is a fully-fledged natural language with its own lexicon and grammar.
With the development of society, an increasing number of people attach importance to learning sign language. However, sign language teaching suffers from several challenges, such as a shortage of qualified teachers, a lack of interaction during in-class learning, and limited attention to each student[2]. As smartphones are now ubiquitous, this work aims to provide an interactive teaching system for those who are willing to learn sign language.
In this study, we designed and implemented a Chinese sign language teaching system on a smartphone that covers a Chinese sign language vocabulary set of 1000 frequently used words. As shown in Figure 1, the system provides multiple functions for learning: users can study a sign word's textual description, sketch, and demonstration video in word learning mode; practice interactively by watching a demonstration, performing the sign, and receiving an assessment in learning test mode; and use text and sign searching in dictionary searching mode. Compared to in-class sign language learning, this system provides an opportunity for personalized one-to-one learning. The developed system can be a valuable educational tool for out-of-school practice. It can also be a learning platform for those who want to learn sign language and for deaf people who want to learn Mandarin.
2 Related work
2.1 Computer vision for assistive technologies
Over the last few decades, there has been a tremendous increase in demand for Assistive Technologies (AT), which are useful for overcoming functional limitations of individuals and improving their quality of life[3]. Lee and Medioni presented a vision-based wearable RGB-D camera indoor navigation system[4]. The system guides the visually impaired user from one location to another without providing map or GPS information in advance. Kayukawa et al. developed an assistive suitcase system, BBeep, to support blind people when walking through crowded environments[5].
Some studies have also focused on social medical service systems. Zeng et al. developed a hand gesture system for Human Computer Interaction (HCI) and applied it to intelligent wheelchair control[6]. Parmar et al. collected skeletal data of patients with cerebral palsy during rehabilitation using the Kinect device and evaluated the quality of their movements[7]. Feng et al. proposed a vision-based fall detection method for monitoring elderly people in a home care environment[8]. Martinel et al. proposed a food recognition system that can automatically identify the type and quantity of food and estimate its calorie content[9].
2.2 Existing sign language teaching systems
At present, computer assistive technology is used in sign language teaching, which greatly improves learning efficiency. Some sign language teaching systems have also been developed.
Based on a comprehensive understanding of the methods and basic requirements of sign language teaching, Zhang designed a sign language teaching platform using a Kinect sensor[10]. The system provides new functions to input and recognize sign language, which can support current sign language teaching.
Sagawa and Takeuchi developed a Japanese sign language teaching system to support sign language study[11]. The system includes sign language recognition, which recognizes an input Japanese sign language gesture and translates it into Japanese, and sign language generation, which translates Japanese into Japanese sign language and displays it as 3DCG (three-dimensional computer graphics) animation.
Phan et al. designed a sign language teaching system called My Interactive Coach (MIC) [12]. This system can teach learners Australian signs, allow them to practice, record their practiced performance, and provide different forms of visual feedback on how well they performed the signs.
2.3 Sign language recognition
Sign language recognition has attracted an increasing number of researchers[13,14]. Since 2013, several research teams have conducted studies on isolated-word sign language recognition based on Convolutional Neural Networks (CNNs). Pigou et al. used a CNN to extract single-frame gesture features, followed by three fully connected layers for classification[15]. Their method achieved an accuracy of 91.7% on a vocabulary set of 20 Italian sign language words. Köpüklü et al. fused motion information into static images and fed the fused spatiotemporal features into a subsequent CNN for recognition[16].
A CNN only encodes features within single-frame image data. A 3D-CNN, in contrast, can simultaneously model the spatio-temporal features of a video clip, capturing motion information across multiple frames for a more global representation. The 3D-CNN was first proposed by Ji et al. for behavior recognition[17]. To achieve gesture recognition on large-scale videos, Li et al. proposed a C3D model based on RGB-D data[18,19]. Huang et al. proposed a sign language recognition method based on a spatial-temporal attention mechanism in a 3D-CNN network, which integrates spatial attention to focus on the hand features of interest and uses a temporal attention mechanism to weight the important sign language movements for classification[20].
Compared to the two networks mentioned above, the Recurrent Neural Network (RNN) is designed to process sequential data and is better at capturing long-term contextual information. Therefore, in recent years, more and more sign language recognition research has been based on RNNs. Cate et al. recognized 95 categories of isolated sign words using RNN-based sequential modeling of video features[21]. Chai et al. proposed a Two-stream Recurrent Neural Network (2s-RNN) in 2016, which ranked first in that year's ChaLearn gesture recognition challenge[22]. Bantupalli and Xie proposed an RNN based on an Inception model for sign language recognition[23]. Liao et al. fed segmented hand features and original RGB data into a BLSTM and achieved accurate recognition of isolated sign words[24].
3 System design and implementation
In this study, we designed and implemented an interactive Chinese sign language teaching app on a smartphone. Users can interactively learn sign language and evaluate their sign expressions through this system. As shown in Figure 2, the main interface of the app contains three functional modules. By clicking the word learning function in the upper part, users enter the word-learning interface, which offers three learning modes and learning statistics. After learning words, users can evaluate the quality of their learning by clicking the icons in the lower half of the interface. The bottom tab provides auxiliary functions, such as text searching and sign searching. We explain each of these functional modules in the following sections.
Figure 1 shows the overall architecture of the proposed system. We adopt multilevel interfaces so that users can easily start and switch between interfaces. To provide better feedback during learning, we adopt a sign language quality assessment method to promptly score the sign language performed by users, and a sign language recognition method to search for relevant signs in the database for understanding and comparison. The relevant algorithms are introduced in Section 3.4.
3.1 Word learning
This system contains a multimedia vocabulary set of the 1000 most frequently used words from "Basic Gestures of Chinese Sign Language"[25]. To better organize the learning schedule for such a large vocabulary, the app provides several learning modes, including sequential learning, unit learning, and random learning. In sequential learning, users learn sign words step-by-step; the app records the learning history and resumes from the last word reviewed in the most recent session. In unit learning, we divide the 1000 sign words into 31 units of approximately 30 words each, grouped into categories such as appellation and occupation, so learners can systematically study the units they prefer. In contrast, to weaken the connections between words and avoid confusion, we also provide a random learning mode, which presents words in random order.
Furthermore, we added a 'Daily learning' function to the app's home page. Learners can set the number of words to learn each day so that they can follow a personalized schedule. In addition, the proportion of new words can be set to balance study and review.
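As an illustrative sketch of the daily-learning schedule above, the word list for a day can be assembled from a pool of unseen words and a pool of previously learned words according to the user-set ratio. The function name, pool arguments, and default values below are illustrative, not the app's actual code.

```python
import random

def daily_words(new_pool, review_pool, total=20, new_ratio=0.3):
    """Pick today's study list: a user-set share of new words, with the
    remainder drawn from previously learned words for review."""
    n_new = min(int(total * new_ratio), len(new_pool))
    n_review = min(total - n_new, len(review_pool))
    words = new_pool[:n_new] + random.sample(review_pool, n_review)
    random.shuffle(words)  # avoid presenting all new words first
    return words
```

Increasing `new_ratio` shifts the session toward study; lowering it shifts it toward review.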
Once a learning mode is selected, the user enters the word-demonstration interface. As shown in Figure 3a, the learning interface displays the demonstration video, sketch, and textual description from top to bottom. Learners can thus obtain a comprehensive understanding of the sign word, and the app also provides pause and multiple playback speeds so that users can study video details at their preferred pace.
In particular, we have a function similar to listening and repeating in spoken language learning: Play with me, as shown in Figure 3b. Learners can enter this function by clicking on a video at the interface. The application will first play a demo video and then wait for 3s before recording. Learners can follow the demo video and practice the same word. After recording, the application will rate learners' performance through the sign language quality assessment algorithm, which provides timely feedback to learners to achieve a better learning experience.
3.2 Learning testing
In addition to the learning modules, we provide examination modules so that learners can evaluate their learning results. Since sign language is a method of communication, we test learners from two aspects: recognition of sign words and quality of sign performance. Figure 4 shows these two modules.
The Identify module mainly examines the learner's understanding of sign language. The app plays a sign video, and learners must choose the correct word from three options. If a learner chooses incorrectly, the correct answer is displayed so that the learner can see what was misunderstood. The learner can then click the '>' button to generate the next question.
In the Sign module, the app uses the assessment algorithm to evaluate the learner's actions. Learners make the corresponding sign action according to a hint from the app. The app then gives the learner a score based on his or her performance, and learners can keep practicing to achieve higher scores.
3.3 Dictionary searching
The dictionary searching mode is an auxiliary teaching function that allows users to quickly find the information they need. This module includes two functions: text searching and sign searching.
As Figure 5a shows, text searching is retrieval from text to sign words. When users forget the action of a sign word, they can input the word they want to query in the search box at the top of the interface, and the app will return matching words using fuzzy word matching.
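The paper does not specify the matching algorithm used; as one plausible sketch, exact-substring hits can be combined with edit-distance neighbours from Python's standard `difflib` module (the function name and cutoff are assumptions):

```python
import difflib

def fuzzy_search(query, vocabulary, limit=5):
    """Fuzzy word matching for text search: substring hits first,
    then close matches ranked by edit similarity."""
    hits = [w for w in vocabulary if query in w]
    close = difflib.get_close_matches(query, vocabulary, n=limit, cutoff=0.6)
    seen, out = set(), []
    for w in hits + close:          # keep order, drop duplicates
        if w not in seen:
            seen.add(w)
            out.append(w)
    return out[:limit]
```

In the real app the matching would run over the 1000-word Chinese vocabulary rather than English strings.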
Sign searching recognizes an input signed gesture and translates it into Mandarin. When users encounter an unfamiliar sign, they can use this function to query it. As Figure 5b shows, the user clicks the BEGIN button to start recording the video and clicks the button again after finishing the sign action. The recognition algorithm then processes the recorded video, and after a short wait the app provides the top three candidates.
3.4 Recognition and evaluation algorithm
We use the front camera of the smartphone to capture images, which are used to recognize and evaluate learners' sign actions. First, the point cloud sequence is extracted from the depth video and sent to the improved FlickerNet[26] to extract spatio-temporal features. Then, we propose a time-sequence partitioning model to enhance core gesture information by block prediction. Finally, all local features are concatenated to form the video feature. The framework of the proposed approach is illustrated in Figure 6. The operators used in the network are supported by mainstream mobile deep learning frameworks, so the model can be deployed easily and quickly.
3.4.1 Point cloud formation
Considering the smartphone's limited computational power and its inability to handle complex network structures such as 3D convolutions, we use a point cloud instead of an RGB image as the input. Each pixel in the depth image represents the distance from the camera and a confidence level. When using sign language for communication, the hands mainly appear in front of the body, so we can separate the hand area by setting a depth threshold. Subsequently, N points (128 by default) are sampled from the hand area through random sampling or farthest point sampling as the initial point cloud input. An example is shown in Figure 7.
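The threshold-and-sample step above can be sketched as follows. The depth threshold value is illustrative (the actual value depends on the camera), and the farthest point sampling shown is the standard greedy variant:

```python
import numpy as np

def farthest_point_sampling(pts, n_points):
    """Greedy FPS: repeatedly pick the point farthest from the chosen set."""
    chosen = [0]
    dist = np.full(len(pts), np.inf)
    for _ in range(n_points - 1):
        dist = np.minimum(dist, np.linalg.norm(pts - pts[chosen[-1]], axis=1))
        chosen.append(int(dist.argmax()))
    return pts[chosen]

def hand_point_cloud(depth, threshold=700, n_points=128):
    """Keep pixels closer than `threshold` (illustrative, in mm) as the hand
    area, then sample a fixed-size (x, y, depth) point cloud from it."""
    ys, xs = np.nonzero((depth > 0) & (depth < threshold))
    pts = np.stack([xs, ys, depth[ys, xs]], axis=1).astype(np.float32)
    if len(pts) <= n_points:                    # pad by repetition if too few pixels
        idx = np.resize(np.arange(len(pts)), n_points)
        return pts[idx]
    return farthest_point_sampling(pts, n_points)
```

Random sampling is cheaper per frame, while FPS covers the hand shape more evenly with the same point budget.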
3.4.2 Network structure
The input of the network is a point cloud of size T × N × d, where T is the number of frames, N is the number of points sampled per frame, and d is the number of channels per point. We use only (x, y, z, t) as the point input for simplicity and efficiency (d = 4). We use a modified version of FlickerNet to extract features. For a sign language video, we obtain a feature map of size T × N′ × C, where N′ is the number of points after down-sampling and C is the output dimension of the points.
An RNN is not suitable for running on smartphones because of its relatively complex network structure. Each Chinese sign language word usually contains two or three atomic actions, so we can divide a sign language video into several small segments and make predictions separately. However, it is not realistic to divide each atomic action accurately. Following the idea of PCB[27], we divide the features along the time dimension to obtain p time blocks, as shown in Figure 8. For each part, we adopt an average pooling operation to obtain p local features, and use a 1 × 1 convolution to reduce their dimension, which yields the new local features F_i. Next, each feature F_i is input into a fully connected layer to predict the class probability of the input.
During training, we optimize the network by minimizing the sum of the cross-entropy losses of the p classifiers. In the testing phase, the p local feature vectors F_i are concatenated to form the feature vector F of the entire video. For sign language recognition, the top-3 classes are selected as candidates.
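The PCB-style temporal partitioning can be sketched as follows, assuming per-frame features have already been pooled over the point dimension to a (T, C) map; the function names are illustrative:

```python
import numpy as np

def time_block_features(feat, p=4):
    """Split a (T, C) frame-feature map into p equal time blocks and
    average-pool each block, giving p local features of shape (p, C)."""
    T, C = feat.shape
    assert T % p == 0, "the block count must divide the frame count"
    return feat.reshape(p, T // p, C).mean(axis=1)

def video_feature(feat, p=4):
    """Testing phase: concatenate the p local features into one descriptor."""
    return time_block_features(feat, p).reshape(-1)   # shape (p * C,)
```

During training each of the p rows would feed its own classifier; at test time the concatenated vector describes the whole video.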
To enable word learning and measure learning progress, we use a quality evaluation algorithm. Our evaluation criterion is whether the sign can be recognized correctly: if the algorithm correctly recognizes the learner's sign, the score is high; if it recognizes the sign as another word, the score is low. Specifically, we obtain the recognition probability of each word from the network above, normalize the probabilities to 0-1 using the softmax function, and multiply the probability of the target category by 100 to obtain the final percentage score.
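The scoring rule just described reduces to a few lines; this is a minimal sketch (the function name is illustrative) that maps the classifier's raw outputs to a 0-100 score:

```python
import math

def sign_score(logits, target_index):
    """Quality score: softmax-normalize the classifier's outputs, then scale
    the target word's probability to a 0-100 percentage score."""
    m = max(logits)                          # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    return 100.0 * exps[target_index] / sum(exps)
```

A sign that the network confidently assigns to the target word scores near 100; one confused with another word scores near 0.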
3.4.3 Deployment model
As this is an interactive application with a critical low-latency requirement, it is extremely important that the app runs on an edge device in real time. At present, mainstream deep learning frameworks support deployment on mobile devices. PyTorch released PyTorch Mobile at the end of 2019, which supports an end-to-end workflow from Python to deployment on iOS and Android. Users can seamlessly move from a training model to a deployment model without other tools.
Therefore, we write the model in PyTorch, train it on a desktop, and convert it to a TorchScript file by calling the "torch.jit.trace" method. The file is then placed in the "assets" folder of the Android project, and PyTorch Android is added as a Gradle dependency in build.gradle. The app can then read the model from the assets folder.
For the smartphone, we use the HUAWEI Mate20 Pro because it is equipped with a 3D depth perception camera. It uses structured light to capture 3D data within a range of about 70cm to obtain depth images. We also tested the application on the HUAWEI Mate30 Pro.
3.5 Interactive design
For conciseness, most buttons in the system are represented by icons, which avoids text and reduces barriers between languages. The user can tap an icon to select and jump to the corresponding function. In addition, the application adds sliding controls, such as left-swipe and right-swipe to switch words (Figure 9a), and up-swipe to return.
When a user records videos, the optimal distance from the smartphone is approximately 50cm. At this distance, it is not convenient to reach out and tap a button on the screen. Therefore, we added an in-air gesture control function that helps the user operate the app at a distance.
The app captures the depth image from the front camera of the HUAWEI Mate20 Pro, crops the hand area using the preset threshold, and draws a "red dot" on the screen to indicate the position of the hand within the depth image, as shown in Figure 9b. The learner controls the movement of the "red dot" by moving his/her hand. To match everyday habits, left and right movements are mirrored. If the "red dot" stays within an icon's area for more than 1s, the app treats it as a click. In this way, learners can switch words during the learning process, start and end video recordings, and so on. This function can be toggled in the settings interface.
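The dwell-to-click behaviour above can be sketched with a small state machine; the class and argument names are illustrative, and the mirrored x coordinate (screen width minus raw x) is assumed to be applied before calling `update`:

```python
class DwellClick:
    """Trigger a click once the tracked hand point stays inside an icon's
    rectangle for `dwell` seconds (1 s in the app described above)."""
    def __init__(self, rect, dwell=1.0):
        self.x0, self.y0, self.x1, self.y1 = rect
        self.dwell = dwell
        self.enter_time = None          # when the dot entered the icon area

    def update(self, x, y, t):
        """Feed one tracked position (already mirrored) with its timestamp t."""
        inside = self.x0 <= x <= self.x1 and self.y0 <= y <= self.y1
        if not inside:
            self.enter_time = None      # leaving the icon resets the timer
            return False
        if self.enter_time is None:
            self.enter_time = t
        return t - self.enter_time >= self.dwell
```

One such tracker per on-screen icon is enough; whichever fires first receives the click.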
4 Experiments
4.1 Evaluation of recognition
We perform comprehensive studies on the Chinese Sign Language Dataset, collected from 30 invited deaf-mutes. For the 100 most common words in Chinese sign language, each person performed every word twice, and 5892 valid videos were obtained. We select the data of 27 people (a total of 5297 videos) as the training set and that of the remaining 3 people (a total of 595 videos) as the test set. We train the models from scratch for 200 epochs with a mini-batch size of 32. The Adam optimizer with a momentum of 0.9 is used with a learning rate of 10⁻⁴, which is divided by 10 at epoch 100.
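The learning-rate schedule just described is a single-step decay; a minimal sketch (names and signature are illustrative):

```python
def step_lr(epoch, base_lr=1e-4, decay_epoch=100, factor=10):
    """Step schedule used in training: base LR of 1e-4,
    divided by 10 from epoch 100 onward."""
    return base_lr / factor if epoch >= decay_epoch else base_lr
```

In a training loop the optimizer's learning rate would be reset to `step_lr(epoch)` at the start of each epoch.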
Experiments on the number of frames. The effective length of each video is between 20 and 70 frames. To handle the inconsistent video lengths, we uniformly sample T frames from each depth video. As the number of frames increases, the recognition accuracy also increases; however, the improvement from 16 to 32 frames is not as pronounced. The experimental results are presented in Figure 10a.
Experiments on the number of points. We need to extract N points from the depth image as the hand point cloud. The greater the number of points, the more accurate the description of the hand shape and the better the recognition performance. The experimental results are presented in Figure 10b. As 128 and 256 points achieve similar performance, we use 128 points in subsequent experiments.
Experiments on the temporal partitioning number. In the temporal block model, the feature map of the video is divided into p local feature vectors along the time dimension. We conduct experiments to determine the value of p. To ensure that the video can be divided evenly, p is a factor of the 16-frame video length. When p = 1, the model is equivalent to no partitioning; p = 16 means that each frame is processed separately. The experimental results are presented in Figure 10c. The best result is achieved when p = 4.
To verify the generalizability of our model, we conduct experiments on the SHREC'17 dataset[28], a public dynamic hand gesture dataset. Table 1 compares our performance with several state-of-the-art methods. Because the SHREC'17 dataset provides skeletal data of the hand and fingers, most previous works use skeleton sequences as input, which provide relatively accurate hand pose structures and joint trajectories. The work in [29] demonstrated that point cloud-based methods can achieve excellent performance even when accurate hand poses are difficult to obtain in real-world cases. In addition, the network structure proposed in this paper is more concise and easier to deploy on a mobile terminal.
Table 1 Performance comparison (%) on the SHREC'17 dataset
Method Modality 14 gestures 28 gestures
Key frames[28] Depth sequence 82.9 71.9
SoCJ+HoHD+HoWR[30] Skeleton 88.2 81.9
Res-TCN[31] Skeleton 91.1 87.3
STA-Res-TCN[31] Skeleton 93.6 90.7
ST-GCN[32] Skeleton 92.7 87.7
DG-STA[33] Skeleton 94.4 90.7
PointLSTM-middle[29] Point clouds 95.9 94.7
Baseline Point clouds 91.6 89.2
Ours Point clouds 96.3 95.0
Recognition algorithms play an important role in the implemented system. Through the algorithm, we can not only recognize sign language and provide more accurate candidate results for the word recognition module, but also score more accurately in quality evaluation to provide more effective feedback to users.
4.2 User study
To evaluate the usability of this system, we invited 10 students with little sign language learning experience to take part in the test. For comparison, we also included two free sign language apps[34,35] from the Android application market. We first introduced the usage and main functions of each application to the students. Each student used each application freely for ten minutes and then rated it on the following five aspects: comprehensibility of learning content, learning feedback, learning interest, interactive experience, and system stability.
Figure 11 shows a statistical comparison of the user experience with the other two apps. Each indicator in the figure is the average score across participants, on a scale of 0 to 5; a higher score indicates a better user experience. As the figure shows, compared with the other two systems, the system proposed in this paper provides feedback to users and promotes more efficient learning. Meanwhile, because our system features an interactive experience, users felt more engaged.
The students also generally believed that the system can effectively improve learning and make the process of learning sign language more interesting, unlike the less engaging method of reading books. However, one disadvantage is that the operation is somewhat tedious, and the whole process sometimes requires more time than referring to books.
5 Discussion
We developed a smartphone-based Chinese sign language teaching system with different learning modes. The system integrates recent sign language technologies, which achieve both accurate and efficient sign language recognition and quality evaluation. It helps users learn sign language through a smartphone anytime and anywhere, reducing the communication barriers between deaf and hearing people. The system can also be applied to similar teaching scenarios, such as the gestures of traffic police.
This system still has aspects that can be improved. We will continue to improve its functions and interactions so that it can be used to learn not only isolated words but also continuous sign language. In addition, we currently use recognition probability as the basis for sign language assessment; other assessment methods could be explored in future studies. We hope that this app will be applied to public welfare to improve the study and life of disabled people.



Gong C, Bin C, Lei Z. Study on the prevention and strategy of disability in China. Procedia-Social and Behavioral Sciences, 2010, 2(5): 6906–6913 DOI:10.1016/j.sbspro.2010.05.041


Lu W. Research and suggestions on sign language teaching in Chinese universities. Journal of Changchun University, 2014


Leo M, Medioni G, Trivedi M, Kanade T, Farinella G M. Computer vision for assistive technologies. Computer Vision and Image Understanding, 2017, 154: 1–15 DOI:10.1016/j.cviu.2016.09.001


Lee Y H, Medioni G. RGB-D camera based wearable navigation system for the visually impaired. Computer Vision and Image Understanding, 2016, 149: 3–20 DOI:10.1016/j.cviu.2016.03.019


Kayukawa S, Higuchi K, Guerreiro J, Morishima S, Sato Y, Kitani K, Asakawa C. BBeep: a sonic collision avoidance system for blind travellers and nearby pedestrians. In: Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. Glasgow Scotland Uk, New York, NY, USA, ACM, 2019, 1–12 DOI:10.1145/3290605.3300282


Zeng J H, Sun Y R, Wang F. A natural hand gesture system for intelligent human-computer interaction and medical assistance. In: 2012 Third Global Congress on Intelligent Systems. Wuhan, China, IEEE, 2012, 382–385 DOI:10.1109/gcis.2012.60


Parmar P, Morris B T. Measuring the quality of exercises. In: 2016 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). Orlando, FL, USA, IEEE, 2016, 2241–2244 DOI:10.1109/embc.2016.7591175


Feng W G, Liu R, Zhu M. Fall detection for elderly person care in a vision-based home surveillance environment using a monocular camera. Signal, Image and Video Processing, 2014, 8(6): 1129–1138 DOI:10.1007/s11760-014-0645-4


Martinel N, Piciarelli C, Micheloni C. A supervised extreme learning committee for food recognition. Computer Vision and Image Understanding, 2016, 148: 67–86 DOI:10.1016/j.cviu.2016.01.012


Zhang S. Design research in Kinect based sign-language teaching system. East China Normal University. 2014


Sagawa H, Takeuchi M. A teaching system of Japanese sign language using sign language recognition and generation. In: Proceedings of the 10th ACM International Conference on Multimedia. Juan-les-Pins, France, New York, NY, USA, ACM, 2002, 137–145 DOI:10.1145/641007.641035


Phan H D, Ellis K, Dorin A. MIC, an interactive sign language teaching system. In: Proceedings of the 30th Australian Conference on Computer-Human Interaction. Melbourne Australia, New York, NY, USA, ACM, 2018, 544–547 DOI:10.1145/3292147.3292237




Jiang X W, Satapathy S C, Yang L X, Wang S H, Zhang Y D. A survey on artificial intelligence in Chinese sign language recognition. Arabian Journal for Science and Engineering, 2020, 45(12): 9859–9894 DOI:10.1007/s13369-020-04758-2


Pigou L, Dieleman S, Kindermans P J, Schrauwen B. Sign language recognition using convolutional neural networks. European Conference on Computer Vision. Springer, Cham, 2014, 572–578 DOI:10.1007/978-3-319-16178-5_40


Köpüklü O, Köse N, Rigoll G. Motion fused frames: data level fusion strategy for hand gesture recognition. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). Salt Lake City, UT, USA, IEEE, 2018, 2184–21848 DOI:10.1109/cvprw.2018.00284


Ji S W, Xu W, Yang M, Yu K. 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(1): 221–231 DOI:10.1109/TPAMI.2012.59


Li Y N, Miao Q G, Tian K, Fan Y Y, Xu X, Li R, Song J F. Large-scale gesture recognition with a fusion of RGB-D data based on the C3D model. In: 2016 23rd International Conference on Pattern Recognition (ICPR). Cancun, Mexico, IEEE, 2016, 25–30 DOI:10.1109/icpr.2016.7899602


Li Y N, Miao Q G, Tian K, Fan Y Y, Xu X, Li R, Song J F. Large-scale gesture recognition with a fusion of RGB-D data based on saliency theory and C3D model. IEEE Transactions on Circuits and Systems for Video Technology, 2018, 28(10): 2956–2964 DOI:10.1109/tcsvt.2017.2749509


Huang J, Zhou W G, Li H Q, Li W P. Attention-based 3D-CNNs for large-vocabulary sign language recognition. IEEE Transactions on Circuits and Systems for Video Technology, 2019, 29(9): 2822–2832 DOI:10.1109/tcsvt.2018.2870740


Cate H, Dalvi F, Hussain Z. Sign language recognition using temporal classification. 2017


Chai X J, Liu Z P, Yin F, Liu Z, Chen X L. Two streams recurrent neural networks for large-scale continuous gesture recognition. In: 2016 23rd International Conference on Pattern Recognition (ICPR). Cancun, Mexico, IEEE, 2016, 31–36 DOI:10.1109/icpr.2016.7899603


Bantupalli K, Xie Y. American sign language recognition using deep learning and computer vision. In: 2018 IEEE International Conference on Big Data (Big Data). Seattle, WA, USA, IEEE, 2018, 4896–4899 DOI:10.1109/bigdata.2018.8622141


Liao Y Q, Xiong P W, Min W D, Min W Q, Lu J H. Dynamic sign language recognition based on video sequence with BLSTM-3D residual networks. IEEE Access, 2019, 7: 38044–38054 DOI:10.1109/access.2019.2904749


General Administration of Quality Supervision, Inspection and Quarantine of the People's Republic of China. Basic Chinese Sign Language Gestures. China Standards Press, 2009


Min Y, Chai X, Zhao L, Chen X. FlickerNet: Adaptive 3D gesture recognition from sparse point clouds. British Machine Vision Conference. 2019, 105


Sun Y, Zheng L, Yang Y, Tian Q, Wang S. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). Proceedings of the European Conference on Computer Vision. 2018, 480–496 DOI:10.1007/978-3-030-01225-0_30




De Smedt Q, Wannous H, Vandeborre J P, Guerry J, Le Saux B, Filliat D. 3D hand gesture recognition using a depth and skeletal dataset: SHREC'17 track. In: 3Dor '17: Proceedings of the Workshop on 3D Object Retrieval. 2017, 33–38 DOI:10.2312/3dor.20171049


Min Y C, Zhang Y X, Chai X J, Chen X L. An efficient PointLSTM for point clouds based gesture recognition. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, WA, USA, IEEE, 2020, 5760–5769 DOI:10.1109/cvpr42600.2020.00580


De Smedt Q, Wannous H, Vandeborre J P. Skeleton-based dynamic hand gesture recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). Las Vegas, NV, USA, IEEE, 2016, 1206–1214 DOI:10.1109/cvprw.2016.153


Hou J X, Wang G J, Chen X H, Xue J H, Zhu R, Yang H Z. Spatial-temporal attention res-TCN for skeleton-based dynamic hand gesture recognition. Computer Vision-ECCV 2018 Workshops, 2019 DOI:10.1007/978-3-030-11024-6_18


Peng W, Hong X P, Chen H Y, Zhao G Y. Learning graph convolutional network for skeleton-based human action recognition by neural searching. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(3): 2669–2676 DOI:10.1609/aaai.v34i03.5652


Chen Y X, Zhao L, Peng X, Yuan J B, Metaxas D N. Construct dynamic graphs for hand gesture recognition via spatial-temporal attention. 2019


Luo Y. Love Sign Language (Version 1.3.5), 2020


Changsha Keenbow Information Technology Co., Ltd. Keenbow Sign Language (Version 1.4), 2020