Adaptive cross-fusion learning for multimodal gesture recognition
1. Macau University of Science and Technology, Macau 999078, China
2. National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
3. Baidu Research, Beijing 100193, China, and National Engineering Laboratory for Deep Learning Technology and Application, Beijing 100193, China
Abstract
Keywords: Gesture recognition; Multimodal fusion; RGB-D
Content
Stage  Feature Streams  Fusion  Output Size  FLOPs (C3D)  FLOPs (ACmW)  #Parameters (C3D)  #Parameters (ACmW)
Stage1  RGB Features + Depth Features  Spatial Temporal $\odot$  $N\times 64\times 32\times 56\times 56$ / $N\times 64\times 32\times 56\times 56$ / $N\times 64\times 32\times 56\times 56$  8.6 GB  154.2 MB  5.3 KB  3.1 KB
Stage2  RGB Features + Depth Features  Spatial Temporal $\odot$  $N\times 128\times 16\times 28\times 28$ / $N\times 128\times 16\times 28\times 28$ / $N\times 128\times 16\times 28\times 28$  88.9 GB  38.5 MB  221.6 KB  772.0 B
Stage3  RGB Features + Depth Features  Spatial Temporal $\odot$  $N\times 256\times 8\times 14\times 14$ / $N\times 256\times 8\times 14\times 14$ / $N\times 256\times 8\times 14\times 14$  133.3 GB  9.6 MB  2.7 MB  196.0 B
Stage4  RGB Features + Depth Features  Spatial Temporal $\odot$  $N\times 512\times 4\times 7\times 7$ / $N\times 512\times 4\times 7\times 7$ / $N\times 512\times 4\times 7\times 7$  66.6 GB  2.4 MB  10.6 MB  52.0 B
Stage5  RGB Features + Depth Features  Spatial Temporal $\odot$  $N\times 512\times 1\times 4\times 4$ / $N\times 512\times 1\times 4\times 4$ / $N\times 512\times 1\times 4\times 4$  11.1 GB  196.6 KB  14.2 MB  7.0 B
Total  —  —  —  308.7 GB  204.9 MB  79.0 MB  4.1 KB
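The output sizes and the ACmW FLOPs column in the table above can be cross-checked with a little arithmetic. The sketch below is only a reading aid, not the authors' accounting: it assumes that "MB" and "KB" in the FLOPs column stand for 1e6 and 1e3 operations (the table does not say so), and the names `STAGE_SHAPES`, `ACMW_FLOPS`, `elements_per_sample`, and `flops_per_element` are hypothetical. Under that assumption, it divides each reported ACmW cost by the number of elements in one sample's feature map at that stage.

```python
from math import prod

# Per-sample output sizes (channels, frames, height, width) copied from the
# table; the leading N (batch size) is left out.
STAGE_SHAPES = {
    "Stage1": (64, 32, 56, 56),
    "Stage2": (128, 16, 28, 28),
    "Stage3": (256, 8, 14, 14),
    "Stage4": (512, 4, 7, 7),
    "Stage5": (512, 1, 4, 4),
}

# ACmW FLOPs column copied from the table.
# Assumption: "MB" is read as 1e6 and "KB" as 1e3 operations.
ACMW_FLOPS = {
    "Stage1": 154.2e6,
    "Stage2": 38.5e6,
    "Stage3": 9.6e6,
    "Stage4": 2.4e6,
    "Stage5": 196.6e3,
}


def elements_per_sample(shape):
    """Number of values in one sample's feature map of the given shape."""
    return prod(shape)


def flops_per_element(stage):
    """Reported ACmW cost divided by the feature-map size of that stage."""
    return ACMW_FLOPS[stage] / elements_per_sample(STAGE_SHAPES[stage])


if __name__ == "__main__":
    for stage, shape in STAGE_SHAPES.items():
        count = elements_per_sample(shape)
        ratio = flops_per_element(stage)
        print(f"{stage}: {count} elements, {ratio:.1f} operations per element")
```

With these inputs the ratio stays close to 24 at every stage, which suggests that the reported ACmW cost grows linearly with feature-map size. That constant is an observation drawn from the table's rounded figures, not something the source states.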
Dataset  RGB  Depth  Fusion Strategy (Score Fusion)  Fusion Strategy (Multiplicative Fusion)  Fusion Strategy (ACmW)
IsoGD  53.18%  54.22%  58.10%  57.42%  59.97%
NVGesture  78.54%  80.83%  82.16%  81.33%  83.96%
Dataset  RGB  Depth  Fusion Strategy (Score Fusion)  Fusion Strategy (Multiplicative Fusion)  Fusion Strategy (ACmW)
IsoGD  47.12%  49.39%  54.12%  53.45%  56.23%
NVGesture  75.83%  77.29%  79.67%  76.97%  81.32%
Method  Modality  Accuracy (%) 

cConvNet^{[44]}  RGB-D  44.80
Pyramidal C3D^{[45]}  RGB-D  45.02
2SCVN+3DDSN^{[46]}  RGB-D  49.17
32-frame C3D^{[47]}  RGB-D  49.20
AHL^{[48]}  RGB-D  54.14
3DCNN+LSTM^{[49]}  RGB-D  55.29
ACmW (Ours)  RGB-D  59.97
Method  Modality  Accuracy (%) 

$\mathrm{HOG}+\mathrm{HOG}^{2}$^{[50]}  RGB-D  36.90
I3D^{[51]}  RGB-D  83.82
ACmW (Ours)  RGB-D  83.96
Reference
Liu X, Shi H L, Hong X P, Chen H Y, Tao D C, Zhao G Y. 3D skeletal gesture recognition via hidden states exploration. IEEE Transactions on Image Processing, 2020, 29: 4583–4597 DOI:10.1109/tip.2020.2974061
Liu X, Zhao G. 3D skeletal gesture recognition via discriminative coding on time-warping invariant Riemannian trajectories. IEEE Transactions on Multimedia, 2020, 99: 1 DOI:10.1109/TMM.2020.3003783
Rautaray S S, Agrawal A. Vision based hand gesture recognition for human computer interaction: a survey. Artificial Intelligence Review, 2015, 43(1): 1–54 DOI:10.1007/s10462-012-9356-9
Weissmann J, Salomon R. Gesture recognition for virtual reality applications using data gloves and neural networks. In: IJCNN. IEEE, 1999
Sun Y, Xu C, Li G F, Xu W F, Kong J Y, Jiang D, Tao B, Chen D S. Intelligent human computer interaction based on non redundant EMG signal. Alexandria Engineering Journal, 2020, 59(3): 1149–1157 DOI:10.1016/j.aej.2020.01.015
Miao Q, Li Y, Ouyang W, Ma Z, Cao X. Multimodal gesture recognition based on the ResC3D network. In: 2017 IEEE International Conference on Computer Vision Workshop (ICCVW). IEEE, 2017
Molchanov P, Yang X, Gupta S, Kim K, Kautz J. Online detection and classification of dynamic hand gestures with recurrent 3D convolutional neural networks. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2016
Roitberg A, Pollert T, Haurilet M, Martin M, Stiefelhagen R. Analysis of deep fusion strategies for multimodal gesture recognition. In: CVPR Workshops. 2019
Wang P C, Li W Q, Ogunbona P, Wan J, Escalera S. RGB-D-based human motion recognition with deep learning: a survey. Computer Vision and Image Understanding, 2018, 171: 118–139 DOI:10.1016/j.cviu.2018.04.007
Li Y, Miao Q, Tian K, Fan Y, Xu X, Li R, Song J. Large-scale gesture recognition with a fusion of RGB-D data based on the C3D model. In: ICPR. IEEE, 2016
Neverova N, Wolf C, Taylor G, Nebout F. ModDrop: adaptive multimodal gesture recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016, 38(8): 1692–1706 DOI:10.1109/tpami.2015.2461544
Pitsikalis V, Katsamanis A, Theodorakis S, Maragos P. Multimodal gesture recognition via multiple hypotheses rescoring. In: Gesture Recognition. Springer International Publishing, 2017, 467–496 DOI:10.1007/978-3-319-57021-1_16
Wang P, Li W, Liu S, Gao Z, Tang C, Ogunbona P. Large-scale isolated gesture recognition using convolutional neural networks. In: ICPR. IEEE, 2016
Zhu G M, Zhang L, Shen P Y, Song J. Multimodal gesture recognition using 3D convolution and convolutional LSTM. IEEE Access, 2017, 5: 4517–4524 DOI:10.1109/access.2017.2684186
Sun S, Pang J, Shi J, Yi S, Ouyang W. Fishnet: A versatile backbone for image, region, and pixel level prediction. In: NIPS. 2018
Narayana P, Beveridge J R, Draper B A. Gesture recognition: focus on the hands. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 2018
Malgireddy M R, Nwogu I, Govindaraju V. A temporal Bayesian model for classifying, detecting and localizing activities in video sequences. In: Computer Vision & Pattern Recognition Workshops. IEEE, 2012
Wan J, Guo G D, Li S Z. Explore efficient local features from RGB-D data for one-shot learning gesture recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016, 38(8): 1626–1639 DOI:10.1109/tpami.2015.2513479
Wan J, Ruan Q Q, Li W, An G Y, Zhao R Z. 3D SMoSIFT: three-dimensional sparse motion scale invariant feature transform for activity recognition from RGB-D videos. Journal of Electronic Imaging, 2014, 23(2): 023017 DOI:10.1117/1.jei.23.2.023017
Wan J, Ruan Q Q, Li W, Deng S. One-shot learning gesture recognition from RGB-D data using bag of features. In: Gesture Recognition. Springer International Publishing, 2017, 329–364 DOI:10.1007/978-3-319-57021-1_11
Ji X P, Cheng J, Tao D P, Wu X Y, Feng W. The spatial Laplacian and temporal energy pyramid representation for human action recognition using depth sequences. Knowledge-Based Systems, 2017, 122: 64–74 DOI:10.1016/j.knosys.2017.01.035
Zhuang L, Liu Z, Chai X, Chen X. Continuous gesture recognition with hand-oriented spatiotemporal feature. In: IEEE International Conference on Computer Vision Workshop. IEEE Computer Society, 2017
Simonyan K, Zisserman A. Two-stream convolutional networks for action recognition in videos. In: NIPS. 2014
Wang P, Li W, Wan J, Ogunbona P, Liu X. Cooperative training of deep aggregation networks for RGB-D action recognition. In: AAAI. 2018
Zhang L, Zhu G, Mei L, Shen P, Shah S A A, Bennamoun M. Attention in convolutional LSTM for gesture recognition. In: NIPS. 2018
Duan H J, Sun Y, Cheng W T, Jiang D, Yun J T, Liu Y, Liu Y B, Zhou D L. Gesture recognition based on multimodal feature weight. Concurrency and Computation: Practice and Experience, 2020 DOI:10.1002/cpe.5991
He Y, Li G F, Liao Y J, Sun Y, Kong J Y, Jiang G Z, Jiang D, Tao B, Xu S, Liu H H. Gesture recognition based on an improved local sparse representation classification algorithm. Cluster Computing, 2019, 22(S5): 10935–10946 DOI:10.1007/s10586-017-1237-1
Jiang D, Zheng Z J, Li G F, Sun Y, Kong J Y, Jiang G Z, Xiong H G, Tao B, Xu S, Yu H, Liu H H, Ju Z J. Gesture recognition based on binocular vision. Cluster Computing, 2019, 22(S6): 13261–13271 DOI:10.1007/s10586-018-1844-5
Jiang D, Li G F, Sun Y, Kong J Y, Tao B. Gesture recognition based on skeletonization algorithm and CNN with ASL database. Multimedia Tools and Applications, 2019, 78(21): 29953–29970 DOI:10.1007/s11042-018-6748-0
Tran D, Bourdev L, Fergus R, Torresani L, Paluri M. Learning spatiotemporal features with 3D convolutional networks. In: ICCV. 2015
Zhu G M, Zhang L, Lin Y. Redundancy and attention in convolutional LSTM for gesture recognition. IEEE Transactions on Neural Networks and Learning Systems, 2019
Yang X, Molchanov P, Kautz J. Making convolutional networks recurrent for visual sequence learning. In: CVPR. 2018
Wang H, Wang P, Song Z, Li W. Large-scale multimodal gesture recognition using heterogeneous networks. In: ICCV Workshops. 2017
Zhang L, Zhu G, Shen P, Song J, Shah S A, Bennamoun M. Learning spatiotemporal features using 3DCNN and convolutional LSTM for gesture recognition. In: ICCV. 2017
Zhu G, Zhang L, Mei L, Shao J, Song J, Shen P. Large-scale isolated gesture recognition using pyramidal 3D convolutional networks. In: ICPR. IEEE, 2016
Li Y, Miao Q, Tian K, Fan Y, Xu X, Li R, Song J. Large-scale gesture recognition with a fusion of RGB-D data based on saliency theory and C3D model. IEEE Transactions on Circuits and Systems for Video Technology, 2018, 28(10): 2956–2964
Kopuklu O, Kose N, Rigoll G. Motion fused frames: data level fusion strategy for hand gesture recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2018
Hu J, Shen L, Sun G. Squeeze-and-excitation networks. In: CVPR. IEEE, 2018
Hu T K, Lin Y Y, Hsiu P C. Learning adaptive hidden layers for mobile gesture recognition. In: AAAI. 2018
Wan J, Zhao Y, Zhou S, Guyon I, Escalera S, Li S Z. ChaLearn looking at people RGB-D isolated and continuous datasets for gesture recognition. In: CVPR Workshops. 2016
Paszke A, Gross S, Chintala S, Chanan G, Yang E, DeVito Z, Lin Z, Desmaison A, Antiga L, Lerer A. Automatic differentiation in PyTorch. In: NIPS 2017 Workshop Autodiff Submission. 2017
Li Y N, Miao Q G, Tian K, Fan Y Y, Xu X, Ma Z X, Song J F. Large-scale gesture recognition with a fusion of RGB-D data based on optical flow and the C3D model. Pattern Recognition Letters, 2019, 119: 187–194 DOI:10.1016/j.patrec.2017.12.003
Carreira J, Zisserman A. Quo vadis, action recognition? a new model and the kinetics dataset. In: CVPR. IEEE, 2017
Wang P, Li W, Wan J, Ogunbona P, Liu X. Cooperative training of deep aggregation networks for RGB-D action recognition. In: AAAI. 2018
Zhu G, Zhang L, Mei L, Shao J, Song J, Shen P. Large-scale isolated gesture recognition using pyramidal 3D convolutional networks. In: 2016 23rd International Conference on Pattern Recognition (ICPR). IEEE, 2016
Duan J L, Wan J, Zhou S, Guo X Y, Li S Z. A unified framework for multimodal isolated gesture recognition. ACM Transactions on Multimedia Computing, Communications, and Applications, 2018, 14(1s): 1–16 DOI:10.1145/3131343
Li Y, Miao Q, Tian K, Fan Y, Xu X, Li R, Song J. Large-scale gesture recognition with a fusion of RGB-D data based on the C3D model. In: 2016 23rd International Conference on Pattern Recognition (ICPR). 2016
Hu T K, Lin Y Y, Hsiu P C. Learning adaptive hidden layers for mobile gesture recognition. In: AAAI. 2018
Zhang L, Zhu G, Shen P, Song J, Shah S A, Bennamoun M. Learning spatiotemporal features using 3DCNN and convolutional LSTM for gesture recognition. In: ICCV. 2017
Ohn-Bar E, Trivedi M M. Hand gesture recognition in real time for automotive interfaces: a multimodal vision-based approach and evaluations. IEEE Transactions on Intelligent Transportation Systems, 2014, 15(6): 2368–2377 DOI:10.1109/tits.2014.2337331
Carreira J, Zisserman A. Quo vadis, action recognition? a new model and the kinetics dataset. In: CVPR. IEEE, 2017