
Available Online: 2021-02-26 | DOI: 10.3724/SP.J.2096-5796.20.00096

Adaptive cross-fusion learning for multi-modal gesture recognition

Abstract

Background
Gesture recognition has attracted significant attention because of its wide range of potential applications. Although multi-modal gesture recognition has made significant progress in recent years, a popular approach is still to simply fuse the prediction scores at the end of each branch. This late fusion often ignores the complementary features among different modalities in the early stages and does not combine them into a more discriminative representation.
Methods
This paper proposes an Adaptive Cross-modal Weighting (ACmW) scheme to exploit complementary features from RGB-D data. The scheme learns relations among different modalities by combining the features of different data streams. The proposed ACmW module contains two key functions: (1) fusing complementary features from multiple streams through an adaptive one-dimensional convolution; and (2) modeling the correlation of the multi-stream complementary features in the time dimension. Through the effective combination of these two functions, ACmW automatically analyzes the relationship between the complementary features from different streams and fuses them in both the spatial and temporal dimensions.
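To make the two functions concrete, the following is a minimal PyTorch sketch of a cross-modal weighting block in the spirit of ACmW: per-channel fusion weights learned by a 1D convolution over the two stream descriptors, followed by a 1D convolution along the time axis. The class name, tensor shapes, and the exact fusion rule are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of a cross-modal weighting block (not the authors' code).
import torch
import torch.nn as nn


class CrossModalWeighting(nn.Module):
    """Fuses frame-level RGB and depth features with channel weights learned
    by an adaptive 1D convolution, then models temporal correlation with a
    second 1D convolution over the time dimension."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        # 1D convolution over the 2-stream channel descriptor (function 1).
        self.channel_conv = nn.Conv1d(2, 1, kernel_size,
                                      padding=kernel_size // 2, bias=False)
        # 1D convolution along the time axis of the fused sequence (function 2).
        self.temporal_conv = nn.Conv1d(channels, channels, kernel_size,
                                       padding=kernel_size // 2, bias=False)

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        # rgb, depth: (batch, time, channels) frame-level features per stream.
        b, t, c = rgb.shape
        # Stack the two modalities as a 2-channel descriptor per feature slot.
        desc = torch.stack([rgb, depth], dim=2).reshape(b * t, 2, c)
        # Per-channel weight for the RGB stream; depth receives (1 - w).
        w = torch.sigmoid(self.channel_conv(desc)).reshape(b, t, c)
        fused = w * rgb + (1.0 - w) * depth
        # Model correlations of the fused features along the time dimension.
        return self.temporal_conv(fused.transpose(1, 2)).transpose(1, 2)


if __name__ == "__main__":
    block = CrossModalWeighting(channels=256)
    rgb_feat = torch.randn(4, 16, 256)   # 4 clips, 16 frames, 256-d features
    depth_feat = torch.randn(4, 16, 256)
    print(block(rgb_feat, depth_feat).shape)  # torch.Size([4, 16, 256])
```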
Results
Extensive experiments validate the effectiveness of the proposed method and show that it outperforms state-of-the-art methods on the IsoGD and NVGesture datasets.

Keyword

Gesture recognition; Multi-modal fusion; RGB-D

Cite this article

Benjia ZHOU, Jun WAN, Yanyan LIANG, Guodong GUO. Adaptive cross-fusion learning for multi-modal gesture recognition. Virtual Reality & Intelligent Hardware, 2021, 3(3): 235-247. DOI: 10.3724/SP.J.2096-5796.20.00096
