Adaptive cross-fusion learning for multimodal gesture recognition
1. Macau University of Science and Technology, Macau 999078, China
2. National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
3. Baidu Research, Beijing 100193, China, and National Engineering Laboratory for Deep Learning Technology and Application, Beijing 100193, China
Abstract
Keywords: Gesture recognition; Multimodal fusion; RGBD
Content
Stage  Feature Streams  Fusion  Output Size  FLOPs (C3D)  FLOPs (ACmW)  #Parameters (C3D)  #Parameters (ACmW)
Stage1  RGB Features + Depth Features  Spatial, Temporal, $\odot$  $N\times 64\times 32\times 56\times 56$  8.6 GB  154.2 MB  5.3 KB  3.1 KB
Stage2  RGB Features + Depth Features  Spatial, Temporal, $\odot$  $N\times 128\times 16\times 28\times 28$  88.9 GB  38.5 MB  221.6 KB  772.0 B
Stage3  RGB Features + Depth Features  Spatial, Temporal, $\odot$  $N\times 256\times 8\times 14\times 14$  133.3 GB  9.6 MB  2.7 MB  196.0 B
Stage4  RGB Features + Depth Features  Spatial, Temporal, $\odot$  $N\times 512\times 4\times 7\times 7$  66.6 GB  2.4 MB  10.6 MB  52.0 B
Stage5  RGB Features + Depth Features  Spatial, Temporal, $\odot$  $N\times 512\times 1\times 4\times 4$  11.1 GB  196.6 KB  14.2 MB  7.0 B
Total  -  -  -  308.7 GB  204.9 MB  79.0 MB  4.1 KB

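As a quick consistency check on the table above, the per-stage ACmW figures can be summed after unit conversion and compared with the Total row. This is a minimal sketch, assuming decimal prefixes (1 KB = 1000 B), which the table does not state; the helper `to_base` and the variable names are hypothetical.

```python
# Unit multipliers. Decimal prefixes are an assumption; the table does not say.
UNIT = {"B": 1, "KB": 1e3, "MB": 1e6, "GB": 1e9}

def to_base(value, unit):
    """Convert a (value, unit) pair from the table to base units."""
    return value * UNIT[unit]

# Per-stage ACmW columns, copied from the table (Stage1 to Stage5).
acmw_flops = [(154.2, "MB"), (38.5, "MB"), (9.6, "MB"), (2.4, "MB"), (196.6, "KB")]
acmw_params = [(3.1, "KB"), (772.0, "B"), (196.0, "B"), (52.0, "B"), (7.0, "B")]

total_flops = sum(to_base(v, u) for v, u in acmw_flops)
total_params = sum(to_base(v, u) for v, u in acmw_params)

print(f"ACmW FLOPs total: {total_flops / 1e6:.1f} MB")
print(f"ACmW parameter total: {total_params / 1e3:.1f} KB")
```

Under these assumptions the sums come to 204.9 MB and 4.1 KB, matching the Total row.
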
Dataset  RGB  Depth  Fusion Strategy (Score Fusion)  Fusion Strategy (Multiplicative Fusion)  Fusion Strategy (ACmW)
IsoGD  53.18%  54.22%  58.10%  57.42%  59.97%
NVGesture  78.54%  80.83%  82.16%  81.33%  83.96%

Dataset  RGB  Depth  Fusion Strategy (Score Fusion)  Fusion Strategy (Multiplicative Fusion)  Fusion Strategy (ACmW)
IsoGD  47.12%  49.39%  54.12%  53.45%  56.23%
NVGesture  75.83%  77.29%  79.67%  76.97%  81.32%

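The table above compares ACmW against two baseline fusion strategies without defining them. For orientation, here is a minimal sketch of how such baselines are commonly implemented as late fusion of per-class scores. It assumes score fusion averages the two softmax outputs and multiplicative fusion takes their renormalised element-wise product; the paper's versions may differ (for example, by operating on features rather than scores). All function names and the toy logits are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def score_fusion(p_rgb, p_depth):
    """Average the per-class probabilities of the two streams."""
    return [(a + b) / 2 for a, b in zip(p_rgb, p_depth)]

def multiplicative_fusion(p_rgb, p_depth):
    """Element-wise product of the two streams, renormalised to sum to 1."""
    prod = [a * b for a, b in zip(p_rgb, p_depth)]
    s = sum(prod)
    return [x / s for x in prod]

def argmax(xs):
    """Index of the largest element."""
    return max(range(len(xs)), key=xs.__getitem__)

# Toy logits for three gesture classes (made up for illustration).
p_rgb = softmax([2.0, 1.0, 0.1])
p_depth = softmax([0.5, 2.5, 0.3])

fused_avg = score_fusion(p_rgb, p_depth)
fused_mul = multiplicative_fusion(p_rgb, p_depth)
print("score fusion predicts class", argmax(fused_avg))
print("multiplicative fusion predicts class", argmax(fused_mul))
```

Averaging lets a confident stream outvote an uncertain one, whereas the product penalises any class that either stream considers unlikely.
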
Method  Modality  Accuracy (%)
cConvNet [44]  RGBD  44.80
Pyramidal C3D [45]  RGBD  45.02
2SCVN+3DDSN [46]  RGBD  49.17
32-frame C3D [47]  RGBD  49.20
AHL [48]  RGBD  54.14
3DCNN+LSTM [49]  RGBD  55.29
ACmW (Ours)  RGBD  59.97

Method  Modality  Accuracy (%)
HOG+HOG$^{2}$ [50]  RGBD  36.90
I3D [51]  RGBD  83.82
ACmW (Ours)  RGBD  83.96
