Gesture recognition has attracted significant attention because of its wide range of potential applications. Although multi-modal gesture recognition has made significant progress in recent years, a popular approach is still to simply fuse the prediction scores at the end of each branch. Such late fusion often ignores complementary features among the modalities at early stages and does not combine them into a more discriminative feature.
This paper proposes an Adaptive Cross-modal Weighting (ACmW) scheme to exploit complementary features from RGB-D data. The scheme learns relations among different modalities by combining the features of different data streams. The proposed ACmW module has two key functions: (1) fusing complementary features from multiple streams through an adaptive one-dimensional convolution; and (2) modeling the correlation of multi-stream complementary features in the time dimension. By combining these two functions, ACmW can automatically analyze the relationship between the complementary features from different streams and fuse them in the spatial and temporal dimensions.
Extensive experiments validate the effectiveness of the proposed method and show that it outperforms state-of-the-art methods on the IsoGD and NVGesture datasets.
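To make the two functions of the ACmW module more concrete, the following is a minimal, illustrative sketch in plain Python. It is not the authors' implementation. It assumes each stream yields one feature vector per frame, that the one-dimensional convolution slides over the channel axis to produce a per-channel gate, and that temporal correlation can be approximated by softmax-normalized dot-product similarity between frames. All function names (`conv1d_same`, `fuse_frame`, `temporal_correlation`, `acmw_sketch`) and the fixed odd-length kernels are hypothetical; in a trained model the kernels would be learned.

```python
import math


def conv1d_same(x, kernel):
    """1-D cross-correlation with zero padding; output has len(x) entries.

    Assumes an odd kernel length so the window is centred on each element.
    """
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(x))
    ]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def fuse_frame(rgb, depth, k_rgb, k_depth):
    """Function (1), sketched: gate each channel using both streams.

    The gate w lies in (0, 1); the fused channel is w * rgb + (1 - w) * depth.
    """
    a = conv1d_same(rgb, k_rgb)
    b = conv1d_same(depth, k_depth)
    gate = [sigmoid(u + v) for u, v in zip(a, b)]
    return [w * r + (1.0 - w) * d for w, r, d in zip(gate, rgb, depth)]


def temporal_correlation(frames):
    """Function (2), sketched: mix frames by their pairwise similarity.

    Each output frame is a softmax-weighted average of all input frames,
    with weights given by dot-product similarity to the current frame.
    """
    out = []
    for q in frames:
        scores = [sum(a * b for a, b in zip(q, f)) for f in frames]
        m = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        attn = [e / z for e in exps]
        out.append(
            [sum(a * f[c] for a, f in zip(attn, frames)) for c in range(len(q))]
        )
    return out


def acmw_sketch(rgb_seq, depth_seq, k_rgb, k_depth):
    """Fuse the streams frame by frame, then relate the fused frames in time."""
    fused = [fuse_frame(r, d, k_rgb, k_depth) for r, d in zip(rgb_seq, depth_seq)]
    return temporal_correlation(fused)


if __name__ == "__main__":
    # Toy input: 3 frames, 4 channels per stream (values are made up).
    rgb_seq = [[0.2, 0.8, 0.1, 0.5], [0.3, 0.7, 0.2, 0.4], [0.9, 0.1, 0.6, 0.3]]
    depth_seq = [[0.6, 0.1, 0.4, 0.2], [0.5, 0.2, 0.3, 0.3], [0.1, 0.9, 0.2, 0.7]]
    kernel = [0.25, 0.5, 0.25]  # hypothetical fixed weights, not learned
    for frame in acmw_sketch(rgb_seq, depth_seq, kernel, kernel):
        print([round(v, 3) for v in frame])
```

In this sketch the fusion happens at the feature level, before any prediction scores exist, which is the contrast with score-level late fusion that the abstract draws. Because the gate depends on both streams, the contribution of each modality can differ per channel and per frame.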