2021, 3(1): 55-64. Published Date: 2021-02-20

DOI: 10.1016/j.vrih.2020.11.005

Learning long-term temporal contexts using skip RNN for continuous emotion recognition


Abstract:

Background
Continuous emotion recognition assigns an emotional value to every frame of a sequence as a function of time. Incorporating long-term temporal context information is essential for this task.
Methods
For this purpose, we employ a window of feature frames, rather than a single frame, as the input to strengthen temporal modeling at the feature level. Frame skipping and temporal pooling are applied to alleviate the resulting redundancy. At the model level, we leverage the skip recurrent neural network (skip RNN), which skips over trivial information, to model long-term temporal variability for continuous emotion recognition.
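
To make the feature-level idea concrete, the sketch below stacks a window of frames around each time step and then applies frame skipping or temporal pooling to cut the resulting redundancy. This is a minimal sketch, assuming per-frame features in a (T, D) array; the function names, window size, and skip stride are illustrative choices, not the paper's exact settings.

```python
import numpy as np

def windowed_features(feats, win=5, skip=2):
    """Stack a window of frames around each step (feature-level context),
    keeping only every `skip`-th frame in the window to reduce redundancy.
    feats: (T, D) per-frame features -> (T, ceil(win / skip) * D).
    """
    T, D = feats.shape
    half = win // 2
    padded = np.pad(feats, ((half, half), (0, 0)), mode="edge")
    out = []
    for t in range(T):
        window = padded[t:t + win]              # `win` frames of context around frame t
        out.append(window[::skip].reshape(-1))  # frame skipping inside the window
    return np.stack(out)

def pooled_features(feats, win=5):
    """Temporal pooling variant: mean-pool each window into one D-dim vector."""
    T, D = feats.shape
    half = win // 2
    padded = np.pad(feats, ((half, half), (0, 0)), mode="edge")
    return np.stack([padded[t:t + win].mean(axis=0) for t in range(T)])
```

With win=5 and skip=2, each input still covers five frames of context but stacks only three of them; the pooling variant compresses the window back to a single D-dimensional vector.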
Results
Experimental results on the AVEC 2017 database demonstrate that the proposed methods improve performance. Furthermore, the skip long short-term memory (LSTM) model can focus on critical emotional states during training, thereby outperforming the standard LSTM model and other methods.
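
For readers unfamiliar with the model-level mechanism, the following is a simplified skip-LSTM forward pass in the spirit of Campos et al.'s skip RNN: a binary gate either runs the LSTM cell or copies the previous state, and the update probability accumulates over skipped steps until the gate fires again. The class name and layer shapes are illustrative assumptions, and the straight-through gradient estimator used to train the binary gate is omitted here.

```python
import torch
import torch.nn as nn

class SkipLSTM(nn.Module):
    """Simplified skip-LSTM: a binary gate u_t decides per step whether to
    run the LSTM cell (update) or copy the previous hidden/cell state (skip)."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.cell = nn.LSTMCell(input_size, hidden_size)
        self.delta_p = nn.Linear(hidden_size, 1)  # increment of the update probability

    def forward(self, x):  # x: (T, B, input_size)
        T, B, _ = x.shape
        h = x.new_zeros(B, self.cell.hidden_size)
        c = x.new_zeros(B, self.cell.hidden_size)
        p = x.new_ones(B, 1)  # cumulative update probability; starts at 1 -> update
        outputs = []
        for t in range(T):
            u = (p >= 0.5).float()            # binarized update gate
            new_h, new_c = self.cell(x[t], (h, c))
            h = u * new_h + (1 - u) * h       # update or copy the state
            c = u * new_c + (1 - u) * c
            dp = torch.sigmoid(self.delta_p(h))
            # after an update, restart from dp; while skipping, accumulate dp
            p = u * dp + (1 - u) * torch.clamp(p + dp, max=1.0)
            outputs.append(h)
        return torch.stack(outputs)           # (T, B, hidden_size)
```

Because skipped steps copy the state exactly, gradients flow through fewer cell updates, which nudges the model to spend its updates on emotionally salient frames.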
Keywords: Continuous emotion recognition; Skip RNN; Temporal contexts; Redundancy

Cite this article:

Jian HUANG, Bin LIU, Jianhua TAO. Learning long-term temporal contexts using skip RNN for continuous emotion recognition. Virtual Reality & Intelligent Hardware, 2021, 3(1): 55-64. DOI: 10.1016/j.vrih.2020.11.005

