Learning long-term temporal contexts using skip RNN for continuous emotion recognition
1. National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100049, China
2. CAS Center for Excellence in Brain Science and Intelligence Technology, Beijing 100049, China
3. School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China
Abstract
Keywords: Continuous emotion recognition ; Skip RNN ; Temporal contexts ; Redundancy
Content



Arousal | Valence | |||
---|---|---|---|---|
Audio | Visual | Audio | Visual | |
Original | 0.473 | 0.589 | 0.423 | 0.650 |
Context1 | 0.431 | 0.571 | 0.431 | 0.655 |
Context2 | 0.484 | 0.584 | 0.469 | 0.695 |
Context3 | 0.512 | 0.602 | 0.455 | 0.736 |
Context4 | 0.464 | 0.611 | 0.436 | 0.701 |
Arousal | Valence | |||
---|---|---|---|---|
Audio | Visual | Audio | Visual | |
Original | 0.487 | 0.483 | 0.420 | 0.634 |
Context1 | 0.473 | 0.472 | 0.415 | 0.641 |
Context2 | 0.465 | 0.622 | 0.447 | 0.690 |
Context3 | 0.531 | 0.603 | 0.427 | 0.698 |
Context4 | 0.492 | 0.562 | 0.441 | 0.666 |
Arousal | Valence | |||
---|---|---|---|---|
Audio | Visual | Audio | Visual | |
Original | 59.17±8.89 | 52.91±5.90 | 84.43±1.56 | 59.09±0.92 |
Context1 | 65.00±2.74 | 50.02±0.73 | 59.02±3.78 | 45.27±3.53 |
Context2 | 89.02±1.93 | 69.06±1.91 | 88.39±3.10 | 81.56±3.03 |
Context3 | 93.01±1.93 | 91.54±2.04 | 95.29±1.66 | 89.03±1.07 |
Context4 | 99.29±1.83 | 76.92±3.95 | 99.99±0.01 | 82.55±0.94 |

Reference
Tao J, Tan T. Affective computing: a review. In: Proceedings of International Conference on Affective Computing and Intelligent Interaction. Springer, 2005
Gunes H, Pantic M. Automatic, dimensional and continuous emotion recognition. International Journal of Synthetic Emotions, 2010, 1(1): 68–99 DOI:10.4018/jse.2010101605
Huang J, Li Y, Tao J, Lian Z, Yi J. Continuous multimodal emotion prediction based on long short term memory recurrent neural network. In: Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge. ACM, 2017
Chen S Z, Jin Q, Zhao J M, Wang S. Multimodal multi-task learning for dimensional and continuous emotion recognition. In: Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge-AVEC '17. View Mountain, California, USA, York New, Press ACM, 2017 DOI:10.1145/3133944.3133949
Valstar M, Pantic M, Gratch J, Schuller B, Ringeval F, Lalanne D, Torres Torres M, Scherer S, Stratou G, Cowie R. AVEC 2016: depression, mood, and emotion recognition workshop and challenge. In: Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge-AVEC '16. Amsterdam, Netherlands The, NewYork, Press ACM, 2016 DOI:10.1145/2988257.2988258
Le D, Provost E M. Emotion recognition from spontaneous speech using hidden markov models with deep belief networks in workshop on automatic speech recognition and understanding (ASRU). IEEE, 2013
Cardinal P, Dehak N, Koerich A L, Alam J, Boucher P. ETS system for AV+EC 2015 challenge. In: Proceedings of the 5th International Workshop on Audio/Visual Emotion Challenge-AVEC '15. Brisbane, Australia, NewYork, Press ACM, 2015 DOI:10.1145/2808196.2811639
Povolny F, Matejka P, Hradis M, Popková A, Otrusina L, Smrz P, Wood I, Robin C, Lamel L. Multimodal emotion recognition for AVEC 2016 challenge. In: Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge-AVEC '16. Amsterdam, TheNetherlands, NewYork, Press ACM, 2016 DOI:10.1145/2988257.2988268
Han J, Zhang Z X, Schmitt M, Ren Z, Ringeval F, Schuller B. Bags in bag: generating context-aware bags for tracking emotions from speech. In: Interspeech 2018. ISCA, 2018 DOI:10.21437/interspeech.2018-996
Hermansky H, Sharma S. TRAPS-classifiers of temporal patterns. In: Proceedings of International Conference on Spoken Language Processing. 1998
Chen B, Zhu Q, Morgan N. Learning long-term temporal features in LVCSR using neural networks. In: Proceedings of International Conference on Spoken Language Processing. 2004
Zeyer A, Irie K, Schlüter R, Ney H. Improved training of end-to-end attention models for speech recognition. In: Interspeech 2018. ISCA, 2018 DOI:10.21437/interspeech.2018-1616
Wöllmer M, Kaiser M, Eyben F, Schuller B, Rigoll G. LSTM-Modeling of continuous emotions in an audiovisual affect recognition framework. Image and Vision Computing, 2013, 31(2): 153–163 DOI:10.1016/j.imavis.2012.03.001
Khorram S, Aldeneh Z, Dimitriadis D, McInnis M, Provost E M. Capturing long-term temporal dependencies with convolutional networks for continuous emotion recognition. In: Interspeech 2017. ISCA, 2017 DOI:10.21437/interspeech.2017-548
Zhang T, Zheng W M, Cui Z, Zong Y, Li Y. Spatial–temporal recurrent neural network for emotion recognition. IEEE Transactions on Cybernetics, 2019, 49(3): 839–847 DOI:10.1109/tcyb.2017.2788081
Li P C, Song Y, McLoughlin I, Guo W, Dai L R. An attention pooling based representation learning method for speech emotion recognition. In: Interspeech 2018. ISCA, 2018 DOI:10.21437/interspeech.2018-1242
Trigeorgis G, Ringeval F, Brueckner R, Marchi E, Zafeirio S. features? Adieu End-to-end speech emotion recognition using a deep convolutional recurrent network. In: IEEE International Conference on Acoustics. IEEE, 2016
Huang J, Li Y, Tao J H, Lian Z, Niu M Y, Yang M H. Multimodal continuous emotion recognition with data augmentation using recurrent neural networks. In: Proceedings of the 2018 on Audio/Visual Emotion Challenge and Workshop-AVEC'18. Seoul, Republic of Korea, New York, ACM Press, 2018 DOI:10.1145/3266302.3266304
Myer S, Tomar V S. Efficient keyword spotting using time delay neural networks. In: Interspeech 2018. ISCA, 2018 DOI:10.21437/interspeech.2018-1979
Yu A W, Lee H, Le Q V. Learning to skim text. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. 2017
Seo M, Min S, Farhadi A. Neural speed reading via skim-RNN. In: ICLR. 2018
Campos V, Jou B, Giro-i-Nieto X, Torres J, Chang S. Skip RNN: learning to skip state updates in recurrent neural networks. In: ICLR. 2017
Aldeneh Z, Provost E M. Using regional saliency for speech emotion recognition. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2017
Ringeval F, Pantic M, Schuller B, Valstar M, Gratch J, Cowie R, Scherer S, Mozgai S, Cummins N, Schmitt M. AVEC 2017: real-life depression, and affect recognition workshop and challenge. In: Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge-AVEC '17. MountainView, California, USA, NewYork, Press ACM, 2017 DOI:10.1145/3133944.3133953
Eyben F, Scherer K R, Schuller B W, Sundberg J, Andre E, Busso C, Devillers L Y, Epps J, Laukka P, Narayanan S S, Truong K P. The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing. IEEE Transactions on Affective Computing, 2016, 7(2): 190–202 DOI:10.1109/taffc.2015.2457417
Eyben F, Wöllmer M, Schuller B. Opensmile: the Munich versatile and fast open-source audio feature extractor. In: Proceedings of the International Conference on Multimedia-MM '10. Firenze, Italy, NewYork, Press ACM, 2010 DOI:10.1145/1873951.1874246
Baltrusaitis T, Robinson P, Morency L P. OpenFace: an open source facial behavior analysis toolkit. In: Proceedings of IEEE Winter Conference on Applications of Computer Vision. IEEE, 2016
Kingma D P, Ba J. Adam: a method for stochastic optimization. In: Proceedings of 3rd International Conference for Learning Representations. San Diego, 2015
Lin L I K. A concordance correlation coefficient to evaluate reproducibility. Biometrics, 1989, 45(1): 255 DOI:10.2307/2532051