2021, 3(1): 43-54

Published Date: 2021-2-20    DOI: 10.1016/j.vrih.2020.12.002

Self-attention transfer networks for speech emotion recognition

Abstract

Background
A crucial element of human-machine interaction, the automatic detection of emotional states from human speech has long been regarded as a challenging task for machine learning models. One vital challenge in speech emotion recognition (SER) is learning robust and discriminative representations from speech. Although machine learning methods have been widely applied in SER research, the limited amount of annotated data has become a bottleneck that impedes the wider application of data-hungry techniques such as deep neural networks. To address this issue, we present a deep learning method that combines knowledge transfer and self-attention for SER. We apply the log-Mel spectrogram with deltas and delta-deltas as the input. Moreover, given that emotions are time-dependent, we apply temporal convolutional neural networks (TCNs) to model the variations in emotions over time. We further introduce an attention transfer mechanism based on a self-attention algorithm to learn long-term dependencies. The proposed self-attention transfer network (SATN) uses attention transfer to learn attention weights from a speech recognition task and then transfers this knowledge to SER. An evaluation on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset demonstrates the effectiveness of the proposed model.
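Two of the ingredients named above can be made concrete. First, the three-channel input: a log-Mel spectrogram stacked with its deltas and delta-deltas. The sketch below is illustrative only; the Mel-band count, sampling rate, and use of librosa are assumptions, not the paper's confirmed configuration.

```python
# Minimal sketch of the described input features; sr and n_mels are
# assumed values, not the paper's reported settings.
import numpy as np
import librosa

def log_mel_with_deltas(path, sr=16000, n_mels=40):
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)                # static log-Mel channel
    delta = librosa.feature.delta(log_mel, order=1)   # deltas
    delta2 = librosa.feature.delta(log_mel, order=2)  # delta-deltas
    return np.stack([log_mel, delta, delta2], axis=0) # shape (3, n_mels, frames)
```

Second, the attention-transfer objective: a teacher trained on speech recognition produces self-attention maps over time, and the SER student is penalised for deviating from them. The projection matrices and the normalised-MSE loss below are illustrative choices in the spirit of Zagoruyko and Komodakis-style attention transfer, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def self_attention_map(x, w_q, w_k):
    # Scaled dot-product attention weights over time: x is (batch, T, d),
    # w_q and w_k are (d, d_k) projections; the result is (batch, T, T).
    q, k = x @ w_q, x @ w_k
    return torch.softmax(q @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)

def attention_transfer_loss(student_map, teacher_map):
    # MSE between L2-normalised attention maps; the teacher map comes from
    # the speech-recognition model, the student map from the SER model.
    s = F.normalize(student_map.flatten(start_dim=1), dim=1)
    t = F.normalize(teacher_map.flatten(start_dim=1), dim=1)
    return F.mse_loss(s, t)
```

In training, such a loss would typically be added to the SER classification objective with a weighting coefficient, so the student both classifies emotions and mimics where the recognition teacher attends.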

Keywords

Speech emotion recognition; Attention transfer; Self-attention; Temporal convolutional neural networks (TCNs)

Cite this article

Ziping ZHAO, Keru WANG, Zhongtian BAO, Zixing ZHANG, Nicholas CUMMINS, Shihuang SUN, Haishuai WANG, Jianhua TAO, Björn W. SCHULLER. Self-attention transfer networks for speech emotion recognition. Virtual Reality & Intelligent Hardware, 2021, 3(1): 43-54. DOI: 10.1016/j.vrih.2020.12.002
