Multi-scale discrepancy adversarial network for cross-corpus speech emotion recognition
Key Laboratory of Child Development and Learning Science of Ministry of Education, Research Center for Learning Science, Southeast University, Nanjing 210096, China
Abstract
Keywords: Human-computer interaction; Cross-corpus speech emotion recognition; Hierarchical discriminators; Domain adaptation
Content

Table 1 Overview of the three speech emotion corpora

Corpus | Language | Subjects | Utterances | Emotions
---|---|---|---|---
IEMOCAP | English | 10 (5 male) | 5527 | Neutral, Happy, Angry, Sad
CASIA | Chinese | 4 (2 male) | 800 | Neutral, Happy, Angry, Sad
MSP-improv | English | 12 (6 male) | 1282 | Neutral, Happy, Angry, Sad
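The results tables below report two complementary scores: weighted accuracy (WA), the overall proportion of correctly classified test utterances, and unweighted accuracy (UA), the recall averaged over the four emotion classes. When the target test set is class-balanced the two scores coincide, which is why WA and UA are identical in every row where CASIA is the target corpus. A minimal sketch of how the two metrics are typically computed (illustrative code, not taken from the paper):

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

def wa_ua(y_true, y_pred):
    """Weighted accuracy = overall accuracy; unweighted accuracy = macro-averaged recall."""
    wa = accuracy_score(y_true, y_pred)
    ua = recall_score(y_true, y_pred, average="macro")
    return wa, ua

# Toy example with the four classes used here (0=Neutral, 1=Happy, 2=Angry, 3=Sad)
y_true = np.array([0, 0, 1, 2, 3, 3])
y_pred = np.array([0, 1, 1, 2, 3, 0])
print(wa_ua(y_true, y_pred))  # WA ≈ 0.667, UA = 0.75
```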
Table 2 Cross-corpus recognition results of TCA, DANN, and MSDA

Source | Target | TCA WA (%) | TCA UA (%) | DANN WA (%) | DANN UA (%) | MSDA WA (%) | MSDA UA (%)
---|---|---|---|---|---|---|---
CASIA | IEMOCAP | 30.17 | 28.80 | 29.56 | 30.37 | 34.88 | 32.51
IEMOCAP | CASIA | 31.25 | 31.25 | 29.87 | 29.87 | 40.13 | 40.13
CASIA | MSP-improv | 28.16 | 28.83 | 32.10 | 30.77 | 37.19 | 29.70
MSP-improv | CASIA | 29.25 | 29.25 | 29.38 | 29.38 | 37.38 | 37.38
IEMOCAP | MSP-improv | 30.21 | 28.25 | 33.25 | 30.03 | 43.43 | 34.59
MSP-improv | IEMOCAP | 29.31 | 27.15 | 33.25 | 28.94 | 36.86 | 33.15
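Of the two baselines in Table 2, TCA is a classical kernel-based feature transformation, while DANN learns domain-invariant representations by training a domain classifier through a gradient reversal layer. As a point of reference, the following PyTorch sketch shows the generic DANN mechanism; it is a standard illustration of the baseline, not the exact configuration evaluated here, and the layer sizes and feature dimension are assumptions:

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; scales gradients by -lambda in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class DANN(nn.Module):
    """Shared encoder, emotion classifier, and adversarial domain classifier."""
    def __init__(self, feat_dim=512, num_emotions=4, hidden=256):
        # feat_dim is a placeholder for the utterance-level acoustic feature dimension
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())
        self.emotion_head = nn.Linear(hidden, num_emotions)
        self.domain_head = nn.Linear(hidden, 2)  # source vs. target

    def forward(self, x, lambd=1.0):
        h = self.encoder(x)
        emo_logits = self.emotion_head(h)
        dom_logits = self.domain_head(GradReverse.apply(h, lambd))
        return emo_logits, dom_logits
```

The emotion classifier is trained on labelled source utterances only, while the domain classifier sees both domains; the reversed gradient pushes the encoder towards features the domain classifier cannot separate.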
Table 3 Cross-corpus recognition results of Local D, Hybrid D, and MSDA

Source | Target | Local D WA (%) | Local D UA (%) | Hybrid D WA (%) | Hybrid D UA (%) | MSDA WA (%) | MSDA UA (%)
---|---|---|---|---|---|---|---
CASIA | IEMOCAP | 31.51 | 32.85 | 31.01 | 30.76 | 34.88 | 32.51
IEMOCAP | CASIA | 32.87 | 32.87 | 32.87 | 32.87 | 40.13 | 40.13
CASIA | MSP-improv | 32.43 | 27.85 | 32.43 | 27.85 | 37.19 | 29.70
MSP-improv | CASIA | 33.63 | 33.63 | 33.63 | 33.63 | 37.38 | 37.38
IEMOCAP | MSP-improv | 33.42 | 33.05 | 33.42 | 33.05 | 43.43 | 34.59
MSP-improv | IEMOCAP | 31.13 | 32.16 | 35.86 | 30.67 | 36.86 | 33.15
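The ablation in Table 3 contrasts a single discriminator operating at one scale (Local D or Hybrid D) with the full multi-scale model. Purely as an illustration of how discrepancy signals at different temporal scales can be combined; this sketch is an assumption, not the paper's architecture, and all module names and sizes are hypothetical; a frame-level and an utterance-level domain discriminator might be trained jointly as follows:

```python
import torch
import torch.nn as nn

class MultiScaleDomainDiscriminators(nn.Module):
    """Illustrative only: one domain discriminator on frame-level features and one
    on the pooled utterance-level feature; their losses are combined."""
    def __init__(self, frame_dim=128, utt_dim=128):
        super().__init__()
        self.local_d = nn.Sequential(nn.Linear(frame_dim, 64), nn.ReLU(), nn.Linear(64, 2))
        self.global_d = nn.Sequential(nn.Linear(utt_dim, 64), nn.ReLU(), nn.Linear(64, 2))

    def forward(self, frame_feats, utt_feat):
        # frame_feats: (batch, time, frame_dim); utt_feat: (batch, utt_dim)
        return self.local_d(frame_feats), self.global_d(utt_feat)

def combined_domain_loss(local_logits, global_logits, domain_label, alpha=1.0):
    """Cross-entropy at both scales; domain_label is 0 for source, 1 for target."""
    ce = nn.CrossEntropyLoss()
    b, t, _ = local_logits.shape
    frame_labels = domain_label.unsqueeze(1).expand(b, t).reshape(-1)
    loss_local = ce(local_logits.reshape(b * t, -1), frame_labels)
    loss_global = ce(global_logits, domain_label)
    return loss_local + alpha * loss_global
```

In a full adversarial setup these discriminators would be trained against the feature encoder, for example through the gradient reversal mechanism sketched above.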
Table 4 Cross-corpus recognition results of the hybrid model without a discriminator (Hybrid without D) and MSDA

Source | Target | Hybrid without D WA (%) | Hybrid without D UA (%) | MSDA WA (%) | MSDA UA (%)
---|---|---|---|---|---
CASIA | IEMOCAP | 27.32 | 24.45 | 34.88 | 32.51
IEMOCAP | CASIA | 30.13 | 30.13 | 40.13 | 40.13
CASIA | MSP-improv | 30.21 | 27.13 | 37.19 | 29.70
MSP-improv | CASIA | 27.37 | 27.37 | 37.38 | 37.38
IEMOCAP | MSP-improv | 28.16 | 25.08 | 43.43 | 34.59
MSP-improv | IEMOCAP | 33.62 | 24.96 | 36.86 | 33.15