
2021, 3(1): 76-86

Published Date: 2021-02-20  DOI: 10.1016/j.vrih.2020.10.004

Frustration recognition from speech during game interaction using wide residual networks


Although frustration is a common emotional reaction while playing games, an excessive level of frustration can negatively impact a user's experience, discouraging them from further game interactions. The automatic detection of frustration enables the development of adaptive systems that tailor a game to a user's specific needs through real-time difficulty adjustment, thereby optimizing the player's experience and supporting game success. To this end, we present a speech-based approach for the automatic detection of frustration during game interactions, a specific task that remains under-explored in research.
The experiments were performed on the Multimodal Game Frustration Database (MGFD), an audiovisual dataset, collected within a Wizard-of-Oz framework, that is specially tailored to investigate verbal and facial expressions of frustration during game interactions. We explored the performance of a variety of acoustic feature sets, including Mel-Spectrograms, Mel-Frequency Cepstral Coefficients (MFCCs), and the low-dimensional knowledge-based acoustic feature set eGeMAPS. Motivated by the continual improvements that convolutional neural networks (CNNs) have achieved in speech recognition tasks, and unlike the MGFD baseline, which relies on a Long Short-Term Memory (LSTM) architecture and a Support Vector Machine (SVM) classifier, in the present work we consider typical CNNs, including ResNet, VGG, and AlexNet. Furthermore, given the unresolved debate on the suitability of shallow and deep networks, we also examine the performance of two of the latest deep CNNs: WideResNet and EfficientNet.
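To illustrate the kind of acoustic front end described above, the following is a minimal NumPy sketch of log-Mel-spectrogram extraction. All parameters here (16 kHz audio, 512-sample frames, 40 Mel bands) are illustrative assumptions, not the configuration used in the paper, which would typically rely on a dedicated feature-extraction toolkit.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(signal, sr=16000, n_fft=512, hop=160, n_mels=40):
    # Frame the signal and apply a Hann window.
    window = np.hanning(n_fft)
    frames = np.array([signal[s:s + n_fft] * window
                       for s in range(0, len(signal) - n_fft + 1, hop)])
    # Power spectrum of each frame: (n_frames, n_fft // 2 + 1).
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2

    # Triangular Mel filterbank spanning 0 Hz to the Nyquist frequency.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge

    # Log compression; the small constant avoids log(0).
    return np.log(power @ fb.T + 1e-10)  # (n_frames, n_mels)

# Example: one second of a 440 Hz tone.
t = np.arange(16000) / 16000.0
feats = log_mel_spectrogram(np.sin(2 * np.pi * 440.0 * t))
print(feats.shape)  # (97, 40): 97 frames, 40 Mel bands
```

The resulting time-frequency matrix is the kind of two-dimensional input that CNNs such as ResNet, VGG, or WideResNet consume as a single-channel image.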
Our best result, achieved with WideResNet and Mel-Spectrogram features, increases the system performance from 58.8% unweighted average recall (UAR) to 93.1% UAR for speech-based automatic frustration recognition.
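Unweighted average recall (UAR), the metric reported above, is the mean of the per-class recalls, so each class contributes equally regardless of its size; this makes it a common choice for imbalanced emotion-recognition data. A minimal, self-contained sketch with hypothetical labels:

```python
def uar(y_true, y_pred):
    """Unweighted average recall: mean of per-class recalls."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        idx = [i for i, y in enumerate(y_true) if y == c]  # samples of class c
        hits = sum(1 for i in idx if y_pred[i] == c)       # correctly recalled
        recalls.append(hits / len(idx))
    return sum(recalls) / len(classes)

# Toy imbalanced two-class example ("frustrated" vs. "not frustrated"):
y_true = ["frus", "frus", "not", "not", "not", "not"]
y_pred = ["frus", "not",  "not", "not", "not", "not"]
print(uar(y_true, y_pred))  # 0.75 = (1/2 + 4/4) / 2
```

Note that plain accuracy on this toy example would be 5/6 ≈ 0.83, higher than the UAR of 0.75, because accuracy is dominated by the majority class.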


Frustration recognition; WideResNets; Machine learning

Cite this article

Meishu SONG, Adria MALLOL-RAGOLTA, Emilia PARADA-CABALEIRO, Zijiang YANG, Shuo LIU, Zhao REN, Ziping ZHAO, Björn W. SCHULLER. Frustration recognition from speech during game interaction using wide residual networks. Virtual Reality & Intelligent Hardware, 2021, 3(1): 76–86. DOI: 10.1016/j.vrih.2020.10.004



