
Emotion recognition for human-computer interaction

Emotion recognition aims to quantify, describe, and recognize different emotional states through the behavioral and physiological responses generated by emotional expressions. It is an important field due to its wide applications in many tasks, such as dialogue generation, social media analysis, and intelligent systems. It builds a harmonious human-computer environment by enabling computer systems and devices to recognize and interpret human affect. Emotion recognition models are built using multimodal information such as audio, video, and text. It is important to consider the emotional characteristics of humans in the design and presentation of intelligent interaction. We have selected seven papers that provide the latest updates on the development of emotion recognition technology, covering micro-expression spotting and recognition, speech emotion recognition, physiological signal emotion recognition, emotional dialog generation, and so on.

News

The expert committee of the Shandong Virtual Reality Manufacturing Innovation Center established in Qingdao

2021, 3(1) : 0-1

DOI:10.3724/SP.J.2096-5796.2021.01.01


Editorial

Emotion recognition for human-computer interaction

2021, 3(1) : 1-2

DOI:10.3724/SP.J.2096-5796.2021.01.02


Review

Review of micro-expression spotting and recognition in video sequences

2021, 3(1) : 1-17

DOI:10.1016/j.vrih.2020.10.003

Facial micro-expressions are short and imperceptible expressions that involuntarily reveal the true emotions a person may be attempting to suppress, hide, disguise, or conceal. Such expressions can reflect a person's real emotions and have a wide range of applications in public safety and clinical diagnosis. The analysis of facial micro-expressions in video sequences through computer vision is still relatively recent. In this research, we conduct a comprehensive review of the databases and methods used in micro-expression spotting and recognition, and summarize advanced technologies in this area. In addition, we discuss challenges that remain unresolved alongside future work to be completed in the field of micro-expression analysis.

Article

Emotional dialog generation via multiple classifiers based on a generative adversarial network

2021, 3(1) : 18-32

DOI:10.1016/j.vrih.2020.12.001

Background
Human-machine dialog generation is an essential topic of research in the field of natural language processing. Generating high-quality, diverse, fluent, and emotional conversation is a challenging task. Based on continuing advancements in artificial intelligence and deep learning, new methods have come to the forefront in recent times. In particular, the end-to-end neural network model provides an extensible conversation generation framework that has the potential to enable machines to understand semantics and automatically generate responses. However, neural network models come with their own set of questions and challenges. The basic conversational model framework tends to produce universal, meaningless, and relatively "safe" answers.
Methods
Based on generative adversarial networks (GANs), a new emotional dialog generation framework called EMC-GAN is proposed in this study to address the task of emotional dialog generation. The proposed model comprises a generative and three discriminative models. The generator is based on the basic sequence-to-sequence (Seq2Seq) dialog generation model, and the aggregate discriminative model for the overall framework consists of a basic discriminative model, an emotion discriminative model, and a fluency discriminative model. The basic discriminative model distinguishes generated fake sentences from real sentences in the training corpus. The emotion discriminative model evaluates whether the emotion conveyed via the generated dialog agrees with a pre-specified emotion, and directs the generative model to generate dialogs that correspond to the category of the pre-specified emotion. Finally, the fluency discriminative model assigns a score to the fluency of the generated dialog and guides the generator to produce more fluent sentences.
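As a rough illustration of how the three discriminator signals described above could be combined into a single training signal for the generator, the following PyTorch sketch scores a batch of generated-reply embeddings. All module names, dimensions, and the multiplicative reward are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of the EMC-GAN-style training signal: the generator's
# reward combines scores from three discriminators (real/fake, emotion, fluency).
import torch
import torch.nn as nn

class ScoreHead(nn.Module):
    """Tiny stand-in for a discriminator that scores a sentence embedding."""
    def __init__(self, dim, out_dim=1):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, out_dim))

    def forward(self, x):
        return self.net(x)

dim, batch = 128, 4
basic_d   = ScoreHead(dim)             # real vs. generated
emotion_d = ScoreHead(dim, out_dim=6)  # assumed 6 emotion categories
fluency_d = ScoreHead(dim)             # fluency score

fake_repr = torch.randn(batch, dim)          # stand-in embeddings of generated replies
target_emotion = torch.tensor([2, 0, 5, 1])  # pre-specified emotion per sample

real_prob     = torch.sigmoid(basic_d(fake_repr)).squeeze(-1)
emotion_prob  = torch.softmax(emotion_d(fake_repr), dim=-1)
emotion_match = emotion_prob.gather(1, target_emotion.unsqueeze(1)).squeeze(1)
fluency_score = torch.sigmoid(fluency_d(fake_repr)).squeeze(-1)

# Aggregate reward that would guide the Seq2Seq generator (e.g., via policy gradient).
reward = real_prob * emotion_match * fluency_score
print(reward.shape)  # torch.Size([4])
```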
Results
Based on the experimental results, this study confirms the superiority of the proposed model over similar existing models with respect to emotional accuracy, fluency, and consistency.
Conclusions
The proposed EMC-GAN model is capable of generating consistent, smooth, and fluent dialog that conveys pre-specified emotions, and exhibits better performance with respect to emotional accuracy, consistency, and fluency compared to its competitors.
NAS-HR: Neural architecture search for heart rate estimation from face videos

2021, 3(1) : 33-42

DOI:10.1016/j.vrih.2020.10.002

Background
In anticipation of its great potential application to natural human-computer interaction and health monitoring, heart-rate (HR) estimation based on remote photoplethysmography has recently attracted increasing research attention. Whereas the recent deep-learning-based HR estimation methods have achieved promising performance, their computational costs remain high, particularly in mobile-computing scenarios.
Methods
We propose a neural architecture search approach for HR estimation to automatically search for a lightweight network that can achieve even higher accuracy than a complex network while reducing the computational cost. First, we define the regions of interest (ROIs) based on face landmarks and extract the raw temporal pulse signals from the R, G, and B channels in each ROI. Pulse-related signals are then extracted using a plane-orthogonal-to-skin algorithm and combined with the R and G channel signals to create a spatial-temporal map. Finally, a differentiable architecture search approach is used for the network-structure search.
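A minimal sketch of the spatial-temporal map construction described above, using a simplified plane-orthogonal-to-skin (POS) projection. Array shapes, the normalization, and the stacking order are assumptions for illustration only.

```python
# Build a spatial-temporal map from per-ROI RGB traces with a simplified POS projection.
import numpy as np

def pos_projection(rgb):
    """rgb: (T, 3) mean R, G, B values of one ROI over T frames -> (T,) pulse signal."""
    norm = rgb / (rgb.mean(axis=0, keepdims=True) + 1e-8)   # temporal normalization
    s1 = norm[:, 1] - norm[:, 2]                            # G - B
    s2 = norm[:, 1] + norm[:, 2] - 2 * norm[:, 0]           # G + B - 2R
    alpha = s1.std() / (s2.std() + 1e-8)
    return s1 + alpha * s2

T, n_rois = 300, 25
roi_traces = np.random.rand(n_rois, T, 3)   # stand-in for per-ROI mean RGB over time

rows = []
for roi in roi_traces:
    pulse = pos_projection(roi)
    # Stack the POS pulse with the raw R and G traces, as the abstract describes.
    rows.append(np.stack([pulse, roi[:, 0], roi[:, 1]], axis=0))

st_map = np.concatenate(rows, axis=0)       # (n_rois * 3, T) spatial-temporal map
print(st_map.shape)                         # (75, 300)
```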
Results
Compared with the state-of-the-art methods on the public-domain VIPL-HR and PURE databases, our method achieves better HR estimation performance in terms of several evaluation metrics while requiring a much lower computational cost.
Self-attention transfer networks for speech emotion recognition

2021, 3(1) : 43-54

DOI:10.1016/j.vrih.2020.12.002

Background
A crucial element of human-machine interaction, the automatic detection of emotional states from human speech has long been regarded as a challenging task for machine learning models. One vital challenge in speech emotion recognition (SER) is learning robust and discriminative representations from speech. Although machine learning methods have been widely applied in SER research, the inadequate amount of available annotated data has become a bottleneck impeding the extended application of such techniques (e.g., deep neural networks). To address this issue, we present a deep learning method that combines knowledge transfer and self-attention for SER tasks. Herein, we apply the log-Mel spectrogram with deltas and delta-deltas as inputs. Moreover, given that emotions are time-dependent, we apply temporal convolutional neural networks to model the variations in emotions. We further introduce an attention transfer mechanism, which is based on a self-attention algorithm to learn long-term dependencies. The self-attention transfer network (SATN) in our proposed approach takes advantage of attention transfer to learn attention from speech recognition, followed by transferring this knowledge into SER. An evaluation built on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset demonstrates the effectiveness of the proposed model.
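A brief sketch of the input representation mentioned above: a log-Mel spectrogram stacked with its deltas and delta-deltas. The file name and parameter values are placeholders, not the paper's settings.

```python
# Compute a 3-channel log-Mel + delta + delta-delta input for a temporal CNN front end.
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)   # hypothetical input file
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
log_mel = librosa.power_to_db(mel)
delta  = librosa.feature.delta(log_mel)
delta2 = librosa.feature.delta(log_mel, order=2)

# Shape (3, n_mels, frames): three "channels" fed to the network.
features = np.stack([log_mel, delta, delta2], axis=0)
print(features.shape)
```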
Learning long-term temporal contexts using skip RNN for continuous emotion recognition

2021, 3(1) : 55-64

DOI:10.1016/j.vrih.2020.11.005

Background
Continuous emotion recognition as a function of time assigns emotional values to every frame in a sequence. Incorporating long-term temporal context information is essential for continuous emotion recognition tasks.
Methods
For this purpose, we employ a window of feature frames in place of a single frame as inputs to strengthen the temporal modeling at the feature level. The ideas of frame skipping and temporal pooling are utilized to alleviate the resulting redundancy. At the model level, we leverage the skip recurrent neural network to model the long-term temporal variability by skipping trivial information for continuous emotion recognition.
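A rough sketch of the feature-level idea above: take a window of frames around each target frame, skip frames inside the window to reduce redundancy, and mean-pool over time. The window size and skip factor are illustrative assumptions.

```python
# Windowed, frame-skipped, temporally pooled context features for each frame.
import numpy as np

def windowed_features(frames, center, window=20, skip=2):
    """frames: (T, D) frame-level features -> pooled (D,) context for one target frame."""
    T = frames.shape[0]
    lo, hi = max(0, center - window), min(T, center + window + 1)
    ctx = frames[lo:hi:skip]    # frame skipping inside the window
    return ctx.mean(axis=0)     # temporal pooling

T, D = 1000, 88                 # e.g., 88-dimensional acoustic features per frame
frames = np.random.randn(T, D)
pooled = np.stack([windowed_features(frames, t) for t in range(T)])
print(pooled.shape)             # (1000, 88): one context vector per frame
```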
Results
The experimental results using the AVEC 2017 database demonstrate that our proposed methods lead to a performance improvement. Further, the skip long short-term memory (LSTM) model can focus on the critical emotional states during training, thereby achieving better performance than the LSTM model and other methods.
Multi-scale discrepancy adversarial network for cross-corpus speech emotion recognition

2021, 3(1) : 65-75

DOI:10.1016/j.vrih.2020.11.006

Background
One of the most critical issues in human-computer interaction applications is recognizing human emotions based on speech. In recent years, the challenging problem of cross-corpus speech emotion recognition (SER) has generated extensive research. Nevertheless, the domain discrepancy between training data and testing data remains a major challenge to achieving improved system performance.
Methods
This paper introduces a novel multi-scale discrepancy adversarial (MSDA) network for conducting multiple-timescale domain adaptation for cross-corpus SER, i.e., integrating domain discriminators at hierarchical levels into the emotion recognition framework to mitigate the gap between the source and target domains. Specifically, we extract two kinds of speech features, i.e., handcrafted features and deep features, at three timescales: global, local, and hybrid. In each timescale, the domain discriminator and the feature extractor compete against each other to learn features that minimize the discrepancy between the two domains by fooling the discriminator.
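The adversarial competition between feature extractor and domain discriminator is commonly implemented with a gradient-reversal layer; the sketch below shows that generic mechanism, not the MSDA code itself. Layer sizes and class counts are assumptions.

```python
# Generic adversarial domain adaptation: reversed gradients push the feature
# extractor toward representations that confuse the domain discriminator.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

feat_extractor = nn.Sequential(nn.Linear(88, 64), nn.ReLU())
emotion_clf    = nn.Linear(64, 4)   # assumed 4 emotion classes
domain_disc    = nn.Linear(64, 2)   # source vs. target corpus

x = torch.randn(8, 88)              # stand-in acoustic features
feats = feat_extractor(x)
emotion_logits = emotion_clf(feats)
domain_logits  = domain_disc(GradReverse.apply(feats, 1.0))
print(emotion_logits.shape, domain_logits.shape)
```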
Results
Extensive experiments on cross-corpus and cross-language SER were conducted on a combined dataset comprising one Chinese dataset and two English datasets commonly used in SER. The MSDA benefits from the strong discriminative power provided by the adversarial process, in which three discriminators work in tandem with an emotion classifier. Accordingly, the MSDA achieves the best performance over all other baseline methods.
Conclusions
The proposed architecture was tested on a combination of one Chinese and two English datasets. The experimental results demonstrate the superiority of our powerful discriminative model for solving cross-corpus SER.
Frustration recognition from speech during game interaction using wide residual networks

2021, 3(1) : 76-86

DOI:10.1016/j.vrih.2020.10.004

Background
Although frustration is a common emotional reaction while playing games, an excessive level of frustration can negatively impact a user's experience, discouraging them from further game interactions. The automatic detection of frustration can enable the development of adaptive systems that can adapt a game to a user's specific needs through real-time difficulty adjustment, thereby optimizing the player's experience and guaranteeing game success. To this end, we present a speech-based approach for the automatic detection of frustration during game interactions, a specific task that remains under-explored in research.
Method
The experiments were performed on the Multimodal Game Frustration Database (MGFD), an audiovisual dataset, collected within the Wizard-of-Oz framework, that is specially tailored to investigate verbal and facial expressions of frustration during game interactions. We explored the performance of a variety of acoustic feature sets, including Mel-Spectrograms, Mel-Frequency Cepstral Coefficients (MFCCs), and the low-dimensional knowledge-based acoustic feature set eGeMAPS. Motivated by the continual improvements that convolutional neural networks (CNNs) have brought to speech recognition tasks, and unlike the MGFD baseline, which is based on the Long Short-Term Memory (LSTM) architecture and a Support Vector Machine (SVM) classifier, in the present work we consider typical CNNs, including ResNet, VGG, and AlexNet. Furthermore, given the unresolved debate on the suitability of shallow and deep networks, we also examine the performance of two of the latest deep CNNs: WideResNet and EfficientNet.
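Purely as an illustration of the kind of setup described above, the sketch below feeds a Mel-spectrogram into an off-the-shelf WideResNet for binary frustration classification. The torchvision model, the channel-repetition trick, and all shapes are assumptions about one plausible configuration, not the authors' system.

```python
# Mel-spectrogram "image" classified by a WideResNet (frustrated vs. not frustrated).
import torch
import torch.nn as nn
from torchvision.models import wide_resnet50_2

model = wide_resnet50_2(weights=None)
model.fc = nn.Linear(model.fc.in_features, 2)   # two classes: frustration / no frustration

mel = torch.randn(4, 1, 64, 300)                # (batch, 1, n_mels, frames), dummy data
x = mel.repeat(1, 3, 1, 1)                      # repeat the channel to fit the RGB input
logits = model(x)
print(logits.shape)                             # torch.Size([4, 2])
```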
Results
Our best result, achieved with WideResNet and Mel-Spectrogram features, increases the system performance from 58.8% unweighted average recall (UAR) to 93.1% UAR for speech-based automatic frustration recognition.