

2021, 3(1): 65-75

Published Date: 2021-2-20 DOI: 10.1016/j.vrih.2020.11.006


One of the most critical issues in human-computer interaction applications is recognizing human emotions based on speech. In recent years, the challenging problem of cross-corpus speech emotion recognition (SER) has generated extensive research. Nevertheless, the domain discrepancy between training data and testing data remains a major challenge to achieving improved system performance.
This paper introduces a novel multi-scale discrepancy adversarial (MSDA) network for conducting multiple-timescale domain adaptation for cross-corpus SER, i.e., integrating domain discriminators of hierarchical levels into the emotion recognition framework to mitigate the gap between the source and target domains. Specifically, we extract two kinds of speech features, i.e., handcrafted features and deep features, from three timescales of global, local, and hybrid levels. In each timescale, the domain discriminator and the feature extractor compete against each other to learn features that minimize the discrepancy between the two domains by fooling the discriminator.
Extensive experiments on cross-corpus and cross-language SER were conducted on a combined dataset comprising one Chinese dataset and two English datasets commonly used in SER. The MSDA benefits from the strong discriminative power provided by the adversarial process, in which three discriminators work in tandem with an emotion classifier. Accordingly, the MSDA achieves the best performance among all the baseline methods.
The proposed architecture was tested on a combination of one Chinese and two English datasets. The experimental results demonstrate the superiority of our powerful discriminative model for solving cross-corpus SER.


1 Introduction
Accurate emotion recognition is a significant challenge for human-computer interaction scenarios because a machine cannot understand the emotional state of the speaker. It thus fails to interpret the speaker's emotions. Human speech conveys a rich stream of information that naturally communicates emotions between humans. Consequently, a new research field—speech emotion recognition (SER)—has emerged. It is a process that utilizes audio signals to decode the emotional states of individuals[1-3].
In addition to emotion, other factors, including environmental noise, spoken language, and recording device quality, change across datasets and can also influence recognition performance[4]. Moreover, most existing SER systems are trained and tested on data from a single labeled corpus. In most scenarios, they focus on a single language owing to the scarcity of data and the high cost of annotating new data. However, in actual human-computer interaction scenarios, an immense dataset bias exists between diverse conditions, which reduces generalization to novel data. Accordingly, the performance of such systems declines. This issue has motivated researchers to explore more robust models for cross-corpus SER[4-6].
To tackle the data scarcity problem in cross-corpus SER, recent methods have focused on data augmentation for training the SER models[7-10]. One obvious flaw is that the generated new data do not follow the same distribution as those in the original database; therefore, noise can impede the recognition performance. Another mainstream method adopts the concept of transfer learning to create a more universal model by minimizing the distance of distributions among training data (source domain) and testing data (target domain)[11,12].
Speech signals have unique sequential patterns, and emotional information is distributed unevenly along the signal. In most conventional studies, to capture temporal clues of speech, the SER is regarded as a dynamic or static classification problem corresponding to two timescales, namely the sub-utterance and the whole utterance[13-16]. Accordingly, domain alignment can be separately implemented on the above two timescales. However, this approach may not be sufficiently strong to cope with the observed high variances.
Inspired by these observations, we propose a novel multi-scale discrepancy adversarial (MSDA) framework to mitigate the feature distribution discrepancy in different timescales across different corpora. The MSDA is characterized by three levels of discriminators from bottom to top, which are fed with global, local, and hybrid levels of features from the labeled source domain and unlabeled target domain. Specifically, each utterance is chunked into a fixed number of segments with a 50% overlapping ratio to form global and local speech signals, from which low-level handcrafted features and high-level deep speech features are extracted in order. Here, we establish an intra-utterance structural relation for all segments within one utterance using an attention-based bidirectional long short-term memory (Bi-LSTM) model. In addition to the two sub-level structures, we craft a hybrid-level representation that blends the global and local levels of features. These three levels of features provide complementary views and remedy one another's weaknesses. Subsequently, domain discriminative processes on the above three levels are implemented to synergistically enhance the discriminative ability of the model. With the distributions aligned in this way, the emotion classifier can perform better across dissimilar datasets.
Our contributions can be summarized as follows: (1) We introduce a novel method to extract emotional speech features from two kinds of timescale, i.e., the local versus the global timescales, (2) We further propose a novel hierarchical discrepancy adversarial network that eliminates the gap across corpus and languages while preserving emotion-related information, and (3) We performed extensive experiments on one Chinese dataset and two English datasets and demonstrated the superiority of our model.
2 Related work
Cross-corpus SER has received increased attention owing to its extensive real-world applicability[6]. The distribution gap between the source and target domains is one of the most fundamental factors leading to poor recognition performance. Intuitively, it is crucial to discover a universal feature representation across the two domains such that the distribution gap is alleviated as much as possible.
To better cope with the observed high variances, feature normalization schemes have been suggested in earlier research. Schuller et al. performed speaker normalization (SN), corpus normalization (CN), and speaker corpus normalization (SCN) of each speech feature to conduct cross-corpus evaluation[5]. To reduce the dissimilarity between domains, Pan et al. proposed a transfer component analysis (TCA) framework in a reproducing kernel Hilbert space using the maximum mean discrepancy[12]. By projecting data onto the learned transfer components, a representation that generalizes to out-of-sample data can be learned in the subspace.
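The maximum mean discrepancy mentioned above can be illustrated with a minimal sketch. The version below assumes a linear kernel, in which case the squared discrepancy reduces to the squared distance between the two domain means; TCA itself works in a reproducing kernel Hilbert space with a learned projection, which is not reproduced here. The function name `linear_mmd2` and the synthetic data are illustrative assumptions.

```python
import numpy as np

def linear_mmd2(source, target):
    """Squared MMD under a linear kernel (an assumption made for brevity).

    With a linear kernel the statistic equals the squared Euclidean
    distance between the mean feature vectors of the two domains.
    """
    delta = source.mean(axis=0) - target.mean(axis=0)
    return float(delta @ delta)

rng = np.random.default_rng(0)
src = rng.normal(0.0, 1.0, size=(200, 8))        # synthetic "source" features
tgt_same = rng.normal(0.0, 1.0, size=(200, 8))   # same distribution as src
tgt_shift = rng.normal(1.0, 1.0, size=(200, 8))  # mean-shifted distribution

print("same distribution:", linear_mmd2(src, tgt_same))
print("shifted distribution:", linear_mmd2(src, tgt_shift))
```

A larger value indicates a larger gap between the two domains, which is the quantity that distribution alignment methods try to shrink.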
In contrast to traditional feature alignment methods, deep convolutional neural networks (DCNN) have been shown to be able to automatically extract salient emotional features from speech signals[14,17-19]. In recent years, inspired by the adversarial idea behind generative adversarial nets (GAN)[20], the authors of [7-9] implemented generative transfer models to identify more generalized representations across different speech corpora or languages. Sahu et al. used Gaussian distribution sample points to train the vanilla GAN and conditional GAN models to generate synthetic representations that emulate the original ones[7]. The synthetically generated samples, along with the real data, are then used for classification on an external corpus. Nevertheless, unwanted noise is also introduced while generating the data samples. The same problem exists in [3,10,17,19], which apply artificial data in real-world conditions.
Recently, various approaches that use a domain classifier[21,22] to measure the domain discrepancy have been proposed; to this end, they set up a min-max game with the feature extractor. This adversarial process operates on two authentic data distributions, those of the source and target domains, and thus preserves their initial data attributes. Discriminative adversarial learning has been implemented in the domain adversarial neural network (DANN)[21] to learn a common representation across training and testing data. On this basis, the authors of [11] proposed a solution to consistently extract useful information from the available unlabeled data, which improves emotion recognition performance across different speech corpora.
Speech signals have a unique sequential nature; hence, SER can be regarded as a dynamic or static classification problem according to two timescales, namely frame-based and turn-based[13]. Fayek et al. chose frame-level speech segments as input to explore feed-forward and recurrent neural network architectures[14]. Schuller et al. provided a benchmark on nine speech corpora and showed that utterance-level modeling has, on average, a significant advantage over frame-level modeling[15]. In contrast, Jeon et al. developed a two-step approach to make sentence-level decisions[16]. Predictions from three kinds of sub-sentence segments (words, phrases, and time-based segments) within a sentence are first generated and then combined to obtain a sentence-level decision. They demonstrated that time-based segments achieved the best performance and outperformed the sentence-based approach. The emotion recognition process in the above methods is performed within a single timescale, which motivated us to integrate the merits of different timescales of speech signals.
3 Methods
Given a labeled source dataset and an unlabeled target dataset with equal numbers of data samples, the MSDA model is used to capture the multiple-timescale emotional characteristics of speech and then alleviate the feature distribution difference between the two datasets at each corresponding timescale. In what follows, we detail the MSDA model and then apply it to the cross-corpus SER task. Figure 1 shows the MSDA network architecture. Our model has three major modules: (1) three feature extractors, $F_g$, $F_l$, and $F_h$; (2) three domain discriminators, $D_g$, $D_l$, and $D_h$; and (3) a single emotion state classifier $C$.
3.1 Three levels of features
The goal of the feature extractors is to extract more emotionally informative features to improve the SER performance. The entire process involves two phases, that is, handcrafted feature extraction and deep feature learning, which are performed on three hierarchical levels. Specifically, the whole utterance, termed the "global speech signal," is chunked into $N$ segments with an overlap ratio of 50% on the time axis to form a sequence of "local signals," in which we assume that each segment carries the same affective category as the utterance to which it belongs. The IS10 feature set provided in the INTERSPEECH 2010 paralinguistic challenge[22] is extracted for both global and local signals using the openSMILE toolkit[23]. The IS10 feature set contains 1582 features in total, which include a set of acoustic low-level descriptors expressing the emotion of the speaker[24], such as prosody features. After feature extraction, sentences of various lengths are represented by fixed-length vectors for the subsequent Bi-LSTM module. Then, feature-based z-normalization is conducted to independently normalize the source samples and the target samples. The above two levels of features are denoted as $X^g \in \mathbb{R}^{1 \times d}$ and $X^l = [x_1^l; x_2^l; \dots; x_N^l] \in \mathbb{R}^{N \times d}$, respectively, where $d = 1582$ is the dimensionality of the feature vector.
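The chunking and normalization steps described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes that the $N$ segments have equal length and exactly tile the utterance, so that with 50% overlap the hop is half the segment length. The helper names, the 16 kHz sample rate, and the random stand-in features are all assumptions, and the IS10 extraction with openSMILE is not reproduced.

```python
import numpy as np

def split_with_overlap(signal, n_segments):
    """Split a 1-D signal into n_segments equal-length chunks with 50% overlap.

    With 50% overlap the hop is half the segment length, so the covered
    length is hop * (n_segments + 1); trailing samples beyond that are dropped.
    """
    hop = len(signal) // (n_segments + 1)
    seg_len = 2 * hop
    return np.stack([signal[i * hop: i * hop + seg_len] for i in range(n_segments)])

def z_normalize(features, eps=1e-8):
    """Scale every feature column to zero mean and unit variance."""
    mean = features.mean(axis=0, keepdims=True)
    std = features.std(axis=0, keepdims=True)
    return (features - mean) / (std + eps)

# A 3 s utterance at an assumed 16 kHz sample rate, split into N = 5 segments.
utterance = np.arange(48000, dtype=np.float64)
segments = split_with_overlap(utterance, n_segments=5)
print(segments.shape)

# Source and target features are normalized independently of each other.
rng = np.random.default_rng(0)
source_feats = z_normalize(rng.normal(3.0, 2.0, size=(100, 4)))
target_feats = z_normalize(rng.normal(-1.0, 0.5, size=(80, 4)))
```

Normalizing each domain with its own statistics removes corpus-level offsets before any adversarial alignment takes place.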
3.1.1 Global-level feature
To extract the global-level features, we adopt a two-layer convolutional network as the feature extractor $F_g$:
$$G^g = F_g(X^g),$$
where $X^g$ is resized from $1 \times d$ to $14 \times 113 \times 1$. $F_g$ has two convolutional layers, followed by global max pooling and a dropout layer.
3.1.2 Local-level feature
To capture the temporal relational structure and highlight the emotionally informative parts, we employ an attention-based Bi-LSTM model[25] to extract more discriminative speech features from the sequence of segments. Several attention-based Bi-LSTM approaches have been proven to work efficiently in sequence-to-sequence learning tasks[26,27]. The Bi-LSTM has two hidden layers (forward and backward) with cell sizes of
. Let $L(\cdot)$ denote the Bi-LSTM function and $H^l$ denote the encoded Bi-LSTM states. Then, we have
$$H^l = L(x_1^l, x_2^l, \dots, x_N^l).$$
Noting that the emotional information is distributed unevenly over the signal, we introduce an attention score vector $\alpha = [\alpha_1, \dots, \alpha_N]$ to emphasize the different contributions of the features associated with the different speech segments. In this case, we obtain the following deep local-level features $G_{att}^l$:
$$G_{att}^l = \alpha H^l.$$
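The weighted sum $G_{att}^l = \alpha H^l$ can be sketched as below. The text does not state how the attention scores are computed, so this sketch assumes a softmax over dot products with a learned scoring vector `w`; that scoring rule, the state size `h`, and the random stand-in for the Bi-LSTM states are all hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(H, w):
    """Return (alpha, G) with alpha = softmax(H @ w) and G = alpha @ H.

    H has shape (N, h): one encoded state per speech segment.
    w has shape (h,): a hypothetical learned scoring vector.
    """
    scores = H @ w            # one scalar score per segment, shape (N,)
    alpha = softmax(scores)   # attention weights that sum to 1
    return alpha, alpha @ H   # weighted sum of the segment states, shape (h,)

rng = np.random.default_rng(0)
N, h = 5, 16                   # N = 5 segments; h is an illustrative state size
H_l = rng.normal(size=(N, h))  # stand-in for the Bi-LSTM states H^l
w = rng.normal(size=h)
alpha, G_att = attention_pool(H_l, w)
print(alpha.shape, G_att.shape)
```

When all scores are equal, the weights become uniform and the pooled vector reduces to the plain average of the segment states.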
3.1.3 Hybrid-level feature
The above two features complement each other across timescales. To further enhance the richness of the speech features, we also craft a hybrid-level representation that blends the global-level and local-level features and feed it into another deep feature extractor $F_h$ to capture higher-level emotional features $G^h$. We stack $X^g$ and $G_{att}^l$ in the vertical direction to form a hybrid-level feature set $[X^g, G_{att}^l] \in \mathbb{R}^{2 \times d}$ and then resize it to $28 \times 113 \times 1$:
$$G^h = F_h\left([X^g, G_{att}^l]\right).$$
In this case, we finally obtain three levels of features, i.e., $G^g$, $G^l$, and $G^h$, through a progressive feature extraction procedure, which makes use of the advantages of both manually designed perceptual features and automatic feature learning. In addition, the procedure can effectively uncover the structural relations across different timescales.
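The resizing steps used for the convolutional extractors are plain reshapes, since $14 \times 113 = 1582 = d$ and $28 \times 113 = 3164 = 2d$. The sketch below uses placeholder values and assumes row-major ordering, which the text does not specify; it also assumes, as the $2 \times d$ stacking implies, that the local-level attention output has dimensionality $d$.

```python
import numpy as np

d = 1582
X_g = np.zeros((1, d))     # global-level feature vector (placeholder values)
G_att_l = np.ones((1, d))  # local-level attention output (placeholder values)

# Global level: 1 x d -> 14 x 113 x 1, because 14 * 113 = 1582.
global_input = X_g.reshape(14, 113, 1)

# Hybrid level: stack vertically to 2 x d, then resize to 28 x 113 x 1.
hybrid = np.vstack([X_g, G_att_l])
hybrid_input = hybrid.reshape(28, 113, 1)

print(global_input.shape, hybrid_input.shape)
```

Under row-major ordering, the first 14 rows of the hybrid input hold the global-level features and the last 14 rows hold the local-level features.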
3.2 Hierarchical domain discriminators
The discrepancy between the feature distributions of the training and testing datasets leads to apparent misclassification when a model trained on the training dataset is applied to the testing dataset. This motivates us to propose a set of hierarchical domain discriminators, $D_g$, $D_l$, and $D_h$, which act in correspondence with the above three levels of deep speech emotional features. These discriminators compete with the feature extractors while working alongside an emotion classifier within an adversarial process, thereby alleviating the feature distribution gap between the training and testing datasets at each level.
Domain discriminators are, in essence, binary classifiers whose role is to distinguish whether the data come from the source domain or the target domain. We intend to maximize the losses of the domain classifiers. The larger these losses, the more difficult it is to judge which domain the data come from, indicating that the two domains have been brought closer at the distribution level.
We set two domain labels, that is, 0 and 1, to all the source and target data, respectively. Each domain discriminator consists of three fully connected layers coupled with ReLU activation to generate the final outputs. The discrimination loss can be defined as a cross-entropy loss:
$$L_d^g = -\sum_i d_i \log\left(D_g\left(G_S^g, G_T^g\right)\right),$$
$$L_d^l = -\sum_i d_i \log\left(D_l\left(G_S^l, G_T^l\right)\right),$$
$$L_d^h = -\sum_i d_i \log\left(D_h\left(G_S^h, G_T^h\right)\right),$$
where $d_i$ denotes the domain label; $D_g$ and $D_l$ aim to rectify the feature distribution difference between the source data and target data at the global and local levels, respectively; and $D_h$ aims to alleviate the feature distribution difference across the entire hybrid level. Gradient reversal layers (GRLs), located between each feature extractor and its domain discriminator, multiply the gradient by a negative constant during backpropagation-based training. During forward propagation, they act as identity transformations. The reversing operation causes the feature representations of the two domains to converge, thereby reducing the gap across the two domains.
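The behavior of a gradient reversal layer can be sketched without a deep learning framework. The class below is a hypothetical, hand-written illustration with explicit forward and backward methods; in practice the same effect is obtained through a framework's custom-gradient mechanism, and the constant `lam` and the example gradient values are assumptions.

```python
import numpy as np

class GradientReversal:
    """Identity in the forward pass; scales gradients by -lam in the backward pass."""

    def __init__(self, lam):
        self.lam = lam

    def forward(self, x):
        # Features pass through unchanged on their way to the discriminator.
        return x

    def backward(self, grad_output):
        # The gradient coming back from the discriminator is sign-flipped and
        # scaled, so the feature extractor is updated to increase the domain loss.
        return -self.lam * grad_output

grl = GradientReversal(lam=0.5)
features = np.array([1.0, -2.0, 3.0])
grad_from_discriminator = np.array([0.2, 0.4, -0.6])

out = grl.forward(features)
grad_to_extractor = grl.backward(grad_from_discriminator)
print(out, grad_to_extractor)
```

Because only the backward pass changes, the discriminator still learns to separate the domains while the feature extractor receives the opposite signal.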
3.3 Emotion classifier
As the main task in cross-corpus SER, a single emotion classifier is designed to work in tandem with the above three discriminators, which operate at different timescales. The input for the emotion label predictor $C$ is the hybrid speech features $G^h$ extracted in Section 3.1.3. The hybrid speech features $G^h$ and the emotion label predictor $C$ together form a standard feed-forward architecture. $C$ consists of three fully connected layers coupled with ReLU activation to generate the final outputs, one for each emotion category. The hybrid-level features $G_S^h$ of the source data are propagated through the network, and the emotion classification loss is calculated using a cross-entropy measure:
$$L_c = -\sum_i y_i \log\left(C\left(G_S^h\right)\right)_i,$$
where $y_i$ is the ground-truth label of the source data.
The total loss of the network attempts to minimize the classification error of the emotion classifier while maximizing the error of the domain classifiers, which can be formulated as:
$$L = L_c - \lambda\left(L_d^l + L_d^g + L_d^h\right),$$
where $\lambda$ is a regularization multiplier that controls the tradeoff between the two parts of the loss. Meanwhile, $D_g$, $D_l$, and $D_h$ share the same parameter $\lambda$, determined according to the following expression:
$$\lambda_p = \frac{2}{1 + \exp(-\gamma p)} - 1,$$
where $p$ is the training progress, which changes linearly from 0 to 1. Training of the three discriminators and one classifier is performed simultaneously, meaning that the weights are updated simultaneously to compete with each other. As the training data change, both the domain and emotion classifiers readjust their weights to find a new representation that satisfies all conditions. Ultimately, the network will generate a domain-aligned data distribution and emotion-related features.
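The schedule for $\lambda_p$ and the combined objective can be written out directly. The sketch below uses $\gamma = 10$, the value reported in the implementation details; the function names and the example loss values are illustrative assumptions.

```python
import math

def lambda_schedule(p, gamma=10.0):
    """Adaptation weight lambda_p = 2 / (1 + exp(-gamma * p)) - 1 for p in [0, 1]."""
    return 2.0 / (1.0 + math.exp(-gamma * p)) - 1.0

def total_loss(l_c, l_d_g, l_d_l, l_d_h, lam):
    """L = L_c - lambda * (L_d^l + L_d^g + L_d^h)."""
    return l_c - lam * (l_d_l + l_d_g + l_d_h)

# lambda starts at 0 and rises smoothly toward 1 as training progresses.
for p in (0.0, 0.1, 0.5, 1.0):
    print(p, round(lambda_schedule(p), 4))

# Example with made-up loss values at mid-training.
example = total_loss(l_c=1.2, l_d_g=0.6, l_d_l=0.7, l_d_h=0.65, lam=lambda_schedule(0.5))
print(example)
```

Because $\lambda_p$ is zero at the start, the domain terms contribute nothing early on; their weight then grows quickly, and with $\gamma = 10$ it is already close to 1 by the midpoint of training.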
4 Experiments
4.1 Speech corpus
We used three datasets, namely IEMOCAP[28], CASIA[29], and MSP-improv[30], to assess the ability of our model to solve cross-corpus SER problems. These datasets differ in several dimensions: languages, number of speakers, and class distributions. They are all datasets of emotions performed by gender-balanced professional actors. In the evaluation experiments, we chose the samples of four emotion classes, i.e., neutral, angry, happy, and sad, from these three corpora such that they share the same emotion classes and can serve the cross-corpus speech emotion recognition tasks. Basic information on the selected emotion corpora is shown in Table 1. We considered the category balance within a corpus because IEMOCAP is a highly imbalanced dataset. The authors of [31-33] merged excitement with happy to achieve a more balanced label distribution. We followed this setting and used the same classes for the other two corpora. For the model evaluation, two of the three corpora were chosen to serve as the source and target domains, respectively. Then, the roles were swapped. Thus, there were six groups of experiments.
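The six experiment groups are simply the ordered pairs of distinct corpora, which can be enumerated in a few lines. The list name `corpora` is an illustrative choice.

```python
from itertools import permutations

corpora = ["IEMOCAP", "CASIA", "MSP-improv"]

# Each ordered pair (source, target) of distinct corpora is one experiment group.
pairs = list(permutations(corpora, 2))
for source, target in pairs:
    print(f"{source} -> {target}")
```

Three corpora give 3 x 2 = 6 ordered pairs, matching the six rows in each results table.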
Table 1 Overview of selected emotion corpora

| Corpus | Language | Subjects | Utterances | Emotions |
|---|---|---|---|---|
| IEMOCAP | English | 10 (5 males) | 5527 | Neutral, Happy, Angry, Sad |
| CASIA | Chinese | 4 (2 males) | 800 | Neutral, Happy, Angry, Sad |
| MSP-improv | English | 12 (6 males) | 1282 | Neutral, Happy, Angry, Sad |

Subjects: number of subjects; Utterances: number of utterances used.
IEMOCAP consists of speech recordings from ten professional actors. According to the recorded scenarios, each utterance can be further divided into an improvised or scripted speech section and annotated with one of eight emotion labels. To balance the proportion of each class, we merged excited and happy as happy. Finally, it consisted of neutral (
), angry (
), happy (
), and sad (1083), with
for the sum. The mean utterance length was
s with a standard deviation of
CASIA is a perfectly balanced Chinese corpus with six different emotions: anger, fear, happiness, sadness, surprise, and neutral. It was simulated by four professional actors, and each actor provided
MSP-Improv fixes the lexical content in each pair of conversations while conveying different emotions (angry, happy, sad, neutral). In this study, we selected the target-read and target-improvised sub-sections because they are the most emotion-related and distribution-balanced. The emotion distribution was neutral (524), angry (
), happy (
), and sad (
), with
for the sum. The mean length of utterance was
s with a standard deviation of
4.2 Implementation details
In all experiments, we set $N = 5$. A speech segment needs to be longer than 250 ms to present sufficient information for identifying emotions[18]; therefore, the number of sub-sentence segments should not be too large. Conversely, an extremely small number of chunks causes too much signal overlap, which interferes with the model's ability to process local information. We trained the network with an initial learning rate of
with a momentum of 0.9 and a weight decay of $1 \times 10^{-5}$. A batch normalization[28] layer was adopted after each convolutional layer and fully connected layer. For all baseline methods, we trained over 500 epochs with a batch size of 50. We set $\gamma$ to 10. The Adadelta optimizer was employed using the TensorFlow deep learning framework.
The global feature extractor $F_g$ had a channel size of
and kernel sizes of
for two layers, respectively, followed by the ReLU layer, which was used as the activation function. We built an
-dimensional encoding of each utterance after the global max pooling and dropout layer. The hybrid feature extractor
$F_h$ had a structure similar to that of $F_g$, except that its first kernel size was
. The three domain discriminators followed the same structure and hyperparameters as the emotion classifier.
5 Results
To verify the effectiveness of our model in cross-corpus SER, we conducted experiments with the following methods: (1) the TCA model[12], trained on global-level features; (2) DANN[21], trained on global-level features; (3) Local D, trained on local-level features with the local-level discriminator $D_l$; (4) Hybrid D, trained on hybrid-level features with the hybrid-level discriminator $D_h$; (5) Hybrid without D, trained on hybrid-level features for the classifier $C$ without a discriminator from any level; and (6) the MSDA model, trained on hybrid-level features for the classifier $C$ with the hierarchical discriminators $D_g$, $D_l$, and $D_h$ working together. The results are shown in Tables 2-4.
We adopted both weighted accuracy (WA) and unweighted accuracy (UA) as the metrics. No standard experimental paradigm exists for investigating cross-corpus SER performance. Many results are computed with randomly split training, validation, and test sets. In addition, other methods evaluate the given models on different combinations of corpora and emotions. Therefore, our findings are not directly comparable to most published results. Rather than aiming for state-of-the-art classification accuracy on these cross-corpus SER tasks, we focused on evaluating the performance of MSDA against classical transfer learning frameworks, namely TCA and DANN, as mentioned in Section 2. They were selected because various methods have been derived from them, which makes the comparisons more representative.
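The two metrics are not defined in the text. The sketch below assumes the convention common in SER work, where WA is the overall accuracy over all test utterances and UA is the average of the per-class recalls; the function names and the toy labels are illustrative.

```python
import numpy as np

def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: fraction of all samples classified correctly."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float((y_true == y_pred).mean())

def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so every class counts equally."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = [(y_pred[y_true == c] == c).mean() for c in np.unique(y_true)]
    return float(np.mean(recalls))

# Toy example with an imbalanced test set (class 0 dominates).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 3]
y_pred = [0, 0, 0, 0, 0, 1, 1, 0, 2, 0]
wa = weighted_accuracy(y_true, y_pred)
ua = unweighted_accuracy(y_true, y_pred)
print(wa, ua)
```

Under this convention the two metrics coincide whenever every class has the same number of test samples, which is consistent with the identical WA and UA values reported when the balanced CASIA corpus is the target.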
The results in Table 2 show that our MSDA model is superior to the traditional distribution alignment methods in cross-corpus SER. MSDA is a dynamic adversarial framework in which the parameters are adjusted according to the predicted results. Furthermore, it is a comprehensive neural network that minimizes the domain discrepancy at multiple timescales.
Table 2 Comparison between different baseline methods on three speech corpora

| Source | Target | TCA WA (%) | TCA UA (%) | DANN WA (%) | DANN UA (%) | MSDA WA (%) | MSDA UA (%) |
|---|---|---|---|---|---|---|---|
| CASIA | IEMOCAP | 30.17 | 28.80 | 29.56 | 30.37 | 34.88 | 32.51 |
| IEMOCAP | CASIA | 31.25 | 31.25 | 29.87 | 29.87 | 40.13 | 40.13 |
| CASIA | MSP-improv | 28.16 | 28.83 | 32.10 | 30.77 | 37.19 | 29.70 |
| MSP-improv | CASIA | 29.25 | 29.25 | 29.38 | 29.38 | 37.38 | 37.38 |
| IEMOCAP | MSP-improv | 30.21 | 28.25 | 33.25 | 30.03 | 43.43 | 34.59 |
| MSP-improv | IEMOCAP | 29.31 | 27.15 | 33.25 | 28.94 | 36.86 | 33.15 |
In Table 3, Global D is not shown because it is essentially the same as DANN. According to the table, Local D takes advantage of the sequential nature of speech signals and, consequently, achieves a slight increase over the use of utterance-level features only. These improvements can reasonably be attributed to the attention-based Bi-LSTM module. On this basis, Hybrid D absorbs hybrid-level features combining global-level and local-level features and then conducts adversarial training, which gains an obvious increase of $5.75\%$ in WA compared to DANN from MSP-Improv to CASIA. Powered by stronger discriminators, including $D_g$, $D_l$, and $D_h$, MSDA provides compelling evidence that our model improves on the single-level discriminative ability. It bolsters the performance by $10.01\%$ ($43.43\%$ versus $33.42\%$) in WA and $13.75\%$ ($41.19\%$ versus $27.44\%$) in UA from MSP-Improv to IEMOCAP. This demonstrates that it is not sufficient to conduct domain discrimination on only a single level.
Table 3 Comparison between discriminators in different timescales on three speech corpora

| Source | Target | Local D WA (%) | Local D UA (%) | Hybrid D WA (%) | Hybrid D UA (%) | MSDA WA (%) | MSDA UA (%) |
|---|---|---|---|---|---|---|---|
| CASIA | IEMOCAP | 31.51 | 32.85 | 31.01 | 30.76 | 34.88 | 32.51 |
| IEMOCAP | CASIA | 32.87 | 32.87 | 32.87 | 32.87 | 40.13 | 40.13 |
| CASIA | MSP-improv | 32.43 | 27.85 | 32.43 | 27.85 | 37.19 | 29.70 |
| MSP-improv | CASIA | 33.63 | 33.63 | 33.63 | 33.63 | 37.38 | 37.38 |
| IEMOCAP | MSP-improv | 33.42 | 33.05 | 33.42 | 33.05 | 43.43 | 34.59 |
| MSP-improv | IEMOCAP | 31.13 | 32.16 | 35.86 | 30.67 | 36.86 | 33.15 |
Hybrid without D and MSDA were both trained on hybrid-level features. It is observed in Table 4 that, when inputting the same hybrid-level features derived from the global and local levels, the absence of $D_g$, $D_l$, and $D_h$ leads to a significant performance decline of up to $7.76\%$ in WA and UA from MSP-Improv to CASIA. This is due to the immense discrepancy in the distribution between the source and target corpora or between different languages. It reveals that a model trained on one corpus cannot be directly applied to another unlabeled corpus.
Table 4 Comparison of cases with and without hierarchical discriminators on three speech corpora

| Source | Target | Hybrid without D WA (%) | Hybrid without D UA (%) | MSDA WA (%) | MSDA UA (%) |
|---|---|---|---|---|---|
| CASIA | IEMOCAP | 27.32 | 24.45 | 34.88 | 32.51 |
| IEMOCAP | CASIA | 30.13 | 30.13 | 40.13 | 40.13 |
| CASIA | MSP-improv | 30.21 | 27.13 | 37.19 | 29.70 |
| MSP-improv | CASIA | 27.37 | 27.37 | 37.38 | 37.38 |
| IEMOCAP | MSP-improv | 28.16 | 25.08 | 43.43 | 34.59 |
| MSP-improv | IEMOCAP | 33.62 | 24.96 | 36.86 | 33.15 |
By combining multi-scale features and hierarchical discriminators of three levels, significant improvement can be observed from MSP-Improv to IEMOCAP: $15.27\%$ ($43.43\%$ versus $28.16\%$) in WA and $8.79\%$ ($34.59\%$ versus $25.80\%$) in UA. We can also observe that cross-language SER is more difficult than cross-corpus SER within the same language. Because English and Chinese belong to different language families, there are many differences in expression modes, language habits, and grammar rules. MSDA obtained the best overall performance among all the methods, which implies that jointly reducing the distribution differences between training and testing data at different timescales can aid cross-domain scenarios.
6 Conclusion
In this paper, we proposed the multi-scale discrepancy adversarial (MSDA) network for cross-corpus speech emotion recognition. MSDA integrates multiple timescales of deep speech features to train a set of hierarchical domain discriminators and an emotion classifier simultaneously in an adversarial training network. Our model reduces the distribution discrepancy in a hierarchical manner while preserving emotion-related information between different corpora and multiple languages. Powered by strong discriminators, the model achieved the best performance among the compared methods. The results of extensive experiments conducted on three corpora demonstrated the effectiveness of our model.



[1] Swain M, Routray A, Kabisatpathy P. Databases, features and classifiers for speech emotion recognition: a review. International Journal of Speech Technology, 2018, 21(1): 93–120 DOI:10.1007/s10772-018-9491-z

[2] Vryzas N, Kotsakis R, Liatsou A, Dimoulas C, Kalliris G. Speech emotion recognition for performance interaction. Journal of the Audio Engineering Society, 2018, 66(6): 457–467 DOI:10.17743/jaes.2018.0036

[3] Kotsakis R, Dimoulas C, Kalliris G, Veglis A. Emotional prediction and content profile estimation in evaluating audiovisual mediated communication. International Journal of Monitoring and Surveillance Technologies Research, 2014, 2(4): 62–80 DOI:10.4018/ijmstr.2014100104

[4] Gideon J, McInnis M, Mower Provost E. Improving cross-corpus speech emotion recognition with adversarial discriminative domain generalization (ADDoG). IEEE Transactions on Affective Computing, 2019: 1 DOI:10.1109/taffc.2019.2916092

[5] Schuller B, Vlasenko B, Eyben F, Wollmer M, Stuhlsatz A, Wendemuth A, Rigoll G. Cross-corpus acoustic emotion recognition: variances and strategies. IEEE Transactions on Affective Computing, 2010, 1(2): 119–131 DOI:10.1109/t-affc.2010.8

[6] Ntalampiras S. Toward language-agnostic speech emotion recognition. Journal of the Audio Engineering Society, 2020, 68(1/2): 7–13 DOI:10.17743/jaes.2019.0045

[7] Sahu S, Gupta R, Espy-Wilson C. On enhancing speech emotion recognition using generative adversarial networks. In: Interspeech 2018. ISCA, 2018 DOI:10.21437/interspeech.2018-1883

[8] Bao F, Neumann M, Vu N T. CycleGAN-based emotion style transfer as data augmentation for speech emotion recognition. In: Interspeech 2019. ISCA, 2019, 35–37 DOI:10.21437/interspeech.2019-2293

[9] Han J, Zhang Z, Ren Z, Ringeval F, Schuller B. Towards conditional adversarial training for predicting emotions from speech. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2018, 6822–6826 DOI:10.1109/ICASSP.2018.8462579

[10] Salamon J, Bello J P. Deep convolutional neural networks and data augmentation for environmental sound classification. IEEE Signal Processing Letters, 2017, 24(3): 279–283 DOI:10.1109/lsp.2017.2657381

[11] Abdelwahab M, Busso C. Domain adversarial for acoustic emotion recognition. ACM Transactions on Audio, Speech, and Language Processing, 2018, 26(12): 2423–2435 DOI:10.1109/taslp.2018.2867099

[12] Pan S J, Tsang I W, Kwok J T, Yang Q. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks, 2011, 22(2): 199–210 DOI:10.1109/tnn.2010.2091281

[13] Ververidis D, Kotropoulos C. Emotional speech recognition: resources, features, and methods. Speech Communication, 2006, 48(9): 1162–1181 DOI:10.1016/j.specom.2006.04.003

[14] Fayek H M, Lech M, Cavedon L. Evaluating deep learning architectures for speech emotion recognition. Neural Networks, 2017, 92: 60–68 DOI:10.1016/j.neunet.2017.02.013

[15] Schuller B, Vlasenko B, Eyben F, Rigoll G, Wendemuth A. Acoustic emotion recognition: A benchmark comparison of performances. In: 2009 IEEE Workshop on Automatic Speech Recognition & Understanding. 2009, 552–557 DOI:10.1109/ASRU.2009.5372886

[16] Jeon J H, Xia R, Liu Y. Sentence level emotion recognition based on decisions from subsentence segments. In: 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2011, 4940–4943 DOI:10.1109/ICASSP.2011.5947464

[17] Vryzas N, Vrysis L, Matsiola M, Kotsakis R, Dimoulas C, Kalliris G. Continuous speech emotion recognition with convolutional neural networks. Journal of the Audio Engineering Society, 2020, 68(1/2): 14–24

[18] Zhang S, Huang T, Gao W. Speech emotion recognition using deep convolutional neural network and discriminant temporal pyramid matching. IEEE Transactions on Multimedia, 2018, 20(6): 1576–1590 DOI:10.1109/TMM.2017.2766843

[19] Vrysis L, Tsipas N, Thoidis I, Dimoulas C. 1D/2D deep CNNs vs. temporal feature integration for general audio classification. Journal of the Audio Engineering Society, 2020, 68(1/2): 66–77

[20] Goodfellow I J, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y. Generative adversarial nets. In: Proceedings of the 27th International Conference on Neural Information Processing Systems. MIT Press, 2014, 2672–2680

[21] Ganin Y, Ustinova E, Ajakan H, Germain P, Larochelle H, Laviolette F, Marchand M, Lempitsky V. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 2016, 17(1): 2096–2030

[22] Müller C. The interspeech 2010 paralinguistic challenge. Proc Interspeech, 2010, 2794–2797

[23] Eyben F, Wöllmer M, Schuller B. Opensmile: the munich versatile and fast open-source audio feature extractor. In: Proceedings of the 18th ACM international conference on Multimedia. Firenze, Italy, Association for Computing Machinery, 2010, 1459–1462 DOI:10.1145/1873951.1874246

[24] Qian K, Zhang Y, Chang S, Cox D, Hasegawa-Johnson M. Unsupervised speech decomposition via triple information bottleneck. 2020

[25] Metallinou A, Wollmer M, Katsamanis A, Eyben F, Schuller B, Narayanan S. Context-sensitive learning for enhanced audiovisual emotion classification. IEEE Transactions on Affective Computing, 2012, 3(2): 184–198 DOI:10.1109/T-AFFC.2011.40

[26] Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate. Computer Science, 2014

[27] Vinyals O, Kaiser L, Koo T, Petrov S, Sutskever I, Hinton G. Grammar as a foreign language. In: Proceedings of the 28th International Conference on Neural Information Processing Systems-Volume 2. Montreal, Canada, MIT Press, 2015, 2773–2781

[28] Busso C, Bulut M, Lee C-C, Kazemzadeh A, Mower E, Kim S, Chang J N, Lee S, Narayanan S S. IEMOCAP: interactive emotional dyadic motion capture database. Language Resources and Evaluation, 2008, 42(4): 335 DOI:10.1007/s10579-008-9076-6

[29] The selected speech emotion database of institute of automation chinese academy of sciences (CASIA). http://www.datatang.com/data/39277

[30] Busso C, Parthasarathy S, Burmania A, AbdelWahab M, Sadoughi N, Provost E M. MSP-IMPROV: An acted corpus of dyadic interactions to study emotion perception. IEEE Transactions on Affective Computing, 2017, 8(1): 67–80 DOI:10.1109/TAFFC.2016.2515617

[31] Li H, Tu M, Huang J, Narayanan S, Georgiou P. Speaker-invariant affective representation learning via adversarial training. In: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2020, 7144–7148 DOI:10.1109/ICASSP40776.2020.9054580

[32] Ghosh S, Laksana E, Morency L P, Scherer S. Representation learning for speech emotion recognition. Interspeech, 2016, 3603–3607

[33] Xu Y, Xu H, Zou J. HGFM: A hierarchical grained and feature model for acoustic emotion recognition. In: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2020, 6499–6503 DOI:10.1109/ICASSP40776.2020.9053039