

2021, 3(1): 55-64

Published Date: 2021-2-20 DOI: 10.1016/j.vrih.2020.11.005


Continuous emotion recognition as a function of time assigns emotional values to every frame in a sequence. Incorporating long-term temporal context information is essential for continuous emotion recognition tasks.
For this purpose, we employ a window of feature frames in place of a single frame as inputs to strengthen the temporal modeling at the feature level. The ideas of frame skipping and temporal pooling are utilized to alleviate the resulting redundancy. At the model level, we leverage the skip recurrent neural network to model the long-term temporal variability by skipping trivial information for continuous emotion recognition.
The experimental results on the AVEC 2017 database demonstrate that the proposed methods improve performance. Further, the skip long short-term memory (LSTM) model can focus on the critical emotional states during training, thereby achieving a better performance than the LSTM model and other methods.


1 Introduction
Emotion recognition has become an increasingly important research topic owing to its essential role in human communication[1]. A person's emotional state can be described using a dimensional emotion model, which uses numerical values to indicate the type and degree of emotions continuously[2], such as arousal (calm versus active) and valence (negative versus positive). The dimensional emotion model has the advantage of being able to represent multiple subtle and complex emotional states.
Prior studies have investigated various discriminative emotional features and effective emotion recognition methods[3,4]. However, an emotional state is a gradual and smooth process with temporal changes. Therefore, it is essential to incorporate long-term temporal context information when making a dense frame-level prediction for continuous emotion recognition. Two types of approaches exploit long-term temporal contexts: those that build the context into feature-level representations, and those that use a model to learn the long-term dependencies from short-term feature representations.
Many emotional feature-level representations have been designed to contain temporal context information. Valstar et al. showed that it was necessary to consider a larger window to obtain segment-level features, e.g., 4 s for arousal and 6 s for valence[5]. Le et al. and Cardinal et al. found that increasing the number of contextual frames with deep neural networks can improve the performance[6,7]. In addition, Povolny et al. applied simple frame stacking and temporal content summarization to consider contextual information[8]. Moreover, Huang et al. utilized temporal pooling subsampling to obtain wider contexts and achieve emotional temporal modeling[3], and Han et al. extracted segment-level bag-of-audio-words (BoAW) features to encapsulate the contexts with neighboring frames[9]. In speech recognition, incorporating long-term temporal information by splicing the input sequences is a common practice for improving performance[10,11]. In addition, Zeyer et al. used max-pooling in the time dimension to downscale the original inputs for speech recognition[12].
The above methods operate at the feature level to obtain temporal context information. However, these methods still depend on the temporal modeling ability of the emotion recognition model. Researchers have explored many models to learn emotional dynamic information from sequence signals. For example, Wöllmer et al. first employed a long short-term memory (LSTM) network to perform continuous emotion recognition[13], and Khorram et al. investigated two convolutional network architectures, dilated convolutional networks and downsampling–upsampling networks, to learn emotional long-term temporal dependencies[14]. In addition, Zhang et al. utilized a recurrent neural network (RNN) to learn spatial and temporal dependencies for emotion recognition[15], whereas Li et al. proposed a convolutional neural network (CNN) with two different groups of filters to achieve a similar purpose[16]. Finally, Trigeorgis et al. trained a convolutional recurrent network using time-domain signals directly[17].
Most results have indicated that the performance of an RNN is superior to that of a CNN in terms of continuous emotion recognition. At the model level, LSTM-RNNs have become common deep neural network structures for continuous emotion recognition because they naturally learn emotional dynamic information between emotional features. However, the training of basic RNNs on long sequences often faces numerous challenges such as slow inference, vanishing gradients, and difficulty in capturing long-term dependencies. A simple solution is to shorten the continuous input signals, for example by segmenting the training samples[18] or by frame skipping[19]; however, these approaches are based on heuristics and can be suboptimal. In addition, they operate only at the feature level.
Fortunately, researchers have proposed many improved RNNs for enhancing the ability to model long sequences. A more recent idea is to skip the input information at certain steps, as in LSTM-Jump[20], skim RNN[21], and skip RNN[22]. Each of these methods can skip unimportant parts of the input signals to reduce the length that must be processed. In particular, a skip RNN can select important parts of the input signals adaptively by determining whether the state needs to be updated or copied from the previous time step. This idea is similar to the attention mechanism that actively selects the most emotionally salient frames[23]. In this study, we take advantage of a skip RNN with emotional long-term feature representations to learn temporal contexts for continuous emotion recognition. We focus on dimensional emotion recognition; the inputs of the emotion recognition model are audio or video features, and the outputs are arousal and valence.
In the following, Section 2 briefly introduces the proposed methods. Section 3 presents the database and feature sets. Section 4 describes the experimental results and analysis. Finally, Section 5 provides some concluding remarks regarding this research.
2 Methods
In this study, the models learn long-term temporal context information to improve the performance of continuous emotion recognition. At the feature level, we preprocess the features with a large window of consecutive frames to cover the temporal contexts. Meanwhile, frame skipping and temporal pooling are used to alleviate the impact of redundant information. At the model level, a skip RNN is utilized to focus on more valuable information when modeling long sequences for continuous emotion recognition.
2.1 Data processing
We leverage four simple data processing methods to incorporate temporal contexts at the feature level. A straightforward method is to take a window of feature frames from the past and the future. Specifically, consecutive emotional features within the window are concatenated as the inputs, as shown in Figure 1a and Figure 1b. The difference is that the features are stacked with overlapping frames in Figure 1a, as represented by "Context1," whereas those in Figure 1b are based on frame skipping without overlapping frames, as represented by "Context2." However, these methods result in some redundancy of the emotional features. Thus, we introduce temporal pooling, which averages the features within the window instead of concatenating them, as shown in Figure 1c and represented by "Context3." The operation of "Context1" followed by "Context3" is shown in Figure 1d and is represented by "Context4." Note that, when training the models for "Context2," "Context3," and "Context4," we average the labels within the same window to match the shortened length of the emotional features. The window length is regarded as a parameter to be optimized.
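The four schemes can be sketched in a few lines of standard-library Python. This is only an illustrative sketch, not the implementation used in the experiments: the function names (`context1` to `context4`, `pool_labels`) are hypothetical, and the text does not specify the edge handling, so the sketch assumes that Context1 uses a window centred on the current frame with the first and last frames repeated as padding, and that trailing frames which do not fill a complete window are dropped.

```python
from statistics import fmean


def context1(frames, window):
    """Stack every frame with its neighbours (overlapping windows).

    Assumptions: the window is centred on the current frame, and the
    sequence edges are padded by repeating the first/last frame.
    """
    half = window // 2
    last = len(frames) - 1
    stacked = []
    for t in range(len(frames)):
        row = []
        for k in range(t - half, t - half + window):
            row.extend(frames[min(max(k, 0), last)])
        stacked.append(row)
    return stacked


def context2(frames, window):
    """Concatenate non-overlapping windows (frame skipping)."""
    out = []
    for start in range(0, len(frames) - window + 1, window):
        row = []
        for frame in frames[start:start + window]:
            row.extend(frame)
        out.append(row)
    return out


def context3(frames, window):
    """Average the frames of each non-overlapping window (mean pooling)."""
    out = []
    for start in range(0, len(frames) - window + 1, window):
        chunk = frames[start:start + window]
        out.append([fmean(column) for column in zip(*chunk)])
    return out


def context4(frames, window):
    """Context1 followed by Context3."""
    return context3(context1(frames, window), window)


def pool_labels(labels, window):
    """Average the labels per window to match the shortened sequences."""
    return [fmean(labels[start:start + window])
            for start in range(0, len(labels) - window + 1, window)]


# Six one-dimensional frames, window of three frames.
frames = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
c1 = context1(frames, 3)   # 6 frames, 3 values each
c2 = context2(frames, 3)   # 2 frames, 3 values each
c3 = context3(frames, 3)   # 2 frames, 1 value each
c4 = context4(frames, 3)   # 2 frames, 3 values each
```

Context1 keeps the sequence length and multiplies the feature dimension by the window length, whereas Context2, Context3, and Context4 divide the sequence length by the window length, which is why their labels must be pooled in the same way.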
2.2 Skip RNN
We introduce a skip RNN model to conduct continuous emotion recognition. A skip RNN, a variant of an RNN, performs fewer state updates, which decreases the number of sequential operations and provides faster convergence on long sequences. As a result, trivial information can be skipped, and a superior ability to model long-term dependencies for continuous emotion recognition is achieved.
A standard RNN takes an input sequence $x = (x_1, \ldots, x_T)$ and generates a state sequence $s = (s_1, \ldots, s_T)$ by iteratively applying a parametric state transition model $S$ from $t = 1$ to $T$:

$$s_t = S(s_{t-1}, x_t)$$
A skip RNN has a binary state update gate, $u_t \in \{0, 1\}$, selecting whether the state of the RNN will be updated ($u_t = 1$) or copied from the previous time step ($u_t = 0$). At every time step $t$, the probability $\tilde{u}_{t+1} \in [0, 1]$ of applying a state update at $t + 1$ is emitted. The resulting architecture is depicted in Figure 2a and can be characterized as follows:

$$u_t = f_{binarize}(\tilde{u}_t)$$
$$s_t = u_t \cdot S(s_{t-1}, x_t) + (1 - u_t) \cdot s_{t-1}$$
$$\Delta\tilde{u}_t = \sigma(W_p s_t + b_p)$$
$$\tilde{u}_{t+1} = u_t \cdot \Delta\tilde{u}_t + (1 - u_t) \cdot \left(\tilde{u}_t + \min(\Delta\tilde{u}_t, 1 - \tilde{u}_t)\right)$$

where $W_p$ is a weight vector, $b_p$ is a scalar bias, $\sigma$ is the sigmoid function, and $f_{binarize}: [0, 1] \to \{0, 1\}$ binarizes the input value. In addition, $f_{binarize}$ is a deterministic step function, $u_t = \mathrm{round}(\tilde{u}_t)$, and its gradient is set to $\partial f_{binarize} / \partial x = 1$ during a backward pass[22]. Thus, the model can be trained to minimize the target loss function with standard backpropagation without defining any additional parameters.
Whenever a state update is omitted, the pre-activation of the state update gate for the following time step, $\tilde{u}_{t+1}$, is incremented by $\Delta\tilde{u}_t$. When a state update is conducted ($u_t = 1$), as shown in Figure 2b, the accumulated value is flushed and $\tilde{u}_{t+1} = \Delta\tilde{u}_t$. When the model skips the state update ($u_t = 0$), as shown in Figure 2c, $s_t = s_{t-1}$, which enables more efficient implementations in which no computations are conducted at all. Thus, redundant computations are avoided because

$$\Delta\tilde{u}_t = \sigma(W_p s_t + b_p) = \sigma(W_p s_{t-1} + b_p) = \Delta\tilde{u}_{t-1},$$

and there is no need to evaluate it again, as shown in Figure 2d. This enables the skip RNN to shield the hidden state over a longer period of time.
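The update rule described in this section can be illustrated with a small forward-pass sketch in plain Python. It is a simplified illustration, not the trained model: `skip_rnn_forward` and the toy `cell` are hypothetical names, the cell stands in for the transition model S (an LSTM in the experiments), the initial gate value of 1 is an assumption that forces an update at the first step, and the binarization uses an explicit threshold at 0.5. Training with the straight-through gradient is not covered.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def skip_rnn_forward(inputs, cell, s0, w_p, b_p):
    """Run the skip-RNN recurrence and return (states, gates).

    cell(s_prev, x) plays the role of S and returns the new state as a
    list of floats; w_p is a weight vector and b_p a scalar bias.
    """
    s = list(s0)
    u_tilde = 1.0      # assumption: the first time step always updates
    delta = None       # most recent increment, reused while skipping
    states, gates = [], []
    for x in inputs:
        u = 1 if u_tilde >= 0.5 else 0            # binarize by rounding
        if u == 1:
            s = cell(s, x)                        # state update
            delta = sigmoid(sum(w * v for w, v in zip(w_p, s)) + b_p)
            u_tilde = delta                       # accumulated value is flushed
        else:
            # State is copied, so delta is unchanged and not recomputed.
            u_tilde = u_tilde + min(delta, 1.0 - u_tilde)
        states.append(list(s))
        gates.append(u)
    return states, gates


def toy_cell(s_prev, x):
    # Hypothetical one-unit tanh cell used only for this illustration.
    return [math.tanh(0.5 * s_prev[0] + x)]


# With w_p = 0 the increment is constant: sigmoid(b_p) = 0.3.
states, gates = skip_rnn_forward(
    inputs=[1.0] * 6, cell=toy_cell, s0=[0.0],
    w_p=[0.0], b_p=math.log(0.3 / 0.7))
```

In this toy setting the gate value drops to 0.3 after each update, accumulates to 0.6 at the next step, and triggers an update again, so the model alternates between updating and copying its state.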
3 Database and feature sets
3.1 Database
We use the Audio/Visual Emotion Challenge and Workshop (AVEC 2017) database to show the benefits of our proposed methods[24]. This database collects spontaneous and naturalistic human-human interactions in the wild, consisting of audio, visual, and textual modalities. The recordings are annotated time-continuously in terms of the emotional dimensions, including arousal for the emotion activation and valence for the emotion positiveness. All emotional dimensions are annotated every 100 ms and scaled into [-1, +1]. There are 64 German subjects in the dataset, which is divided into a training set with 34 subjects, a development set with 14 subjects, and a testing set with 16 subjects. We focus on estimating arousal and valence in this study.
3.2 Feature sets
In this study, we utilize the audio and visual modalities for continuous emotion recognition. The extended Geneva minimalistic acoustic parameter set (eGeMAPS)[25] is used as the audio feature set. The segment-level acoustic features are computed over a 4-s window, resulting in 88-dimensional features. The extraction of the low-level descriptors (LLDs) and the computation of the functionals are conducted using openSMILE[26]. Geometric features are used as the visual features, including facial landmark locations, facial action units, head pose features, and eye gaze features extracted by OpenFace[27]. We then apply a principal component analysis to reduce the dimensionality while retaining 90% of the variance, resulting in 55 dimensions.
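As a small illustration of the variance criterion used for the visual features, the sketch below picks the smallest number of principal components whose cumulative explained variance reaches a target ratio. The function name `components_for_variance` and the eigenvalues are hypothetical; the sketch assumes that the eigenvalues of the feature covariance matrix have already been computed, and it is not the pipeline used in the experiments.

```python
def components_for_variance(eigenvalues, target=0.90):
    """Return the smallest k whose top-k eigenvalues explain >= target.

    eigenvalues: variances along the principal axes (any order).
    """
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running / total >= target:
            return k
    return len(ordered)


# Made-up eigenvalues: cumulative ratios are 0.50, 0.80, 0.95, 0.98, 1.00.
k = components_for_variance([5.0, 3.0, 1.5, 0.3, 0.2], target=0.90)
```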
4 Experiment and analysis
4.1 Experimental setup
We utilize an LSTM as the basic continuous emotion recognition model. Both the LSTM and skip LSTM models have one hidden layer with 64 hidden nodes. We use the Adam optimization algorithm[28] with a dropout rate of 0.5 and a batch size of 3. The maximum number of training epochs is 100. The evaluation measure is the concordance correlation coefficient (CCC)[29]. Because the labels of the testing set are not available, we use the training set to train the models and the development set to evaluate the performance.
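For reference, the CCC between a prediction sequence and the ground truth is 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²). The sketch below is a minimal standard-library version, assuming population (1/N) variances and covariance as in Lin's original definition[29]; the function name `ccc` is ours, and the result is undefined (division by zero) when both sequences are constant and equal.

```python
from statistics import fmean


def ccc(pred, gold):
    """Concordance correlation coefficient of two equal-length sequences."""
    mean_p, mean_g = fmean(pred), fmean(gold)
    var_p = fmean([(p - mean_p) ** 2 for p in pred])
    var_g = fmean([(g - mean_g) ** 2 for g in gold])
    cov = fmean([(p - mean_p) * (g - mean_g) for p, g in zip(pred, gold)])
    return 2.0 * cov / (var_p + var_g + (mean_p - mean_g) ** 2)


perfect = ccc([0.1, 0.4, -0.2], [0.1, 0.4, -0.2])   # identical sequences
shifted = ccc([0.0, 1.0, 2.0], [1.0, 2.0, 3.0])     # same shape, offset by 1
reversed_ = ccc([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])   # perfectly anti-correlated
```

Unlike the Pearson correlation, the CCC penalizes a constant offset: the shifted pair is perfectly correlated but only reaches a CCC of 4/7.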
4.2 Effect of data processing
To incorporate temporal contexts, we use several novel methods to process the input features at the feature level. In this section, we describe the experiments conducted using a basic LSTM model.
First, we explore the effect of the window length for these methods. Figure 3 shows the experimental results of the audio features in terms of arousal and valence against the length of the window, which denotes the number of frames. The performance improves as the window length increases, showing that a longer window is necessary to model the emotional temporal context information. In particular, Context3 requires a longer window than the other methods because it compresses the features to obtain extremely long-span contexts. The results indicate that the optimal window length for valence is larger than that for arousal; thus, valence requires longer temporal contexts, which is consistent with [5]. The conclusions drawn from the visual features are similar to those from the audio features, except that the visual features require longer windows; these results are not displayed owing to limited space.
The performance of the different data processing methods is shown in Table 1, where the window lengths are those that gave the optimal results in Figure 3. The Original models were trained without data processing. The results indicate that the performance of Context1 is worse than that of the Original models in terms of arousal. We speculate that the LSTM is able to store past information internally; thus, the concatenation of consecutive frames increases the information redundancy, which results in a degraded performance. To address this problem, we borrow the ideas of frame skipping with concatenation for Context2 and mean temporal pooling for Context3, which in most cases achieve a better performance than Context1 and the Original models. This indicates that it is important to alleviate the redundancy of the emotional features, which effectively improves the performance. In addition, frame skipping and temporal pooling shorten the input signals, which allows the LSTM to access longer-range sequence information.
Table 1 Performance of different methods with LSTM model using audio and visual features in terms of arousal and valence

| Method | Arousal (Audio) | Arousal (Visual) | Valence (Audio) | Valence (Visual) |
|---|---|---|---|---|
| Original | 0.473 | 0.589 | 0.423 | 0.650 |
| Context1 | 0.431 | 0.571 | 0.431 | 0.655 |
| Context2 | 0.484 | 0.584 | 0.469 | 0.695 |
| Context3 | 0.512 | 0.602 | 0.455 | 0.736 |
| Context4 | 0.464 | 0.611 | 0.436 | 0.701 |
The best results for both arousal and valence are obtained with the visual features, specifically 0.611 for Context4 under arousal and 0.736 for Context3 under valence. What Context3 and Context4 have in common is the temporal pooling operation that compresses the features. Thus, temporal pooling is essential to an emotion recognition model based on visual features. When using the audio features, the optimal performance is 0.512 for Context3 under arousal and 0.469 for Context2 under valence. What Context3 and Context2 have in common is the frame skipping operation that covers wider temporal contexts. Thus, frame skipping is more effective for emotion recognition models based on audio features. The results reveal the effectiveness of our proposed data processing methods in incorporating long-term context information, which improves the performance. In addition, with the visual features, the performance of valence is better than that of arousal, which is similar to previous studies[3,4].
4.3 Continuous emotion recognition with skip RNN
It is important to consider long-term temporal dependencies to improve the performance of continuous emotion recognition. Building on the strong temporal modeling capability of the LSTM model, we further investigate the skip LSTM model, which captures longer temporal information by performing fewer state updates.
The experimental results are presented in Table 2. Context2 and Context3 achieve a better performance than other methods when using the skip LSTM model. Context2 obtains a better performance of 0.622 in terms of arousal with the visual features and 0.447 in terms of valence with the audio features. Context3 obtains a better performance of 0.531 in terms of arousal with the audio features and 0.698 in terms of valence with the visual features. This shows that frame skipping plays an important role in the emotion recognition model based on the skip LSTM model.
Table 2 Performance of different methods with skip LSTM model using audio and visual features in terms of arousal and valence

| Method | Arousal (Audio) | Arousal (Visual) | Valence (Audio) | Valence (Visual) |
|---|---|---|---|---|
| Original | 0.487 | 0.483 | 0.420 | 0.634 |
| Context1 | 0.473 | 0.472 | 0.415 | 0.641 |
| Context2 | 0.465 | 0.622 | 0.447 | 0.690 |
| Context3 | 0.531 | 0.603 | 0.427 | 0.698 |
| Context4 | 0.492 | 0.562 | 0.441 | 0.666 |
Comparing Table 2 with Table 1, we can observe that the skip LSTM model achieves a better performance in terms of arousal with the audio features (0.531 versus 0.512) and the visual features (0.622 versus 0.611), whereas the LSTM model achieves a better performance in terms of valence with the audio features (0.469 versus 0.447) and the visual features (0.736 versus 0.698). The skip LSTM therefore does not show advantages over the LSTM in all cases. Its gains for arousal are perhaps because the skip LSTM model can access a longer temporal context in Context2 and Context3, which is beneficial to arousal recognition, whereas its results for valence may be limited by the insufficiency of the training data. Overall, incorporating long-term temporal context information at the feature level and the model level together can improve the performance.
4.4 Analysis
The true benefit of the skip LSTM is that it performs fewer state updates and thereby focuses on more valuable information when modeling longer sequences. We calculate the state update rates based on the feature frames when training the skip LSTM models. Table 3 compares the average state update rates, together with the standard deviation over five trials, for the different methods. The Original models and Context1 generally have low state update rates, which implies that many frames of the input signals are trivial and can be skipped. Context2 still has some redundancy after frame skipping with concatenation. Context3 and Context4 have high state update rates because temporal pooling has relieved the redundancy to a large extent. The visual features generally have lower state update rates than the audio features, which suggests that they contain more redundant information. The results provide evidence that the skip LSTM can effectively alleviate the redundant information that exists in continuous emotion recognition tasks.
Table 3 State update rates (mean±std %) of skip LSTM model using audio and visual features in terms of arousal and valence

| Method | Arousal (Audio) | Arousal (Visual) | Valence (Audio) | Valence (Visual) |
|---|---|---|---|---|
| Original | 59.17±8.89 | 52.91±5.90 | 84.43±1.56 | 59.09±0.92 |
| Context1 | 65.00±2.74 | 50.02±0.73 | 59.02±3.78 | 45.27±3.53 |
| Context2 | 89.02±1.93 | 69.06±1.91 | 88.39±3.10 | 81.56±3.03 |
| Context3 | 93.01±1.93 | 91.54±2.04 | 95.29±1.66 | 89.03±1.07 |
| Context4 | 99.29±1.83 | 76.92±3.95 | 99.99±0.01 | 82.55±0.94 |
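A state update rate of this kind can be computed from the binary update gates as the percentage of frames on which the state was updated, and then summarized over trials. The sketch below is a hedged illustration with made-up gate sequences; the function names are hypothetical, and the use of the population standard deviation is an assumption, since the text does not state which estimator was used.

```python
from statistics import fmean, pstdev


def update_rate(gates):
    """Percentage of frames whose binary update gate equals 1."""
    return 100.0 * sum(gates) / len(gates)


def summarize_trials(gates_per_trial):
    """Mean and (population) standard deviation of the rate over trials."""
    rates = [update_rate(gates) for gates in gates_per_trial]
    return fmean(rates), pstdev(rates)


# Two made-up trials with update rates of 50% and 100%.
mean_rate, std_rate = summarize_trials([[1, 0, 1, 0], [1, 1, 1, 1]])
```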
We take an example of visualization with the Original models in terms of arousal, as shown in Figure 4. The blue line is the ground truth, the black line indicates the predictions, and the red dots denote the updated frames. Figure 4 illustrates that the updated frames concentrate on critical fluctuating and low-value areas, which helps to model complicated and subtle emotional states well.
We compared our results with the baseline results[24] and the winner[4] of AVEC 2017. Ringeval et al. utilized a static SVM model as an emotional model[24]. Our models can learn long-term temporal dependencies; thus, they clearly outperform the baseline results. Our proposed methods also achieve a better performance than the approach of Chen et al.[4], which utilized a basic LSTM model for emotion recognition, in three of the four single-modality settings; the exception is arousal with the visual features (0.622 versus 0.623). The skip LSTM model focuses on the modeling of crucial emotional states based on context information, which promotes the ability of continuous emotional temporal modeling compared with the LSTM model.
Table 4 CCC comparison between our methods and other approaches

| Method | Arousal (Audio) | Arousal (Visual) | Valence (Audio) | Valence (Visual) |
|---|---|---|---|---|
| Ringeval et al.[24] | 0.344 | 0.466 | 0.351 | 0.400 |
| Chen et al.[4] | 0.526 | 0.623 | 0.447 | 0.708 |
| Our methods | 0.531 | 0.622 | 0.469 | 0.736 |
5 Conclusion
In this study, techniques for incorporating long-term temporal context information for continuous emotion recognition were considered. We widened the view of the feature frames with frame skipping and temporal pooling at the feature level and employed a skip RNN at the model level. The experimental results reveal that a longer window is necessary to model the temporal emotional contexts. It is also important to alleviate the redundancy of the emotional features. We illustrated the benefit of the skip LSTM in the modeling of long-term temporal dependencies: it skips trivial information to focus on critical emotional values. Therefore, the skip LSTM can model subtle and complicated emotional states based on the context information for continuous emotion recognition. Our methods improve on the performance of the LSTM model for arousal and achieve a better performance than the other approaches in most settings. These promising results suggest that the skip LSTM can lead to important improvements in continuous emotion recognition. In the future, we will further improve the performance of the skip LSTM and explore multimodal emotion fusion with temporal modeling.



References
[1] Tao J, Tan T. Affective computing: a review. In: Proceedings of International Conference on Affective Computing and Intelligent Interaction. Springer, 2005
[2] Gunes H, Pantic M. Automatic, dimensional and continuous emotion recognition. International Journal of Synthetic Emotions, 2010, 1(1): 68–99 DOI:10.4018/jse.2010101605
[3] Huang J, Li Y, Tao J, Lian Z, Yi J. Continuous multimodal emotion prediction based on long short term memory recurrent neural network. In: Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge. ACM, 2017
[4] Chen S Z, Jin Q, Zhao J M, Wang S. Multimodal multi-task learning for dimensional and continuous emotion recognition. In: Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge-AVEC '17. Mountain View, California, USA, New York, ACM Press, 2017 DOI:10.1145/3133944.3133949
[5] Valstar M, Pantic M, Gratch J, Schuller B, Ringeval F, Lalanne D, Torres Torres M, Scherer S, Stratou G, Cowie R. AVEC 2016: depression, mood, and emotion recognition workshop and challenge. In: Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge-AVEC '16. Amsterdam, The Netherlands, New York, ACM Press, 2016 DOI:10.1145/2988257.2988258
[6] Le D, Provost E M. Emotion recognition from spontaneous speech using hidden Markov models with deep belief networks. In: Workshop on Automatic Speech Recognition and Understanding (ASRU). IEEE, 2013
[7] Cardinal P, Dehak N, Koerich A L, Alam J, Boucher P. ETS system for AV+EC 2015 challenge. In: Proceedings of the 5th International Workshop on Audio/Visual Emotion Challenge-AVEC '15. Brisbane, Australia, New York, ACM Press, 2015 DOI:10.1145/2808196.2811639
[8] Povolny F, Matejka P, Hradis M, Popková A, Otrusina L, Smrz P, Wood I, Robin C, Lamel L. Multimodal emotion recognition for AVEC 2016 challenge. In: Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge-AVEC '16. Amsterdam, The Netherlands, New York, ACM Press, 2016 DOI:10.1145/2988257.2988268
[9] Han J, Zhang Z X, Schmitt M, Ren Z, Ringeval F, Schuller B. Bags in bag: generating context-aware bags for tracking emotions from speech. In: Interspeech 2018. ISCA, 2018 DOI:10.21437/interspeech.2018-996
[10] Hermansky H, Sharma S. TRAPS-classifiers of temporal patterns. In: Proceedings of International Conference on Spoken Language Processing. 1998
[11] Chen B, Zhu Q, Morgan N. Learning long-term temporal features in LVCSR using neural networks. In: Proceedings of International Conference on Spoken Language Processing. 2004
[12] Zeyer A, Irie K, Schlüter R, Ney H. Improved training of end-to-end attention models for speech recognition. In: Interspeech 2018. ISCA, 2018 DOI:10.21437/interspeech.2018-1616
[13] Wöllmer M, Kaiser M, Eyben F, Schuller B, Rigoll G. LSTM-Modeling of continuous emotions in an audiovisual affect recognition framework. Image and Vision Computing, 2013, 31(2): 153–163 DOI:10.1016/j.imavis.2012.03.001
[14] Khorram S, Aldeneh Z, Dimitriadis D, McInnis M, Provost E M. Capturing long-term temporal dependencies with convolutional networks for continuous emotion recognition. In: Interspeech 2017. ISCA, 2017 DOI:10.21437/interspeech.2017-548
[15] Zhang T, Zheng W M, Cui Z, Zong Y, Li Y. Spatial–temporal recurrent neural network for emotion recognition. IEEE Transactions on Cybernetics, 2019, 49(3): 839–847 DOI:10.1109/tcyb.2017.2788081
[16] Li P C, Song Y, McLoughlin I, Guo W, Dai L R. An attention pooling based representation learning method for speech emotion recognition. In: Interspeech 2018. ISCA, 2018 DOI:10.21437/interspeech.2018-1242
[17] Trigeorgis G, Ringeval F, Brueckner R, Marchi E, Zafeiriou S. Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network. In: IEEE International Conference on Acoustics. IEEE, 2016
[18] Huang J, Li Y, Tao J H, Lian Z, Niu M Y, Yang M H. Multimodal continuous emotion recognition with data augmentation using recurrent neural networks. In: Proceedings of the 2018 on Audio/Visual Emotion Challenge and Workshop-AVEC'18. Seoul, Republic of Korea, New York, ACM Press, 2018 DOI:10.1145/3266302.3266304
[19] Myer S, Tomar V S. Efficient keyword spotting using time delay neural networks. In: Interspeech 2018. ISCA, 2018 DOI:10.21437/interspeech.2018-1979
[20] Yu A W, Lee H, Le Q V. Learning to skim text. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. 2017
[21] Seo M, Min S, Farhadi A. Neural speed reading via skim-RNN. In: ICLR. 2018
[22] Campos V, Jou B, Giro-i-Nieto X, Torres J, Chang S. Skip RNN: learning to skip state updates in recurrent neural networks. In: ICLR. 2017
[23] Aldeneh Z, Provost E M. Using regional saliency for speech emotion recognition. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2017
[24] Ringeval F, Pantic M, Schuller B, Valstar M, Gratch J, Cowie R, Scherer S, Mozgai S, Cummins N, Schmitt M. AVEC 2017: real-life depression, and affect recognition workshop and challenge. In: Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge-AVEC '17. Mountain View, California, USA, New York, ACM Press, 2017 DOI:10.1145/3133944.3133953
[25] Eyben F, Scherer K R, Schuller B W, Sundberg J, Andre E, Busso C, Devillers L Y, Epps J, Laukka P, Narayanan S S, Truong K P. The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing. IEEE Transactions on Affective Computing, 2016, 7(2): 190–202 DOI:10.1109/taffc.2015.2457417
[26] Eyben F, Wöllmer M, Schuller B. Opensmile: the Munich versatile and fast open-source audio feature extractor. In: Proceedings of the International Conference on Multimedia-MM '10. Firenze, Italy, New York, ACM Press, 2010 DOI:10.1145/1873951.1874246
[27] Baltrusaitis T, Robinson P, Morency L P. OpenFace: an open source facial behavior analysis toolkit. In: Proceedings of IEEE Winter Conference on Applications of Computer Vision. IEEE, 2016
[28] Kingma D P, Ba J. Adam: a method for stochastic optimization. In: Proceedings of 3rd International Conference for Learning Representations. San Diego, 2015
[29] Lin L I K. A concordance correlation coefficient to evaluate reproducibility. Biometrics, 1989, 45(1): 255 DOI:10.2307/2532051