The Geneva Minimalistic Acoustic Parameter
Set (GeMAPS) for Voice Research and
Affective Computing
Florian Eyben, Klaus R. Scherer, Björn W. Schuller, Johan Sundberg, Elisabeth André, Carlos Busso,
Laurence Y. Devillers, Julien Epps, Petri Laukka, Shrikanth S. Narayanan, and Khiet P. Truong
Abstract—Work on voice sciences over recent decades has led to a proliferation of acoustic parameters that are used quite selectively
and are not always extracted in a similar fashion. With many independent teams working in different research areas, shared standards
become an essential safeguard to ensure compliance with state-of-the-art methods allowing appropriate comparison of results across
studies and potential integration and combination of extraction and recognition systems. In this paper we propose a basic standard
acoustic parameter set for various areas of automatic voice analysis, such as paralinguistic or clinical speech analysis. In contrast to a
large brute-force parameter set, we present a minimalistic set of voice parameters here. These were selected based on a) their
potential to index affective physiological changes in voice production, b) their proven value in former studies as well as their automatic
extractability, and c) their theoretical significance. The set is intended to provide a common baseline for evaluation of future research
and eliminate differences caused by varying parameter sets or even different implementations of the same parameters. Our
implementation is publicly available with the openSMILE toolkit. Comparative evaluations of the proposed feature set and large
baseline feature sets of INTERSPEECH challenges show a high performance of the proposed set in relation to its size.
Index Terms—Affective computing, acoustic features, standard, emotion recognition, speech analysis, Geneva minimalistic parameter set
1 INTRODUCTION
Interest in the vocal expression of different affect states
has a long history with researchers working in various
fields of research ranging from psychiatry to engineering.
Psychiatrists have been attempting to diagnose affective
states. Psychologists and communication researchers have
been exploring the capacity of the voice to carry signals of
emotion. Linguists and phoneticians have been discovering
the role of affective pragmatic information in language pro-
duction and perception. More recently, computer scientists
and engineers have been attempting to automatically recog-
nize and manipulate speaker attitudes and emotions to ren-
der information technology more accessible and credible for
human users. Much of this research and development uses
the extraction of acoustic parameters from the speech signal
as a method to understand the patterning of the vocal expres-
sion of different emotions and other affective dispositions
and processes. The underlying theoretical assumption is that
affective processes differentially change autonomic arousal
and the tension of the striate musculature and thereby affect
voice and speech production on the phonatory and articula-
tory level and that these changes can be estimated by differ-
ent parameters of the acoustic waveform [1].
Emotional cues conveyed in the voice have been empiri-
cally documented recently by the measurement of emotion-
differentiating parameters related to subglottal pressure,
transglottal airflow, and vocal fold vibration ([2], [3], [4], [5],
[6], [7], [8]). Mostly based on established procedures in pho-
netics and speech sciences to measure different aspects of
phonation and articulation in speech, researchers have used
a large number of acoustic parameters (see [9]; [10], for
overviews), including parameters in the Time domain (e.g.,
speech rate), the Frequency domain (e.g., fundamental fre-
quency (F0) or formant frequencies), the Amplitude domain
(e.g., intensity or energy), and the Spectral Energy domain
(e.g., relative energy in different frequency bands). Not
all of these parameters have been standardized in terms of
F. Eyben is with audEERING UG, Gilching, Germany, Technische Universität München, Germany, and the Swiss Centre for Affective Sciences, Geneva, Switzerland. E-mail: fe@audeering.com.
K. R. Scherer is with the Swiss Centre for Affective Sciences and Université de Genève, Geneva, Switzerland, and the University of Munich, Munich, Germany. E-mail: Klaus.Scherer@unige.ch.
B. W. Schuller is with the Chair of Complex & Intelligent Systems, University of Passau, Passau, Germany, the Department of Computing, Imperial College London, U.K., audEERING UG, Gilching, Germany, and the Swiss Centre for Affective Sciences, Geneva, Switzerland. E-mail: schuller@tum.de.
J. Sundberg is with KTH Royal Institute of Technology, Stockholm, Sweden. E-mail: pjohan@speech.kth.se.
E. André is with the Faculty of Applied Computer Science, Universität Augsburg, Augsburg, Germany. E-mail: andre@informatik.uni-augsburg.de.
C. Busso is with the Department of Electrical Engineering, University of Texas, Dallas, TX, USA. E-mail: busso@utdallas.edu.
L. Y. Devillers is with University of Paris-Sorbonne IV and CNRS/LIMSI, Paris, France. E-mail: devil@limsi.fr.
J. Epps is with the University of New South Wales, Sydney, Australia, and NICTA ATP Laboratory, Eveleigh, Australia. E-mail: j.epps@unsw.edu.au.
P. Laukka is with Stockholm University, Stockholm, Sweden. E-mail: petri.laukka@psychology.su.se.
S. S. Narayanan is with SAIL, University of Southern California, Los Angeles, CA, USA. E-mail: shri@sipi.usc.edu.
K. P. Truong is with the Department of Human Media Interaction, University of Twente, Enschede, The Netherlands. E-mail: k.p.truong@utwente.nl.
Manuscript received 17 Nov. 2014; accepted 2 June 2015. Date of publication
15 July 2015; date of current version 6 June 2016.
Recommended for acceptance by K. Hirose.
For information on obtaining reprints of this article, please send e-mail to:
reprints@ieee.org, and reference the Digital Object Identifier below.
Digital Object Identifier no. 10.1109/TAFFC.2015.2457417
This work is licensed under a Creative Commons Attribution 3.0 License. For more information, see http://creativecommons.org/licenses/by/3.0/

their exact computation and thus results reported in the lit-
erature cannot always be easily compared. Even where
parameters have been extracted using widely used tools
like Praat [11], the exact settings used are not usually easily
and publicly accessible. Furthermore, different studies often
use sets of acoustic features that overlap only partially,
again rendering comparison of results across studies
exceedingly difficult and thus endangering the cumulation
of empirical evidence. The recent use of machine learning
algorithms for the recognition of affective states in speech
has led to a proliferation in the variety and quantity of
acoustic features employed, amounting often to several
thousand basic (low-level) and derived (functionals) param-
eters (e.g., [12]). While this profusion of parameters makes it
possible to capture many acoustic characteristics in a comprehensive
and reliable manner, it comes at the cost of serious diffi-
culties in the interpretation of the underlying mechanisms.
However, applications such as the fine grained control of
emotionality in speech synthesis (cf. [13], [14]), or dimen-
sional approaches to emotion and mental state recognition
that seek to quantify arousal, valence or depression severity,
for example, along a single axis, all require a deeper under-
standing of the mechanism of production and perception of
emotion in humans. To reach this understanding, finding
and interpreting relevant acoustic parameters is crucial.
Thus, based on many previous findings in the area of
speech and voice analysis (e.g., [2], [9], [15], [16], [17], [18],
[19]), in this article the authors present a recommendation
for a minimalistic standard parameter set for the acoustic
analysis of speech and other vocal sounds. This standard
set is intended to encourage researchers in this area to adopt
it as a baseline and use it alongside any specific parameters
of particular interest to individual researchers or groups, to
allow replication of findings, comparison between studies
from different laboratories, and greater cumulative insight
from the efforts of different laboratories on vocal concomi-
tants of affective processes.
Moreover, large brute-forced feature sets are well known
to foster over-adaptation of classifiers to the training data in
machine learning problems, reducing their generalisation
capabilities to unseen (test) data (cf. [20]). Minimalistic
parameter sets might reduce this danger and lead to better
generalisation in cross-corpus experiments and ultimately
in real-world test scenarios. Further, as mentioned above,
the interpretation of the meaning of the parameters in a
minimalistic set is much easier than in large brute-forced
sets, where this is nearly impossible.
The remainder of this article is structured as follows: First,
Section 2 provides a brief overview of acoustic analyses in the
fields of psychology, phonetics, acoustics, and engineering
which are the basis of the recommendation proposed in this
article; next, in Section 3 we give a detailed description of the
acoustic parameters contained in the recommended parame-
ter set and the implementation thereof. The parameter set is
extensively evaluated on six well-known affective speech
databases and the classification performance is compared to
all high-dimensional brute-forced sets of the INTERSPEECH
Challenges on Emotion and Paralinguistics from 2009 to 2013
in Section 4. Final remarks on the parameters recommended
in this article and the classification performance relative to
other established sets as well as a discussion on the direction
of future research in this field are given in Section 5.
2 RELATED WORK
The minimalistic feature set proposed in this article is not
the first joint attempt to standardise acoustic parameter sets.
The CEICES initiative [21], for example, brought researchers
together who were working on identification of emotional
states from the voice. They combined the acoustic parame-
ters they had used in their individual work in a systematic
way in order to create large, brute-forced parameter sets,
and thereby identify individual parameters by a unique
naming (code) scheme. However, the exact implementation
of the individual parameters was not well standardised.
CEICES was a more engineering-driven “collector” appro-
ach where parameters which were successful in classifica-
tion experiments were all included, while GeMAPS is a
more interdisciplinary attempt to agree on a minimalistic
parameter set based on multi-source, interdisciplinary
evidence and the theoretical significance of a few parameters.
Related programs for computation of acoustic parame-
ters, which are used by both linguists and computer science
researchers, include the popular Praat toolkit [11] or
Wavesurfer (http://www.speech.kth.se/wavesurfer/).
This section gives a literature overview on studies where
parameters that form the basis of our recommendation have
been proposed and used for voice analysis and related
fields.
An early survey [15] and a recent overview [17] nicely
summarise a few decades of psychological literature on
affective speech research and conclude from the empirical
data presented that intensity (loudness), F0 (fundamental
frequency) mean, variability, and range, as well as the high
frequency content/energy of a speech signal show correla-
tions with prototypical vocal affective expressions such as
stress (Intensity, F0 mean), anger and sadness (all parame-
ters), and boredom (F0 variability and range), for example.
Further, speech and articulation rate was found to be impor-
tant for all emotional expressions. For the case of automatic
arousal recognition, [22] successfully builds an unsuper-
vised recognition framework with these descriptors.
Hammerschmidt and Jürgens [16] perform acoustic anal-
ysis of various fundamental frequency and harmonics
related parameters on a small set of emotional speech utter-
ances. The findings confirm that parameters related to F0
and spectral distribution are important cues to affective
speech content. Hammerschmidt and Jürgens [16] introduce
a ratio of the peak frequency to the fundamental frequency,
and use spectral roll-off points (called distribution of fre-
quency—DFB—there). More recently, [18] also validate the
discriminatory power of amplitude, pitch, and spectral pro-
file (tilt, balance, distribution) parameters for a larger set of
vocal emotional expressions.
Most studies, such as the two previously mentioned, deal
with the analysis of acoustic arousal and report fairly con-
sistent parameters which are cues to vocal arousal (nicely
summarised by [17]). The original findings that prosodic
parameters (F0 and intensity) are relevant for arousal have
been confirmed in many similar studies, such as [4], and
more automatic, machine learning based parameter evalua-
tion studies such as [23]. Regarding energy/intensity, [24]
shows that a loudness measure, in which the signal energy
in various frequency bands is weighted according to the
human-hearing’s frequency sensitivity, is better correlated
to vocal affect dimensions than the simple signal energy
alone. Further, it is shown there that spectral flux has the
best overall correlation of any single feature.
Recent work, such as [17] and [25], has dealt with other
dimensions besides arousal—in particular valence (both)
and the level of interest (LOI) [25]. For valence both of these
studies conclude that spectral shape parameters could be
important cues for vocal valence. Also, rhythm related
parameters, such as speaking rate are correlated with
valence. Tahon and Devillers [26] confirm the importance
of various spectral band energies, spectral slope, overall
intensity, and the variance of the fundamental frequency,
for the detection of angry speech. These parameters were
also reported to be important for cognitive load [27] and
psychomotor retardation [28].
Eyben et al. [25] also show a large importance of cepstral
parameters (Mel-Frequency-Cepstral-Coefficients—MFCC),
especially for LOI. These are closely related to spectral
shape parameters. The lower order MFCCs, especially, resemble
spectral tilt (slope) measures to some extent, over the full
range of the spectrum (first coefficient) or in various
smaller sub-bands (second and higher coefficients). The rele-
vance of spectral slope and shape is also investigated and
confirmed by [29], for example, and by [30] and [31].
In contrast to the findings in [15], for example, [25] sug-
gests that the relative importance of prosodic parameters as
well as voice quality parameters decreases in the case of
degraded audio conditions (background noise, reverbera-
tion), while the relative importance of spectral shape param-
eters increases. This is likely due to degraded accuracy in the
estimation of the prosodic parameters such as due to interfer-
ing harmonics or energy contributed by the noise compo-
nents. Overall, we believe that the lower order MFCC are
important to consider for various tasks and thus we include
MFCC 1-4 in the parameter set proposed in this article.
For automatic classification, large-scale brute-force acous-
tic parameter sets are used (e.g., [12], [32], [33], [34]). These
contain parameters which are easily and reliably computable
from acoustic signals. The general tendency in most studies
is that larger parameter sets perform better [34]. This might
be due to the fact that in larger feature sets the ‘right’ features
are more likely present, or due to the fact that the combina-
tion of all features is necessary. Another reason might be that
with this many parameters (over 6,000 in some cases), the
machine learning methods simply over-adapt to the (rather)
small training data-sets. This is evident especially in cross-
corpus classification experiments, where the large feature
sets show poorer performance despite their higher perfor-
mance in intra-corpus evaluations [20]. As noted above, it is thus our
aim in this article to select relevant parameters, guided by
the findings of previous, related studies.
Besides vocal emotional expressions, there are numerous
other studies which deal with other vocal phenomena and
find similar and very related features to be important. [27], for
example, shows the importance of vowel-based formant fre-
quency statistics, and [5], for example, shows the usefulness
of glottal features when combined with prosodic features for
identification of depression in speech. Voice source features,
in particular the harmonic difference H1-H2, showed a con-
sistent decrease with increasing cognitive load, based on a
study employing manually corrected pitch estimates [35].
Recently, researchers have attempted to analyse further para-
linguistic characteristics of speech, ranging from age and gen-
der [36], to cognitive and physical load [37], for example.
Many automatically extracted brute-force parameter sets
neglect formant parameters due to difficulties in extracting
them reliably. For voice research and automatic classifica-
tion, they are very important though. Formants have been
shown to be sensitive to many forms of emotion and mental state,
and they give approximately state-of-the-art cognitive
load classification results [27] and depression recognition
and assessment results [31], [38], and can provide competi-
tive emotion recognition performance [39] with a fraction of
the feature dimension of other systems. A basic set of for-
mant related features is thus included in our proposed set.
Due to the proven high importance of the fundamental
frequency (cf. [6]) and amplitude/intensity, a robust funda-
mental frequency measure and a pseudo-auditory loudness
measure are included in our proposed set. A wide variety
of statistics are applied to both parameters over time, in
order to capture distributional changes. To robustly repre-
sent the high frequency content and the spectral balance,
the descriptors alpha ratio, Hammarberg index, and spec-
tral slope are considered in this article. The vocal timbre is
encoded by Mel-Frequency Cepstral Coefficients, and the
quality of the vocal excitation signal by the period-to-period
jitter and shimmer of F0. To allow for vowel-based voice
research, and due to their proven relevance for certain tasks,
formant parameters are also included in the set.
3 ACOUSTIC PARAMETER RECOMMENDATION
The recommendation presented here has been conceived at
an interdisciplinary meeting of voice and speech scientists
in Geneva2 and further developed at Technische Universität
München (TUM). The choice of parameters has been guided
(and is justified) by three criteria: 1) the potential of an
acoustic parameter to index physiological changes in voice
production during affective processes, 2) the frequency and
success with which the parameter has been used in the past
literature (see Section 2), and 3) its theoretical significance
(see [1], [2]).
Two versions of the acoustic parameter set recommenda-
tion are proposed here: a minimalistic set of parameters,
which implements prosodic, excitation, vocal tract, and
spectral descriptors found to be most important in previous
work of the authors, and an extension to the minimalistic
set, which contains a small set of cepstral descriptors,
which—from the literature (e.g., [40])—are consistently
known to increase the accuracy of automatic affect recogni-
tion over a pure prosodic and spectral parameter set. Sev-
eral studies on automatic parameter selection, such as [23],
[24], suggest that the lower order MFCCs are more
important for affect and paralinguistic voice analysis tasks.
When looking at the underlying Discrete Cosine Transform (DCT-II)
basis functions used when computing MFCCs, it is evident that the
lower order MFCCs are related to spectral tilt and the overall
distribution of spectral energy. Higher order MFCCs would reflect
more fine-grained energy distributions, which are presumably more
important for identifying phonetic content than for non-verbal
voice attributes.

2. Conference organised by K. Scherer, B. Schuller, and J. Sundberg on September 1–2, 2013 at the Swiss Center of Affective Sciences in Geneva on Measuring affect and emotion in vocal communication via acoustic feature extraction: State of the art, current research, and benchmarking, with the explicit aim of commonly working towards a recommendation for a reference set of acoustic parameters to be broadly used in the field.
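As a concrete illustration of the relation between the lower order MFCCs and spectral tilt noted above, the sketch below (not part of the original recommendation; the mel filter bank output is simulated and all variable names are ours) computes MFCC 1-4 for one frame as a DCT-II of log-mel band energies and prints the first DCT-II basis function, which is a single slow cosine ramp over the mel bands, i.e., a tilt-like weighting of low versus high frequency energy.

```python
import numpy as np
from scipy.fftpack import dct

# Illustrative only: 26 log-mel band energies for a single frame (in practice
# these come from a mel filter bank applied to the frame's power spectrum).
rng = np.random.default_rng(0)
log_mel = np.log(rng.uniform(1e-3, 1.0, size=26))

# DCT-II over the mel bands; coefficients 1-4 correspond to the MFCC 1-4
# used here (coefficient 0 mainly reflects overall frame energy).
mfcc = dct(log_mel, type=2, norm="ortho")
mfcc_1_4 = mfcc[1:5]

# The k-th DCT-II basis function is cos(pi * k * (n + 0.5) / N), n = 0..N-1.
# For k = 1 it is half a cosine period over the bands: a smooth low-vs-high
# frequency contrast, i.e., a spectral-tilt-like weighting.
N = len(log_mel)
n = np.arange(N)
basis_1 = np.cos(np.pi * 1 * (n + 0.5) / N)

print("MFCC 1-4:", mfcc_1_4)
print("first basis, start/end:", basis_1[:3], basis_1[-3:])
```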
To encourage rapid community discussion on the param-
eter sets, as well as updates and additions from the commu-
nity, a wiki-page (http://www.audeering.com/research/gemaps) has been set up, where researchers can
quickly connect and discuss issues with the parameter set.
New ideas, if they are favoured by multiple contributors,
will then be implemented and after a certain number of
improvements or after a certain time frame, new versions of
the parameter sets will be released publicly.
In the following sections, we first give an overview of
the minimalistic parameter recommendation (Section 3.1),
and the extended parameter set (Section 3.2), before describ-
ing details of the algorithms used to compute the parame-
ters in Section 6.1.
3.1 Minimalistic Parameter Set
The minimalistic acoustic parameter set contains the follow-
ing compact set of 18 low-level descriptors (LLD), sorted by
parameter groups:
Frequency related parameters:
Pitch, logarithmic F0 on a semitone frequency scale, starting at 27.5 Hz (semitone 0).
Jitter, deviations in individual consecutive F0 period lengths.
Formant 1, 2, and 3 frequency, centre frequency of the first, second, and third formant.
Formant 1 bandwidth, bandwidth of the first formant.
Energy/Amplitude related parameters:
Shimmer, difference of the peak amplitudes of consecutive F0 periods.
Loudness, estimate of perceived signal intensity from an auditory spectrum.
Harmonics-to-noise ratio (HNR), relation of energy in harmonic components to energy in noise-like components.
Spectral (balance) parameters (several of these are illustrated in the sketch after this list):
Alpha Ratio, ratio of the summed energy from 50-1000 Hz and 1-5 kHz.
Hammarberg Index, ratio of the strongest energy peak in the 0-2 kHz region to the strongest peak in the 2-5 kHz region.
Spectral Slope 0-500 Hz and 500-1500 Hz, linear regression slope of the logarithmic power spectrum within the two given bands.
Formant 1, 2, and 3 relative energy, ratio of the energy of the spectral harmonic peak at the first, second, and third formant's centre frequency to the energy of the spectral peak at F0.
Harmonic difference H1-H2, ratio of the energy of the first F0 harmonic (H1) to the energy of the second F0 harmonic (H2).
Harmonic difference H1-A3, ratio of the energy of the first F0 harmonic (H1) to the energy of the highest harmonic in the third formant range (A3).
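To make the verbal definitions above concrete, the following sketch computes the semitone conversion of F0 and, for a single frame, the Alpha Ratio, Hammarberg Index, and spectral slope from a power spectrum. It follows the descriptions in this list only; the band limits are as stated above, the toy spectrum is random, and the exact openSMILE implementation may differ in windowing, normalisation, and peak picking.

```python
import numpy as np

def semitone_f0(f0_hz, base_hz=27.5):
    """Logarithmic F0 on a semitone scale starting at 27.5 Hz (semitone 0)."""
    return 12.0 * np.log2(f0_hz / base_hz)

def band_energy(power_spec, freqs, lo, hi):
    """Summed spectral energy between lo (inclusive) and hi (exclusive), in Hz."""
    band = (freqs >= lo) & (freqs < hi)
    return power_spec[band].sum()

def alpha_ratio(power_spec, freqs):
    """Ratio of summed energy in 50-1000 Hz to summed energy in 1-5 kHz."""
    return band_energy(power_spec, freqs, 50, 1000) / band_energy(power_spec, freqs, 1000, 5000)

def hammarberg_index(power_spec, freqs):
    """Ratio of the strongest peak in 0-2 kHz to the strongest peak in 2-5 kHz."""
    low = power_spec[(freqs >= 0) & (freqs < 2000)].max()
    high = power_spec[(freqs >= 2000) & (freqs < 5000)].max()
    return low / high

def spectral_slope(power_spec, freqs, lo, hi):
    """Linear regression slope of the logarithmic power spectrum within [lo, hi)."""
    band = (freqs >= lo) & (freqs < hi)
    return np.polyfit(freqs[band], np.log(power_spec[band] + 1e-12), 1)[0]

# Toy single frame: random power spectrum, 16 kHz sampling rate, 512-point FFT.
rng = np.random.default_rng(0)
freqs = np.fft.rfftfreq(512, d=1.0 / 16000)
power_spec = rng.uniform(1e-4, 1.0, size=freqs.shape)

print(semitone_f0(220.0))                         # 220 Hz -> 36 semitones above 27.5 Hz
print(alpha_ratio(power_spec, freqs))
print(hammarberg_index(power_spec, freqs))
print(spectral_slope(power_spec, freqs, 0, 500))  # slope of the 0-500 Hz band
```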
All LLD are smoothed over time with a symmetric moving average filter
3 frames long (for pitch, jitter, and shimmer, the smoothing is only
performed within voiced regions, i.e., the transitions from 0 (unvoiced)
to non-zero (voiced) values are not smoothed). Arithmetic mean and
coefficient of variation (standard deviation normalised by the arithmetic
mean) are applied as functionals to all 18 LLD, yielding 36 parameters.
To loudness and pitch the following 8 functionals are additionally applied:
20th, 50th, and 80th percentile, the range from the 20th to the 80th
percentile, and the mean and standard deviation of the slope of
rising/falling signal parts. All functionals are applied to voiced regions
only (non-zero F0), with the exception of the functionals applied to
loudness, which are computed over all regions. This gives a total of 52
parameters. Also, the arithmetic mean of the Alpha Ratio, the Hammarberg
Index, and the spectral slopes from 0-500 Hz and 500-1500 Hz over all
unvoiced segments are included, totalling 56 parameters. In
addition, six temporal features are included:
the rate of loudness peaks, i.e., the number of loudness peaks per second,
the mean length and the standard deviation of continuously voiced regions (F0 > 0),
the mean length and the standard deviation of unvoiced regions (F0 = 0; approximating pauses),
the number of continuous voiced regions per second (pseudo syllable rate).
No minimal length is imposed on voiced or unvoiced
regions, i.e., in the extreme case they could be only one
frame long. The Viterbi-based smoothing of the F0 contour,
however, effectively prevents single voiced frames that arise
from detection errors. In total, 62 parameters are con-
tained in the Geneva Minimalistic Standard Parameter Set.
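To illustrate how these functionals condense a frame-level contour into clip-level parameters, the following sketch applies the arithmetic mean, coefficient of variation, 20th/50th/80th percentiles, and their range to the voiced frames of a toy F0 contour, and computes the mean length of continuously voiced and unvoiced regions. It restates the verbal description above with illustrative names and a synthetic contour; it is not the openSMILE implementation.

```python
import numpy as np

def functionals(contour, voiced):
    """Summarise a frame-level contour over voiced frames only (F0 > 0)."""
    x = contour[voiced]
    p20, p50, p80 = np.percentile(x, [20, 50, 80])
    return {
        "mean": x.mean(),
        # coefficient of variation: standard deviation normalised by the mean
        "coeff_of_variation": x.std() / x.mean(),
        "percentile_20": p20,
        "percentile_50": p50,
        "percentile_80": p80,
        "pctl_range_20_80": p80 - p20,
    }

def region_lengths(mask):
    """Lengths (in frames) of runs of True in a boolean mask."""
    lengths, run = [], 0
    for v in mask:
        if v:
            run += 1
        elif run:
            lengths.append(run)
            run = 0
    if run:
        lengths.append(run)
    return np.array(lengths)

# Toy F0 contour in semitones; 0 marks unvoiced frames.
f0 = np.array([0, 0, 30.1, 30.5, 31.0, 30.8, 0, 0, 29.7, 30.2, 30.9, 0])
voiced = f0 > 0

print(functionals(f0, voiced))
print("mean voiced-region length:", region_lengths(voiced).mean())
print("mean unvoiced-region length:", region_lengths(~voiced).mean())
```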
3.2 Extended Parameter Set
The minimalistic set does not contain any cepstral parame-
ters and only very few dynamic parameters (i.e., it contains
no delta regression coefficients and no difference features;
only the slopes of rising and falling F0 and loudness seg-
ments encapsulate some dynamic information). Further,
especially cepstral parameters have proven highly success-
ful in modelling of affective states, e.g., by [23], [40], [41].
Thus, an extension set to the minimalistic set is proposed
which contains the following seven LLD in addition to the
18 LLD in the minimalistic set:
Spectral (balance/shape/dynamics) parameters:
MFCC 1-4, Mel-Frequency Cepstral Coefficients 1-4.
Spectral flux, difference of the spectra of two consecutive frames (a sketch follows this list).
Frequency related parameters:
Formant 2-3 bandwidth, added for completeness of the Formant 1-3 parameters.
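The spectral flux named above can be written compactly; the helper below is a hedged sketch of one common definition (mean squared difference of successive magnitude spectra), not the exact openSMILE formula, whose normalisation may differ.

```python
import numpy as np

def spectral_flux(mag_prev, mag_cur):
    """Difference between the magnitude spectra of two consecutive frames,
    computed here as the mean squared difference (one common convention)."""
    return float(np.mean((np.asarray(mag_cur) - np.asarray(mag_prev)) ** 2))

# Toy example: two random magnitude spectra of 257 bins each.
rng = np.random.default_rng(0)
frame_a, frame_b = rng.uniform(0, 1, 257), rng.uniform(0, 1, 257)
print(spectral_flux(frame_a, frame_b))
```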
As functionals, the arithmetic mean and the coefficient of var-
iation are applied to all seven additional LLD over all
segments (voiced and unvoiced together), except for the for-
mant bandwidths to which the functionals are applied only
in voiced regions. This adds 14 extra descriptors. Addition-
ally, the arithmetic mean of the spectral flux in unvoiced
regions only, the arithmetic mean and coefficient of varia-
tion of the spectral flux and MFCC 1-4 in voiced regions
only are included. This results in another 11 descriptors.
Additionally, the equivalent sound level is included. This
results in 26 extra parameters. In total, when combined with
the Minimalistic Set, the extended Geneva Minimalistic Acous-
tic Parameter Set (eGeMAPS) contains 88 parameters.
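Both sets are distributed with openSMILE, as stated above. For orientation, the snippet below shows one way to extract the 88 eGeMAPS functionals with the opensmile Python wrapper released later by audEERING; the wrapper, its class names, and the feature-set enum values (e.g., eGeMAPSv02) are not part of this article and may differ across releases, so treat this as an assumed interface.

```python
import opensmile

# Extract clip-level functionals of the extended set (eGeMAPS, 88 parameters).
# FeatureSet.GeMAPSv01b would select the 62-parameter minimalistic set instead.
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)
features = smile.process_file("speech.wav")  # returns a pandas DataFrame
print(features.shape)  # one row per file; 88 columns expected for eGeMAPSv02
```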
4 BASELINE EVALUATION
The proposed minimalistic parameter set and the extended
set are both evaluated for the task of automatic recognition
in binary arousal and binary valence dimensions. The origi-
nal labels (mixed various categories and continuous dimen-
sional) of six standard databases of affective speech were
mapped to binary dimensional labels (Arousal/Valence), as
described in Section 4.2 in order to enable a fair comparison
of performances on these databases.
The original labels (cf. Section 4.1 for details on the data-
bases) are: Levels of Interest (TUM AVIC database), acted
speech emotions in the Geneva Multimodal Emotion Por-
trayals (GEMEP) corpus and the German Berlin Emotional
Speech database (EMO-DB), emotions portrayed in the sing-
ing voice of professional opera singers (GeSiE), valence in
childrens’ speech from the FAU AIBO corpus [42] as used
for the INTERSPEECH 2009 Emotion Challenge [43], as well
as real-life emotions from German talk-show recordings
(Vera-am-Mittag corpus (VAM)). The proposed minimal
sets are compared to five large-scale, brute-forced baseline
acoustic feature sets of the INTERSPEECH 2009 Emotion
Challenge [43] (384 parameters), the INTERSPEECH 2010
Paralinguistic Challenge [36] (1,582 parameters), the INTER-
SPEECH 2011 Speaker State Challenge [44] (4,368 parame-
ters), the INTERSPEECH 2012 Speaker Trait Challenge [45]
(6,125 parameters), and the INTERSPEECH 2013 Computa-
tional Paralinguistics ChallengE (ComParE) [12] set (6,373
parameters), which is also used for the INTERSPEECH 2014
Computational Paralinguistics ChallengE [37].
4.1 Data-Sets
4.1.1 FAU AIBO
FAU AIBO served as the official corpus for the world’s first
international Emotion Challenge [43]. It contains recordings
of children who are interacting with the Sony pet robot
Aibo. It thus contains spontaneous, German speech which is
emotionally coloured. The children were told that the Aibo
robot was responding to their voice commands regarding
directions. However, the robot was in fact controlled by a
human operator, who sometimes caused the robot to behave
disobediently in order to provoke strong emotional reactions
from the children. The recordings were performed at two
different schools, referred to as MONT and OHM, from 51
children in total (age 10-13, 21 males, 30 females; approx. 9.2
hours of speech without pauses). The recorded audio was
segmented automatically into speech turns with a speech-
pause threshold of 1 s. The data are labelled for emotional
expression on the word level. As given in [43] five emotion
class labels are used: anger, emphatic, neutral, positive, and
rest. For a two-class valence task, all negative emotions
(Anger and Emphatic—NEG) and all non-negative emo-
tions (Neutral, Positive, and Rest—IDL) are combined.
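This grouping can be stated explicitly; the mapping below merely restates the two-class valence task described above, with lower-case label strings chosen for illustration (the corpus distribution uses its own label encoding).

```python
# Five FAU AIBO emotion classes grouped into the binary valence task:
# negative (NEG) = Anger + Emphatic; non-negative/idle (IDL) = Neutral + Positive + Rest.
AIBO_TO_BINARY_VALENCE = {
    "anger": "NEG",
    "emphatic": "NEG",
    "neutral": "IDL",
    "positive": "IDL",
    "rest": "IDL",
}

def to_binary_valence(label: str) -> str:
    return AIBO_TO_BINARY_VALENCE[label.lower()]

print(to_binary_valence("Emphatic"))  # -> NEG
```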
4.1.2 TUM Audiovisual Interest Corpus (TUM-AVIC)
The TUM Audiovisual Interest Corpus contains audiovisual
recordings of spontaneous affective interactions with non-
restricted spoken content [46]. It was used as data-set for the
INTERSPEECH 2010 Paralinguistics Challenge [36]. In the
set-up, a product presenter walks a subject through a com-
mercial presentation. The language used is English, although
most of the product presenters were German native speakers.
The subjects were mainly from European and Asian national-
ities. 21 subjects (10 female) were recorded in the corpus.
The LOI is labelled for every sub-turn (which are found by
a manual pause based sub-division of speaker turns) in three
labels ranging from boredom (subject is bored with the con-
versation or the topic or both, she/he is very passive and
does not follow the conversation; also referred to as loi1),
over neutral (she/he follows and participates in the conversa-
tion but it cannot be judged whether she/he is interested in
or indifferent towards the topic; also referred to as loi2) to joy-
ful interaction (showing a strong desire of the subject to talk
and to learn more about the topic, i.e., he/she shows a high
interest in the discussion; also referred to as loi3). For the
evaluations here, all 3,002 phrases (sub-turns) as in [47] are
used—in contrast to the only 996 phrases with high inter-
labeller agreement as, e.g., employed in [46].
4.1.3 Berlin Emotional Speech Database
A very well known and widely used set to test the effective-
ness of automatic emotion classification is the Berlin Emo-
tional Speech Database, also commonly known as EMO-DB.
It was introduced by [48]. It contains sentences spoken in
the emotion categories anger, boredom, disgust, fear, joy,
neutrality, and sadness. The linguistic content is pre-
defined by ten German short sentences, which are emotion-
ally neutral, such as “Der Lappen liegt auf dem Eisschrank”
(The cloth is lying on the fridge.). Ten (five of them female)
professional actors speak 10 sentences in each of the seven
emotional states. While the whole set contains over 700
utterances, in a listening test only 494 phrases are labelled
as a minimum 60 percent naturally sounding and a mini-
mum 80 percent identifiable (with respect to the emotion)
by 20 people. A mean accuracy of 84.3 percent is achieved
for identification of the emotions by the subjects in the lis-
tening experiment on this reduced set of 494 utterances.
This set is used in most other studies related to this database
(cf. [47]), therefore, it is also adopted here.
4.1.4 The Geneva Multimodal Emotion Portrayals
The GEMEP corpus is a collection of 1,260 multimodal emo-
tion expressions enacted by ten French-speaking actors [49].
The list of emotions includes those most frequently encoun-
tered in the literature (e.g., anger, fear, joy, and sadness) as
well as more subtle variations of these categories (e.g., anger
versus irritation, and fear versus anxiety). Specifically, the
12 following emotions are considered, which are distributed
