The Geneva Minimalistic Acoustic Parameter
Set (GeMAPS) for Voice Research and
Affective Computing
Florian Eyben, Klaus R. Scherer, Björn W. Schuller, Johan Sundberg, Elisabeth André, Carlos Busso,
Laurence Y. Devillers, Julien Epps, Petri Laukka, Shrikanth S. Narayanan, and Khiet P. Truong
Abstract—Work on voice sciences over recent decades has led to a proliferation of acoustic parameters that are used quite selectively
and are not always extracted in a similar fashion. With many independent teams working in different research areas, shared standards
become an essential safeguard to ensure compliance with state-of-the-art methods allowing appropriate comparison of results across
studies and potential integration and combination of extraction and recognition systems. In this paper we propose a basic standard
acoustic parameter set for various areas of automatic voice analysis, such as paralinguistic or clinical speech analysis. In contrast to a
large brute-force parameter set, we present a minimalistic set of voice parameters here. These were selected based on a) their
potential to index affective physiological changes in voice production, b) their proven value in former studies as well as their automatic
extractability, and c) their theoretical significance. The set is intended to provide a common baseline for evaluation of future research
and eliminate differences caused by varying parameter sets or even different implementations of the same parameters. Our
implementation is publicly available with the openSMILE toolkit. Comparative evaluations of the proposed feature set and large
baseline feature sets of INTERSPEECH challenges show a high performance of the proposed set in relation to its size.
Index Terms—Affective computing, acoustic features, standard, emotion recognition, speech analysis, Geneva minimalistic parameter set
1 INTRODUCTION
Interest in the vocal expression of different affect states
has a long history with researchers working in various
fields of research ranging from psychiatry to engineering.
Psychiatrists have been attempting to diagnose affective
states. Psychologists and communication researchers have
been exploring the capacity of the voice to carry signals of
emotion. Linguists and phoneticians have been discovering
the role of affective pragmatic information in language pro-
duction and perception. More recently, computer scientists
and engineers have been attempting to automatically recog-
nize and manipulate speaker attitudes and emotions to ren-
der information technology more accessible and credible for
human users. Much of this research and development uses
the extraction of acoustic parameters from the speech signal
as a method to understand the patterning of the vocal expres-
sion of different emotions and other affective dispositions
and processes. The underlying theoretical assumption is that
affective processes differentially change autonomic arousal
and the tension of the striate musculature and thereby affect
voice and speech production on the phonatory and articula-
tory level and that these changes can be estimated by differ-
ent parameters of the acoustic waveform [1].
Emotional cues conveyed in the voice have been empiri-
cally documented recently by the measurement of emotion-
differentiating parameters related to subglottal pressure,
transglottal airflow, and vocal fold vibration ([2], [3], [4], [5],
[6], [7], [8]). Mostly based on established procedures in pho-
netics and speech sciences to measure different aspects of
phonation and articulation in speech, researchers have used
a large number of acoustic parameters (see [9]; [10], for
overviews), including parameters in the Time domain (e.g.,
speech rate), the Frequency domain (e.g., fundamental fre-
quency (F0) or formant frequencies), the Amplitude domain
(e.g., intensity or energy), and the Spectral Energy domain
(e.g., relative energy in different frequency bands). Not
all of these parameters have been standardized in terms of
F. Eyben is with audEERING UG, Gilching, Germany, Technische Universität München, Germany, and the Swiss Centre for Affective Sciences, Geneva, Switzerland. E-mail: fe@audeering.com.
K. R. Scherer is with the Swiss Centre for Affective Sciences and Université de Genève, Geneva, Switzerland, and the University of Munich, Munich, Germany. E-mail: Klaus.Scherer@unige.ch.
B. W. Schuller is with the Chair of Complex & Intelligent Systems, University of Passau, Passau, Germany, the Department of Computing, Imperial College London, U.K., audEERING UG, Gilching, Germany, and the Swiss Centre for Affective Sciences, Geneva, Switzerland. E-mail: schuller@tum.de.
J. Sundberg is with KTH Royal Institute of Technology, Stockholm, Sweden. E-mail: pjohan@speech.kth.se.
E. André is with the Faculty of Applied Computer Science, Universität Augsburg, Augsburg, Germany. E-mail: andre@informatik.uni-augsburg.de.
C. Busso is with the Department of Electrical Engineering, University of Texas, Dallas, TX, USA. E-mail: busso@utdallas.edu.
L. Y. Devillers is with University of Paris-Sorbonne IV and CNRS/LIMSI, Paris, France. E-mail: devil@limsi.fr.
J. Epps is with the University of New South Wales, Sydney, Australia, and NICTA ATP Laboratory, Eveleigh, Australia. E-mail: j.epps@unsw.edu.au.
P. Laukka is with Stockholm University, Stockholm, Sweden. E-mail: petri.laukka@psychology.su.se.
S. S. Narayanan is with SAIL, University of Southern California, Los Angeles, CA, USA. E-mail: shri@sipi.usc.edu.
K. P. Truong is with the Department of Human Media Interaction, University of Twente, Enschede, The Netherlands. E-mail: k.p.truong@utwente.nl.
Manuscript received 17 Nov. 2014; accepted 2 June 2015. Date of publication
15 July 2015; date of current version 6 June 2016.
Recommended for acceptance by K. Hirose.
For information on obtaining reprints of this article, please send e-mail to:
reprints@ieee.org, and reference the Digital Object Identifier below.
Digital Object Identifier no. 10.1109/TAFFC.2015.2457417
This work is licensed under a Creative Commons Attribution 3.0 License. For more information, see http://creativecommons.org/licenses/by/3.0/

their exact computation and thus results reported in the lit-
erature cannot always be easily compared. Even where
parameters have been extracted using widely used tools
like Praat [11], the exact settings used are not usually easily
and publicly accessible. Furthermore, different studies often
use sets of acoustic features that overlap only partially,
again rendering comparison of results across studies
exceedingly difficult and thus endangering the cumulation
of empirical evidence. The recent use of machine learning
algorithms for the recognition of affective states in speech
has led to a proliferation in the variety and quantity of
acoustic features employed, amounting often to several
thousand basic (low-level) and derived (functionals) param-
eters (e.g., [12]). While this profusion of parameters makes it
possible to capture many acoustic characteristics in a comprehensive
and reliable manner, it comes at the cost of serious diffi-
culties in the interpretation of the underlying mechanisms.
However, applications such as the fine grained control of
emotionality in speech synthesis (cf. [13], [14]), or dimen-
sional approaches to emotion and mental state recognition
that seek to quantify arousal, valence or depression severity,
for example, along a single axis, all require a deeper under-
standing of the mechanism of production and perception of
emotion in humans. To reach this understanding, finding
and interpreting relevant acoustic parameters is crucial.
Thus, based on many previous findings in the area of
speech and voice analysis (e.g., [2], [9], [15], [16], [17], [18],
[19]), in this article the authors present a recommendation
for a minimalistic standard parameter set for the acoustic
analysis of speech and other vocal sounds. This standard
set is intended to encourage researchers in this area to adopt
it as a baseline and use it alongside any specific parameters
of particular interest to individual researchers or groups, to
allow replication of findings, comparison between studies
from different laboratories, and greater cumulative insight
from the efforts of different laboratories on vocal concomi-
tants of affective processes.
Moreover, large brute-forced feature sets are well known
to foster over-adaptation of classifiers to the training data in
machine learning problems, reducing their generalisation
capabilities to unseen (test) data (cf. [20]). Minimalistic
parameter sets might reduce this danger and lead to better
generalisation in cross-corpus experiments and ultimately
in real-world test scenarios. Further, as mentioned above,
the interpretation of the meaning of the parameters in a
minimalistic set is much easier than in large brute-forced
sets, where this is nearly impossible.
The remainder of this article is structured as follows: First,
Section 2 provides a brief overview of acoustic analyses in the
fields of psychology, phonetics, acoustics, and engineering
which are the basis of the recommendation proposed in this
article; next, in Section 3 we give a detailed description of the
acoustic parameters contained in the recommended parame-
ter set and the implementation thereof. The parameter set is
extensively evaluated on six well-known affective speech
databases and the classification performance is compared to
all high-dimensional brute-forced sets of the INTERSPEECH
Challenges on Emotion and Paralinguistics from 2009 to 2013
in Section 4. Final remarks on the parameters recommended
in this article and the classification performance relative to
other established sets as well as a discussion on the direction
of future research in this field are given in Section 5.
2 RELATED WORK
The minimalistic feature set proposed in this article is not
the first joint attempt to standardise acoustic parameter sets.
The CEICES initiative [21], for example, brought researchers
together who were working on identification of emotional
states from the voice. They combined the acoustic parame-
ters they had used in their individual work in a systematic
way in order to create large, brute-forced parameter sets,
and thereby identify individual parameters by a unique
naming (code) scheme. However, the exact implementation
of the individual parameters was not well standardised.
CEICES was a more engineering-driven “collector” appro-
ach where parameters which were successful in classifica-
tion experiments were all included, while GeMAPS is a
more interdisciplinary attempt to agree on a minimalistic
parameter set based on multi-source, interdisciplinary
evidence and the theoretical significance of a few parameters.
Related programs for computation of acoustic parame-
ters, which are used by both linguists and computer science
researchers, include the popular Praat toolkit [11] or
Wavesurfer (http://www.speech.kth.se/wavesurfer/).
This section gives a literature overview on studies where
parameters that form the basis of our recommendation have
been proposed and used for voice analysis and related
fields.
An early survey [15] and a recent overview [17] nicely
summarise a few decades of psychological literature on
affective speech research and conclude from the empirical
data presented that intensity (loudness), F0 (fundamental
frequency) mean, variability, and range, as well as the high
frequency content/energy of a speech signal show correla-
tions with prototypical vocal affective expressions such as
stress (Intensity, F0 mean), anger and sadness (all parame-
ters), and boredom (F0 variability and range), for example.
Further, speech and articulation rate was found to be impor-
tant for all emotional expressions. For the case of automatic
arousal recognition, [22] successfully builds an unsuper-
vised recognition framework with these descriptors.
Hammerschmidt and Jürgens [16] perform acoustic anal-
ysis of various fundamental frequency and harmonics
related parameters on a small set of emotional speech utter-
ances. The findings confirm that parameters related to F0
and spectral distribution are important cues to affective
speech content. Hammerschmidt and Jürgens [16] introduce
a ratio of the peak frequency to the fundamental frequency,
and use spectral roll-off points (called distribution of fre-
quency—DFB—there). More recently, [18] also validate the
discriminatory power of amplitude, pitch, and spectral pro-
file (tilt, balance, distribution) parameters for a larger set of
vocal emotional expressions.
Most studies, such as the two previously mentioned, deal
with the analysis of acoustic arousal and report fairly con-
sistent parameters which are cues to vocal arousal (nicely
summarised by [17]). The original findings that prosodic
parameters (F0 and intensity) are relevant for arousal have
been confirmed in many similar studies, such as [4], and
more automatic, machine learning based parameter evalua-
tion studies such as [23]. Regarding energy/intensity, [24]
shows that a loudness measure, in which the signal energy
in various frequency bands is weighted according to the
human-hearing’s frequency sensitivity, is better correlated
to vocal affect dimensions than the simple signal energy
alone. Further, it is shown there that spectral flux has the
best overall correlation of any single feature.
Recent work, such as [17] and [25], has dealt with other
dimensions besides arousal—in particular valence (both)
and the level of interest (LOI) [25]. For valence both of these
studies conclude that spectral shape parameters could be
important cues for vocal valence. Also, rhythm related
parameters, such as speaking rate are correlated with
valence. Tahon and Devillers [26] confirm the importance
of various spectral band energies, spectral slope, overall
intensity, and the variance of the fundamental frequency,
for the detection of angry speech. These parameters were
also reported to be important for cognitive load [27] and
psychomotor retardation [28].
Eyben et al. [25] also show a large importance of cepstral
parameters (Mel-Frequency-Cepstral-Coefficients—MFCC),
especially for LOI. These are closely related to spectral
shape parameters. The lower order MFCCs, especially, resemble
spectral tilt (slope) measures to some extent, over the full
range of the spectrum (first coefficient) or in various
smaller sub-bands (second and higher coefficients). The rele-
vance of spectral slope and shape is also investigated and
confirmed by [29], for example, and by [30] and [31].
In contrast to the findings in [15], for example, [25] sug-
gests that the relative importance of prosodic parameters as
well as voice quality parameters decreases in the case of
degraded audio conditions (background noise, reverbera-
tion), while the relative importance of spectral shape param-
eters increases. This is likely due to degraded accuracy in the
estimation of the prosodic parameters such as due to interfer-
ing harmonics or energy contributed by the noise compo-
nents. Overall, we believe that the lower order MFCC are
important to consider for various tasks and thus we include
MFCC 1-4 in the parameter set proposed in this article.
For automatic classification, large-scale brute-force acous-
tic parameter sets are used (e.g., [12], [32], [33], [34]). These
contain parameters which are easily and reliably computable
from acoustic signals. The general tendency in most studies
is that larger parameter sets perform better [34]. This might
be due to the fact that in larger feature sets the ‘right’ features
are more likely present, or due to the fact that the combina-
tion of all features is necessary. Another reason might be that
with this many parameters (over 6,000 in some cases), the
machine learning methods simply over-adapt to the (rather)
small training data-sets. This is evident especially in cross-
corpus classification experiments, where the large feature
sets show poorer performance despite their higher perfor-
mance in intra-corpus evaluations [20]. As noted above, it is thus our
aim in this article to select relevant parameters, guided by
the findings of previous, related studies.
Besides vocal emotional expressions, there are numerous
other studies which deal with other vocal phenomena and
find similar and very related features to be important. [27], for
example, shows the importance of vowel-based formant fre-
quency statistics, and [5], for example, shows the usefulness
of glottal features when combined with prosodic features for
identification of depression in speech. Voice source features,
in particular the harmonic difference H1-H2, showed a con-
sistent decrease with increasing cognitive load, based on a
study employing manually corrected pitch estimates [35].
Recently, researchers have attempted to analyse further para-
linguistic characteristics of speech, ranging from age and gen-
der [36], to cognitive and physical load [37], for example.
Many automatically extracted brute-force parameter sets
neglect formant parameters due to difficulties in extracting
them reliably. For voice research and automatic classifica-
tion, they are very important though. Formants have been
shown to be sensitive to many forms of emotion and mental state,
and they give approximately state-of-the-art cognitive
load classification results [27] and depression recognition
and assessment results [31], [38], and can provide competi-
tive emotion recognition performance [39] with a fraction of
the feature dimension of other systems. A basic set of for-
mant related features is thus included in our proposed set.
Due to the proven high importance of the fundamental
frequency (cf. [6]) and amplitude/intensity, a robust funda-
mental frequency measure and a pseudo-auditory loudness
measure are included in our proposed set. A wide variety
of statistics are applied to both parameters over time, in
order to capture distributional changes. To robustly repre-
sent the high frequency content and the spectral balance,
the descriptors alpha ratio, Hammarberg index, and spec-
tral slope are considered in this article. The vocal timbre is
encoded by Mel-Frequency Cepstral Coefficients, and the
quality of the vocal excitation signal by the period-to-period
jitter and shimmer of F0. To allow for vowel-based voice
research, and due to their proven relevance for certain tasks,
formant parameters are also included in the set.
3 ACOUSTIC PARAMETER RECOMMENDATION
The recommendation presented here has been conceived at
an interdisciplinary meeting of voice and speech scientists
in Geneva2 and further developed at Technische Universität
München (TUM). The choice of parameters has been guided
(and is justified) by three criteria: 1) the potential of an
acoustic parameter to index physiological changes in voice
production during affective processes, 2) the frequency and
success with which the parameter has been used in the past
literature (see Section 2), and 3) its theoretical significance
(see [1], [2]).
Two versions of the acoustic parameter set recommenda-
tion are proposed here: a minimalistic set of parameters,
which implements prosodic, excitation, vocal tract, and
spectral descriptors found to be most important in previous
work of the authors, and an extension to the minimalistic
set, which contains a small set of cepstral descriptors,
which—from the literature (e.g., [40])—are consistently
known to increase the accuracy of automatic affect recogni-
tion over a pure prosodic and spectral parameter set. Sev-
eral studies on automatic parameter selection, such as [23],
[24], suggest that the lower order MFCCs are more
important for affect and paralinguistic voice analysis tasks.
When looking at the underlying Discrete Cosine Transform (DCT-II)
basis functions used when computing MFCCs, it is evident that the
lower order MFCCs are related to spectral tilt and the overall
distribution of spectral energy. Higher order MFCCs would reflect
more fine-grained energy distributions, which are presumably more
important for identifying phonetic content than for non-verbal
voice attributes.

2. Conference organised by K. Scherer, B. Schuller, and J. Sundberg on September 1–2, 2013 at the Swiss Center of Affective Sciences in Geneva on Measuring affect and emotion in vocal communication via acoustic feature extraction: State of the art, current research, and benchmarking, with the explicit aim of commonly working towards a recommendation for a reference set of acoustic parameters to be broadly used in the field.
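As a concrete illustration of the relation between the lower order MFCCs and spectral tilt noted above, the sketch below (not part of the original recommendation; the mel filter bank output is simulated and all variable names are ours) computes MFCC 1-4 for one frame as a DCT-II of log-mel band energies and prints the first DCT-II basis function, which is a single slow cosine ramp over the mel bands, i.e., a tilt-like weighting of low versus high frequency energy.

```python
import numpy as np
from scipy.fftpack import dct

# Illustrative only: 26 log-mel band energies for a single frame (in practice
# these come from a mel filter bank applied to the frame's power spectrum).
rng = np.random.default_rng(0)
log_mel = np.log(rng.uniform(1e-3, 1.0, size=26))

# DCT-II over the mel bands; coefficients 1-4 correspond to the MFCC 1-4
# used here (coefficient 0 mainly reflects overall frame energy).
mfcc = dct(log_mel, type=2, norm="ortho")
mfcc_1_4 = mfcc[1:5]

# The k-th DCT-II basis function is cos(pi * k * (n + 0.5) / N), n = 0..N-1.
# For k = 1 it is half a cosine period over the bands: a smooth low-vs-high
# frequency contrast, i.e., a spectral-tilt-like weighting.
N = len(log_mel)
n = np.arange(N)
basis_1 = np.cos(np.pi * 1 * (n + 0.5) / N)

print("MFCC 1-4:", mfcc_1_4)
print("first basis, start/end:", basis_1[:3], basis_1[-3:])
```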
To encourage rapid community discussion on the param-
eter sets, as well as updates and additions from the commu-
nity, a wiki-page (http://www.audeering.com/research/gemaps) has been set up, where researchers can
quickly connect and discuss issues with the parameter set.
New ideas, if they are favoured by multiple contributors,
will then be implemented and after a certain number of
improvements or after a certain time frame, new versions of
the parameter sets will be released publicly.
In the following sections, we first give an overview of
the minimalistic parameter recommendation (Section 3.1),
and the extended parameter set (Section 3.2), before describ-
ing details of the algorithms used to compute the parame-
ters in Section 6.1.
3.1 Minimalistic Parameter Set
The minimalistic acoustic parameter set contains the follow-
ing compact set of 18 low-level descriptors (LLD), sorted by
parameter groups:
Frequency related parameters:
Pitch, logarithmic F0 on a semitone frequency scale, starting at 27.5 Hz (semitone 0).
Jitter, deviations in individual consecutive F0 period lengths.
Formant 1, 2, and 3 frequency, centre frequency of the first, second, and third formant.
Formant 1 bandwidth, bandwidth of the first formant.
Energy/Amplitude related parameters:
Shimmer, difference of the peak amplitudes of consecutive F0 periods.
Loudness, estimate of perceived signal intensity from an auditory spectrum.
Harmonics-to-noise ratio (HNR), relation of energy in harmonic components to energy in noise-like components.
Spectral (balance) parameters (several of these are illustrated in the sketch after this list):
Alpha Ratio, ratio of the summed energy from 50-1000 Hz and 1-5 kHz.
Hammarberg Index, ratio of the strongest energy peak in the 0-2 kHz region to the strongest peak in the 2-5 kHz region.
Spectral Slope 0-500 Hz and 500-1500 Hz, linear regression slope of the logarithmic power spectrum within the two given bands.
Formant 1, 2, and 3 relative energy, ratio of the energy of the spectral harmonic peak at the first, second, and third formant's centre frequency to the energy of the spectral peak at F0.
Harmonic difference H1-H2, ratio of the energy of the first F0 harmonic (H1) to the energy of the second F0 harmonic (H2).
Harmonic difference H1-A3, ratio of the energy of the first F0 harmonic (H1) to the energy of the highest harmonic in the third formant range (A3).
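To make the verbal definitions above concrete, the following sketch computes the semitone conversion of F0 and, for a single frame, the Alpha Ratio, Hammarberg Index, and spectral slope from a power spectrum. It follows the descriptions in this list only; the band limits are as stated above, the toy spectrum is random, and the exact openSMILE implementation may differ in windowing, normalisation, and peak picking.

```python
import numpy as np

def semitone_f0(f0_hz, base_hz=27.5):
    """Logarithmic F0 on a semitone scale starting at 27.5 Hz (semitone 0)."""
    return 12.0 * np.log2(f0_hz / base_hz)

def band_energy(power_spec, freqs, lo, hi):
    """Summed spectral energy between lo (inclusive) and hi (exclusive), in Hz."""
    band = (freqs >= lo) & (freqs < hi)
    return power_spec[band].sum()

def alpha_ratio(power_spec, freqs):
    """Ratio of summed energy in 50-1000 Hz to summed energy in 1-5 kHz."""
    return band_energy(power_spec, freqs, 50, 1000) / band_energy(power_spec, freqs, 1000, 5000)

def hammarberg_index(power_spec, freqs):
    """Ratio of the strongest peak in 0-2 kHz to the strongest peak in 2-5 kHz."""
    low = power_spec[(freqs >= 0) & (freqs < 2000)].max()
    high = power_spec[(freqs >= 2000) & (freqs < 5000)].max()
    return low / high

def spectral_slope(power_spec, freqs, lo, hi):
    """Linear regression slope of the logarithmic power spectrum within [lo, hi)."""
    band = (freqs >= lo) & (freqs < hi)
    return np.polyfit(freqs[band], np.log(power_spec[band] + 1e-12), 1)[0]

# Toy single frame: random power spectrum, 16 kHz sampling rate, 512-point FFT.
rng = np.random.default_rng(0)
freqs = np.fft.rfftfreq(512, d=1.0 / 16000)
power_spec = rng.uniform(1e-4, 1.0, size=freqs.shape)

print(semitone_f0(220.0))                         # 220 Hz -> 36 semitones above 27.5 Hz
print(alpha_ratio(power_spec, freqs))
print(hammarberg_index(power_spec, freqs))
print(spectral_slope(power_spec, freqs, 0, 500))  # slope of the 0-500 Hz band
```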
All LLD are smoothed over time with a symmetric moving average filter
3 frames long (for pitch, jitter, and shimmer, the smoothing is only
performed within voiced regions, i.e., the transitions from 0 (unvoiced)
to non-zero (voiced) values are not smoothed). Arithmetic mean and
coefficient of variation (standard deviation normalised by the arithmetic
mean) are applied as functionals to all 18 LLD, yielding 36 parameters.
To loudness and pitch the following 8 functionals are additionally applied:
20th, 50th, and 80th percentile, the range from the 20th to the 80th
percentile, and the mean and standard deviation of the slope of
rising/falling signal parts. All functionals are applied to voiced regions
only (non-zero F0), with the exception of the functionals applied to
loudness, which are computed over all regions. This gives a total of 52
parameters. Also, the arithmetic mean of the Alpha Ratio, the Hammarberg
Index, and the spectral slopes from 0-500 Hz and 500-1500 Hz over all
unvoiced segments are included, totalling 56 parameters. In
addition, six temporal features are included:
the rate of loudness peaks, i.e., the number of loudness peaks per second,
the mean length and the standard deviation of continuously voiced regions (F0 > 0),
the mean length and the standard deviation of unvoiced regions (F0 = 0; approximating pauses),
the number of continuous voiced regions per second (pseudo syllable rate).
No minimal length is imposed on voiced or unvoiced
regions, i.e., in the extreme case they could be only one
frame long. The Viterbi-based smoothing of the F0 contour,
however, effectively prevents single voiced frames that arise
from detection errors. In total, 62 parameters are con-
tained in the Geneva Minimalistic Standard Parameter Set.
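To illustrate how these functionals condense a frame-level contour into clip-level parameters, the following sketch applies the arithmetic mean, coefficient of variation, 20th/50th/80th percentiles, and their range to the voiced frames of a toy F0 contour, and computes the mean length of continuously voiced and unvoiced regions. It restates the verbal description above with illustrative names and a synthetic contour; it is not the openSMILE implementation.

```python
import numpy as np

def functionals(contour, voiced):
    """Summarise a frame-level contour over voiced frames only (F0 > 0)."""
    x = contour[voiced]
    p20, p50, p80 = np.percentile(x, [20, 50, 80])
    return {
        "mean": x.mean(),
        # coefficient of variation: standard deviation normalised by the mean
        "coeff_of_variation": x.std() / x.mean(),
        "percentile_20": p20,
        "percentile_50": p50,
        "percentile_80": p80,
        "pctl_range_20_80": p80 - p20,
    }

def region_lengths(mask):
    """Lengths (in frames) of runs of True in a boolean mask."""
    lengths, run = [], 0
    for v in mask:
        if v:
            run += 1
        elif run:
            lengths.append(run)
            run = 0
    if run:
        lengths.append(run)
    return np.array(lengths)

# Toy F0 contour in semitones; 0 marks unvoiced frames.
f0 = np.array([0, 0, 30.1, 30.5, 31.0, 30.8, 0, 0, 29.7, 30.2, 30.9, 0])
voiced = f0 > 0

print(functionals(f0, voiced))
print("mean voiced-region length:", region_lengths(voiced).mean())
print("mean unvoiced-region length:", region_lengths(~voiced).mean())
```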
3.2 Extended Parameter Set
The minimalistic set does not contain any cepstral parame-
ters and only very few dynamic parameters (i.e., it contains
no delta regression coefficients and no difference features;
only the slopes of rising and falling F0 and loudness seg-
ments encapsulate some dynamic information). Further,
especially cepstral parameters have proven highly success-
ful in modelling of affective states, e.g., by [23], [40], [41].
Thus, an extension set to the minimalistic set is proposed
which contains the following seven LLD in addition to the
18 LLD in the minimalistic set:
Spectral (balance/shape/dynamics) parameters:
MFCC 1-4, Mel-Frequency Cepstral Coefficients 1-4.
Spectral flux, difference of the spectra of two consecutive frames (a sketch follows this list).
Frequency related parameters:
Formant 2-3 bandwidth, added for completeness of the Formant 1-3 parameters.
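The spectral flux named above can be written compactly; the helper below is a hedged sketch of one common definition (mean squared difference of successive magnitude spectra), not the exact openSMILE formula, whose normalisation may differ.

```python
import numpy as np

def spectral_flux(mag_prev, mag_cur):
    """Difference between the magnitude spectra of two consecutive frames,
    computed here as the mean squared difference (one common convention)."""
    return float(np.mean((np.asarray(mag_cur) - np.asarray(mag_prev)) ** 2))

# Toy example: two random magnitude spectra of 257 bins each.
rng = np.random.default_rng(0)
frame_a, frame_b = rng.uniform(0, 1, 257), rng.uniform(0, 1, 257)
print(spectral_flux(frame_a, frame_b))
```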
As functionals, the arithmetic mean and the coefficient of var-
iation are applied to all seven additional LLD over all
segments (voiced and unvoiced together), except for the for-
mant bandwidths to which the functionals are applied only
in voiced regions. This adds 14 extra descriptors. Addition-
ally, the arithmetic mean of the spectral flux in unvoiced
regions only, the arithmetic mean and coefficient of varia-
tion of the spectral flux and MFCC 1-4 in voiced regions
only are included. This results in another 11 descriptors.
Additionally, the equivalent sound level is included. This
results in 26 extra parameters. In total, when combined with
the Minimalistic Set, the extended Geneva Minimalistic Acous-
tic Parameter Set (eGeMAPS) contains 88 parameters.
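Both sets are distributed with openSMILE, as stated above. For orientation, the snippet below shows one way to extract the 88 eGeMAPS functionals with the opensmile Python wrapper released later by audEERING; the wrapper, its class names, and the feature-set enum values (e.g., eGeMAPSv02) are not part of this article and may differ across releases, so treat this as an assumed interface.

```python
import opensmile

# Extract clip-level functionals of the extended set (eGeMAPS, 88 parameters).
# FeatureSet.GeMAPSv01b would select the 62-parameter minimalistic set instead.
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)
features = smile.process_file("speech.wav")  # returns a pandas DataFrame
print(features.shape)  # one row per file; 88 columns expected for eGeMAPSv02
```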
4 BASELINE EVALUATION
The proposed minimalistic parameter set and the extended
set are both evaluated for the task of automatic recognition
in binary arousal and binary valence dimensions. The origi-
nal labels (mixed various categories and continuous dimen-
sional) of six standard databases of affective speech were
mapped to binary dimensional labels (Arousal/Valence), as
described in Section 4.2 in order to enable a fair comparison
of performances on these databases.
The original labels (cf. Section 4.1 for details on the data-
bases) are: Levels of Interest (TUM AVIC database), acted
speech emotions in the Geneva Multimodal Emotion Por-
trayals (GEMEP) corpus and the German Berlin Emotional
Speech database (EMO-DB), emotions portrayed in the sing-
ing voice of professional opera singers (GeSiE), valence in
childrens’ speech from the FAU AIBO corpus [42] as used
for the INTERSPEECH 2009 Emotion Challenge [43], as well
as real-life emotions from German talk-show recordings
(Vera-am-Mittag corpus (VAM)). The proposed minimal
sets are compared to five large-scale, brute-forced baseline
acoustic feature sets of the INTERSPEECH 2009 Emotion
Challenge [43] (384 parameters), the INTERSPEECH 2010
Paralinguistic Challenge [36] (1,582 parameters), the INTER-
SPEECH 2011 Speaker State Challenge [44] (4,368 parame-
ters), the INTERSPEECH 2012 Speaker Trait Challenge [45]
(6,125 parameters), and the INTERSPEECH 2013 Computa-
tional Paralinguistics ChallengE (ComParE) [12] set (6,373
parameters), which is also used for the INTERSPEECH 2014
Computational Paralinguistics ChallengE [37].
4.1 Data-Sets
4.1.1 FAU AIBO
FAU AIBO served as the official corpus for the world’s first
international Emotion Challenge [43]. It contains recordings
of children who are interacting with the Sony pet robot
Aibo. It thus contains spontaneous, German speech which is
emotionally coloured. The children were told that the Aibo
robot was responding to their voice commands regarding
directions. However, the robot was in fact controlled by a
human operator, who sometimes caused the robot to behave
disobediently in order to provoke strong emotional reactions
from the children. The recordings were performed at two
different schools, referred to as MONT and OHM, from 51
children in total (age 10-13, 21 males, 30 females; approx. 9.2
hours of speech without pauses). The recorded audio was
segmented automatically into speech turns with a speech-
pause threshold of 1 s. The data are labelled for emotional
expression on the word level. As given in [43] five emotion
class labels are used: anger, emphatic, neutral, positive, and
rest. For a two-class valence task, all negative emotions
(Anger and Emphatic—NEG) and all non-negative emo-
tions (Neutral, Positive, and Rest—IDL) are combined.
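This grouping can be stated explicitly; the mapping below merely restates the two-class valence task described above, with lower-case label strings chosen for illustration (the corpus distribution uses its own label encoding).

```python
# Five FAU AIBO emotion classes grouped into the binary valence task:
# negative (NEG) = Anger + Emphatic; non-negative/idle (IDL) = Neutral + Positive + Rest.
AIBO_TO_BINARY_VALENCE = {
    "anger": "NEG",
    "emphatic": "NEG",
    "neutral": "IDL",
    "positive": "IDL",
    "rest": "IDL",
}

def to_binary_valence(label: str) -> str:
    return AIBO_TO_BINARY_VALENCE[label.lower()]

print(to_binary_valence("Emphatic"))  # -> NEG
```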
4.1.2 TUM Audiovisual Interest Corpus (TUM-AVIC)
The TUM Audiovisual Interest Corpus contains audiovisual
recordings of spontaneous affective interactions with non-
restricted spoken content [46]. It was used as data-set for the
INTERSPEECH 2010 Paralinguistics Challenge [36]. In the
set-up, a product presenter walks a subject through a com-
mercial presentation. The language used is English, although
most of the product presenters were German native speakers.
The subjects were mainly from European and Asian national-
ities. 21 subjects (10 female) were recorded in the corpus.
The LOI is labelled for every sub-turn (which are found by
a manual pause based sub-division of speaker turns) in three
labels ranging from boredom (subject is bored with the con-
versation or the topic or both, she/he is very passive and
does not follow the conversation; also referred to as loi1),
over neutral (she/he follows and participates in the conversa-
tion but it cannot be judged whether she/he is interested in
or indifferent towards the topic; also referred to as loi2) to joy-
ful interaction (showing a strong desire of the subject to talk
and to learn more about the topic, i.e., he/she shows a high
interest in the discussion; also referred to as loi3). For the
evaluations here, all 3,002 phrases (sub-turns) as in [47] are
used—in contrast to the only 996 phrases with high inter-
labeller agreement as, e.g., employed in [46].
4.1.3 Berlin Emotional Speech Database
A very well known and widely used set to test the effective-
ness of automatic emotion classification is the Berlin Emo-
tional Speech Database, also commonly known as EMO-DB.
It was introduced by [48]. It contains sentences spoken in
the emotion categories anger, boredom, disgust, fear, joy,
neutrality, and sadness. The linguistic content is pre-
defined by ten German short sentences, which are emotion-
ally neutral, such as “Der Lappen liegt auf dem Eisschrank”
(The cloth is lying on the fridge.). Ten (five of them female)
professional actors speak 10 sentences in each of the seven
emotional states. While the whole set contains over 700
utterances, in a listening test only 494 phrases are labelled
as a minimum 60 percent naturally sounding and a mini-
mum 80 percent identifiable (with respect to the emotion)
by 20 people. A mean accuracy of 84.3 percent is achieved
for identification of the emotions by the subjects in the lis-
tening experiment on this reduced set of 494 utterances.
This set is used in most other studies related to this database
(cf. [47]), therefore, it is also adopted here.
4.1.4 The Geneva Multimodal Emotion Portrayals
The GEMEP corpus is a collection of 1,260 multimodal emo-
tion expressions enacted by ten French-speaking actors [49].
The list of emotions includes those most frequently encoun-
tered in the literature (e.g., anger, fear, joy, and sadness) as
well as more subtle variations of these categories (e.g., anger
versus irritation, and fear versus anxiety). Specifically, the
12 following emotions are considered, which are distributed
