The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for Voice Research and Affective Computing
References
The WEKA data mining software: an update
Linear prediction: A tutorial review
Perceptual linear predictive (PLP) analysis of speech
Psychoacoustics: Facts and Models
Frequently Asked Questions (16)
Q2. What have the authors stated for future works in "The geneva minimalistic acoustic parameter set (gemaps) for voice research and affective computing" ?
It is expected that further research will strengthen these underpinnings and provide new insights. In the future, therefore, it would be worthwhile to expand the understanding of the acoustic output of affective phonation beyond sound level, pitch, and other basic parameters to the underlying, physiologically relevant parameters. Future development of GeMAPS could include the addition of techniques for inverse filtering the acoustic output signal to directly measure voice source parameters (see, e.g., [57]). In the radiated sound, the level difference between the two lowest voice source partials is also affected mainly by the frequency of the first formant, which may be of secondary importance to the affective colouring of phonation.
Q3. What is the effect of increasing adduction on the glottal airflow?
Increasing adduction has the effect of lengthening the closed phase and decreasing the amplitude of the transglottal airflow pulses.
Q4. Why are the two parameters included in the proposed set?
Due to the proven high importance of the fundamental frequency (cf. [6]) and amplitude/intensity, a robust fundamental frequency measure and a pseudo-auditory loudness measure are included in their proposed set.
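To illustrate what a "robust fundamental frequency measure" computes, here is a minimal autocorrelation-based F0 estimator. This is a didactic sketch, not the paper's actual F0 algorithm (which uses a more robust tracker with smoothing); the function name and search range are assumptions for illustration only.

```python
import math

def estimate_f0(frame, sample_rate, f0_min=55.0, f0_max=500.0):
    """Estimate the fundamental frequency of one frame via autocorrelation.

    Simplified stand-in for a production F0 tracker: the lag of the
    autocorrelation peak within the plausible pitch range maps to the
    pitch period, and F0 is its reciprocal.
    """
    max_lag = int(sample_rate / f0_min)
    min_lag = int(sample_rate / f0_max)
    best_lag, best_corr = 0, 0.0
    for lag in range(min_lag, min(max_lag, len(frame) - 1) + 1):
        corr = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if corr > best_corr:
            best_corr, best_lag = corr, lag
    return sample_rate / best_lag if best_lag else 0.0

# Synthetic 200 Hz sine sampled at 8 kHz
sr = 8000
frame = [math.sin(2 * math.pi * 200 * n / sr) for n in range(400)]
print(round(estimate_f0(frame, sr)))  # prints 200
```

Real trackers add voicing detection and temporal smoothing to avoid octave errors, which this sketch omits.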
Q5. What is the danger of large brute-forced feature sets?
Large brute-forced feature sets are well known to foster over-adaptation of classifiers to the training data in machine learning problems, reducing their generalisation capability to unseen (test) data (cf. [20]).
Q6. What was the method used to select the image?
During annotation, raters used an icon-based method which let them choose an image from an array of five images for each emotion dimension.
Q7. Why are formant parameters included in the proposed set?
To allow for vowel-based voice research, and due to their proven relevance for certain tasks, formant parameters are also included in the set.
Q8. What is the procedure used for the cross-validation of the arousal labels?
The cross-validation is then performed by training eight different models, each on data from seven folds, leaving out the first fold for testing for the first model, the second fold for testing for the second model, and so on.
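The leave-one-fold-out scheme described above can be sketched as follows; the function name is a hypothetical illustration of the eight-model training/testing rotation, not code from the paper.

```python
def leave_one_fold_out(n_folds=8):
    """Yield (train_folds, test_fold) pairs: model k is trained on all
    folds except fold k, which is held out for testing."""
    for test_fold in range(n_folds):
        train_folds = [f for f in range(n_folds) if f != test_fold]
        yield train_folds, test_fold

splits = list(leave_one_fold_out(8))
print(len(splits))   # prints 8 (one model per fold)
print(splits[0])     # prints ([1, 2, 3, 4, 5, 6, 7], 0)
```

Each fold serves exactly once as test data, so every utterance contributes to evaluation while never being seen by the model that scores it.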
Q9. Why was the validation experiments restricted to binary?
The validation experiments were restricted to binary classification experiments in order to allow for best comparability across databases.
Q10. What is the reason for the decision to average only over the higher complexity settings?
The decision to average only over the higher complexity settings was taken because at complexities lower than this threshold, performance drops significantly for the smaller feature sets, which biases the averaging.
Q11. How many parameters are included in the extended Geneva Minimalistic Acoustic Parameter Set?
In total, when combined with the Minimalistic Set, the extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) contains 88 parameters.
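The 88 parameters arise by applying statistical functionals to per-frame low-level descriptor (LLD) contours, which maps variable-length audio to a fixed-size vector. A minimal sketch of this idea follows, using two of the functionals the paper employs (arithmetic mean and coefficient of variation); the LLD names and values are hypothetical.

```python
import statistics

def functionals(contour):
    """Reduce a variable-length LLD contour to fixed-size statistics:
    arithmetic mean and coefficient of variation (std / mean)."""
    mean = statistics.fmean(contour)
    cov = statistics.pstdev(contour) / mean if mean else 0.0
    return {"mean": mean, "cov": cov}

# Hypothetical per-frame contours for two LLDs
llds = {"F0": [198.0, 201.0, 200.5, 199.5], "loudness": [0.4, 0.5, 0.45, 0.47]}
features = {f"{name}_{stat}": value
            for name, contour in llds.items()
            for stat, value in functionals(contour).items()}
print(sorted(features))  # prints ['F0_cov', 'F0_mean', 'loudness_cov', 'loudness_mean']
```

Applying a handful of such functionals to each of the set's LLDs (plus a few temporal features) is what yields the fixed 88-dimensional eGeMAPS vector per utterance.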
Q12. What is the effect of increasing adduction on the voice source?
Increasing adduction should result in attenuation of the voice source fundamental or, more specifically, in a reduction of the level difference between the two lowest voice source partials.
Q13. How many parameters are included in the proposed minimal sets?
The proposed minimalistic set contains 62 parameters. It is compared to five large-scale, brute-forced baseline acoustic feature sets: the INTERSPEECH 2009 Emotion Challenge [43] set (384 parameters), the INTERSPEECH 2010 Paralinguistic Challenge [36] set (1,582 parameters), the INTERSPEECH 2011 Speaker State Challenge [44] set (4,368 parameters), the INTERSPEECH 2012 Speaker Trait Challenge [45] set (6,125 parameters), and the INTERSPEECH 2013 Computational Paralinguistics ChallengE (ComParE) [12] set (6,373 parameters), which was also used for the INTERSPEECH 2014 Computational Paralinguistics ChallengE [37].
Q14. What criteria are used to guide the selection of acoustic parameters?
The choice of parameters has been guided (and is justified) by three criteria: 1) the potential of an acoustic parameter to index physiological changes in voice production during affective processes, 2) the frequency and success with which the parameter has been used in the past literature (see Section 2), and 3) its theoretical significance (see [1], [2]).
Q15. What is the recent overview of affective speech research?
An early survey [15] and a recent overview [17] nicely summarise several decades of psychological literature on affective speech research. They conclude from the empirical data that intensity (loudness); the mean, variability, and range of F0 (fundamental frequency); and the high-frequency content/energy of a speech signal correlate with prototypical vocal affective expressions, for example stress (intensity, F0 mean), anger and sadness (all parameters), and boredom (F0 variability and range).
Q16. Why is the result for FAU AIBO obtained with downsampling?
The best result for FAU AIBO was obtained with downsampling (not upsampling) because the computational cost of upsampling with high-dimensional parameter sets was too high relative to the expected accuracy gain.