COMPUTER ANIMATION AND VIRTUAL WORLDS
Comp. Anim. Virtual Worlds 2004; 15: 39–52 (DOI: 10.1002/cav.6)
Synthesizing multimodal utterances for
conversational agents
By Stefan Kopp* and Ipke Wachsmuth
Conversational agents are supposed to combine speech with non-verbal modalities for
intelligible multimodal utterances. In this paper, we focus on the generation of gesture and
speech from XML-based descriptions of their overt form. An incremental production model is
presented that combines the synthesis of synchronized gestural, verbal, and facial behaviors
with mechanisms for linking them in fluent utterances with natural co-articulation and
transition effects. In particular, an efficient kinematic approach for animating hand gestures
from shape specifications is presented, which provides fine adaptation to temporal
constraints that are imposed by cross-modal synchrony. Copyright © 2004 John Wiley & Sons, Ltd.
Received: 7 March 2003; Revised: 30 June 2003
KEY WORDS: multimodal conversational agents; gesture animation; model-based computer
animation; motion control
Introduction
Techniques from artificial intelligence, computer animation, and human–computer interaction are increasingly converging in the field of embodied conversational agents.¹ Such agents are envisioned to have similar properties to humans in face-to-face communication, including the ability to generate simultaneous verbal and non-verbal behaviours. This includes co-verbal gestures that humans frequently produce during speech to emphasize, clarify or even complement the conveyance of central parts of an utterance.
Current conversational agents, e.g. the real estate agent REA,² the pedagogical agent Steve³ or the agents in the BEAT system,⁴ generate their multimodal utterances by, first, planning the communicative acts to be performed and, secondly, synthesizing verbal and non-verbal behaviors. The latter stage involves the generation of appropriate, intelligible verbal and gestural acts per se as well as their combination in a seamless, human-like flow of multimodal behaviour. At the same time, verbal and non-verbal behaviours have to be finely synchronized at distinct points of time to ensure coherence of the resulting utterances. For example, the co-expressive elements in speech and co-verbal gesture appear in semantically and pragmatically coordinated form⁵ and, vitally important, in temporal synchrony even at the level of single syllables. Meeting these demands for synchrony, continuity, and lifelikeness simultaneously poses continuous problems for the automatic synthesis of multimodal utterances in conversational agents.
In our lab, the anthropomorphic agent Max is under development. Max acts as an assembly expert in an immersive 3D virtual environment (see Figure 1 for the overall scenario). The agent demonstrates assembly procedures to the user by combining facial and upper limb gestures with spoken utterances. An important aspect in synthesizing Max’s utterances is the real-time creation of synchronized gestural and verbal behaviors from application-independent descriptions of their outer form. Such descriptions are supposed to be created during high-level utterance planning and to be specified in MURML, an XML-based representation language.⁶ In this paper, we present the utterance production model employed in Max with a focus on the gesture animation process. After discussing related work in the next section, we describe a production model that employs natural mechanisms of cross-modal adaptation to incrementally create fluent and coherent utterances of multiple verbal and gestural parts. The model combines a system for synthesizing accented speech with a hierarchical approach to planning and controlling upper-limb movements of an articulated figure, the high-level planning stages being described by Kopp and Wachsmuth.⁷ Then, we focus on the problem of creating adequate gesture animations in real time and present a kinematic approach that emphasizes the accurate and reliable reproduction of given spatio-temporal gesture features.

*Correspondence to: Stefan Kopp, Artificial Intelligence Group, Faculty of Technology, University of Bielefeld, D-33594 Bielefeld, Germany. E-mail: skopp@techfak.uni-bielefeld.de
Related Work

In current conversational agents, co-verbal gestures are usually created for the rhematic elements in speech by mapping communicative acts onto gestural behaviours drawn from static libraries.²,³ Due to the demand for realism and real-time capability, such behaviours are associated with animations that are either captured from real humans or manually predefined to a large extent, sometimes being parameterizable or combinable to more complex movements. In the Animated Conversation⁸ and REA² systems, as well as in the recent BEAT system,⁴ Cassell et al. succeeded in predicting the timing of gesture animations from synthesized speech such that the expressive phase coincides with the most prominent syllable in speech. Yet, the employed techniques suffer from limited flexibility when it comes to adjusting a gesture’s timing accordingly,²,⁹ or concatenating gestures into continuous motion. Cassell¹⁰ states that the problem of creating gesture animations and synchronizing them with speech has not been solved so far, ‘due in part to the difficulty of reconciling the demands of graphics and speech synthesis software’ (p. 16). This can be ascribed, first, to the lack of sufficient means of modulating, e.g. shrinking or stretching, single gesture phases⁴ and, secondly, to a behavior execution that runs ‘ballistically’, i.e. without the possibility to exert influence, in an animation system whose reliability is sometimes hard to predict.
A fully automatic creation of upper limb movements by means of applying control models was targeted by only a few researchers. Koga et al.¹¹ proposed a purely kinematic model for simulating pre-planned arm movements for grasping and manipulating objects. In particular, this work succeeded in applying findings from neurophysiology to create natural arm postures. Approaches based on control algorithms in dynamic simulations or on optimization criteria provide a high level of control and may lead to physically realistic movements. However, these techniques suffer from difficulties in formulating control schemes for highly articulated figures and from immense computational cost. Matarić et al.¹² stress the problem of determining appropriate control strategies and propose the combined application of different controllers for simulating natural upper limb movements. Gibet et al.¹³ apply generic error-correcting controllers for generating sign language from script-like specifications. Their models succeeded in simulating natural movement characteristics to some extent but did not focus on how to meet various timing constraints as required in co-verbal gesture.

Figure 1. Multimodal interaction with Max.

In summary, the problem of synchronizing gesture animations with spoken utterances has not been solved beyond bringing single points in more or less atomic behaviours to coincidence. The current state is particularly insufficient for virtual agents that shall be able to produce more extensive, coherent multimodal utterances in a smooth and lifelike fashion.
An Incremental Model of Speech–Gesture Production
Our approach to synthesizing multimodal utterances starts from straightforward descriptions of their desired outer form, which are supposed to be generated at higher levels of utterance planning and to be specified in MURML, an XML-based representation language.⁶ Such descriptions contain the verbal utterance, augmented with co-verbal gestures (explicitly stated in terms of form features) by defining only their affiliation to certain linguistic elements. An example is shown in Figure 2. Taking MURML specifications as input, our production model aims at creating synchronized verbal and non-verbal behaviours in a human-like flow of multimodal behaviour.
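Since Figure 2 itself is not reproduced in this extract, the fragment below is only an illustrative, MURML-like specification: the tag and attribute names are assumptions, not the actual MURML schema. The short Python sketch shows how such a description could be read to recover the verbal part, the gesture's affiliation and its form features.

    # Illustrative only: tags and attributes are assumptions, not the real MURML schema.
    import xml.etree.ElementTree as ET

    murml = """
    <utterance>
      <specification>
        This <time id="t1"/>red<time id="t2"/> bar is part of the propeller.
      </specification>
      <behaviorspec id="gesture_1">
        <gesture affiliate="t1 t2">
          <constraints>
            <static slot="HandShape" value="BSflat"/>
            <static slot="PalmOrientation" value="PalmUp"/>
          </constraints>
        </gesture>
      </behaviorspec>
    </utterance>
    """

    root = ET.fromstring(murml)
    words = " ".join(root.find("specification").itertext()).split()
    gesture = root.find(".//gesture")
    print("verbal part:", " ".join(words))
    print("gesture affiliated to time markers:", gesture.get("affiliate"))
    for c in gesture.iter("static"):
        print("form feature:", c.get("slot"), "=", c.get("value"))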
The Segmentation Hypothesis

In order to organize the production of gesture and speech over multiple sequential behaviours, we adopt an empirically suggested assumption⁵ as a segmentation hypothesis: continuous speech and gesture are co-produced in successive segments, each expressing a single idea unit. The inherent segmentation of speech–gesture production in humans is reflected in the hierarchical structures of overt gesture and speech and their cross-modal correspondences.⁵,¹⁴ Kendon¹⁴ defined units of gestural movement to consist of gesture phrases which comprise one or more subsequent movement phases, notably preparation, stroke (the expressive phase), retraction and holds. Similarly, the phonological structure of connected speech in intonation languages such as English and German is organized over intonation phrases.¹⁵ Such phrases are separated by significant pauses, they follow the syntactical phrase structure, and display a meaningful pitch contour with exactly one primary pitch accent (nucleus).
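To make this terminology concrete, the following minimal sketch models the structures named above; the type and field names are illustrative choices, not taken from the paper.

    # Minimal sketch of the gesture/speech structures named above (illustrative names).
    from dataclasses import dataclass
    from enum import Enum
    from typing import List

    class MovementPhase(Enum):
        PREPARATION = "preparation"
        STROKE = "stroke"          # the expressive phase
        HOLD = "hold"
        RETRACTION = "retraction"

    @dataclass
    class GesturePhrase:
        phases: List[MovementPhase]

    @dataclass
    class IntonationPhrase:
        words: List[str]
        nucleus_index: int         # exactly one primary pitch accent per phrase

    gp = GesturePhrase([MovementPhase.PREPARATION, MovementPhase.STROKE,
                        MovementPhase.HOLD, MovementPhase.RETRACTION])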
We define chunks of speech–gesture production to be pairs of an intonation phrase and a co-expressive gesture phrase, i.e. complex utterances with multiple gestures are considered to consist of several chunks. Within each chunk, the prominent concept is concertedly conveyed by a gesture and an affiliated word or sub-phrase (in short, affiliate). The co-expressivity is evidenced by a general temporal synchrony: gestural movements are timed such that the meaning-bearing stroke phase starts before the affiliate and frequently spans it, optionally by inserting dedicated hold phases in the flow of movement. This coupling is refined if one of the affiliated words is prosodically focused, e.g. for emphasizing or contrasting purposes, and hence carries the nucleus of the phrase. In this case, the gesture stroke starts with the nucleus at the latest and is not finished before it.⁵,¹⁶,¹⁷
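The intra-chunk timing rule just stated can be summarized as a simple check. The sketch below is illustrative only (times in seconds; the field names are not the paper's).

    # Sketch of the intra-chunk synchrony rule described above (times in seconds).
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class ChunkTiming:
        affiliate_onset: float
        affiliate_end: float
        stroke_onset: float
        stroke_end: float                        # stroke plus any post-stroke hold
        nucleus_onset: Optional[float] = None    # set only for a narrow (prosodic) focus

    def satisfies_synchrony(c: ChunkTiming) -> bool:
        if c.nucleus_onset is not None:
            # Stroke starts with the nucleus at the latest and is not finished before it.
            return c.stroke_onset <= c.nucleus_onset <= c.stroke_end
        # Default case: stroke starts before the affiliate and spans it.
        return c.stroke_onset < c.affiliate_onset and c.stroke_end >= c.affiliate_end

    print(satisfies_synchrony(ChunkTiming(1.2, 1.8, 0.9, 2.0)))   # True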
Mechanisms of Cross-Modal Coordination
In humans, the synchrony of gesture and speech is
accomplished by means of cross-modal adaptation.
The segmentation hypothesis enables us to treat the
effective mechanisms on different levels of the utterance
and to organize the overall production process in stages.
Producing a Chunk. Within a chunk, the synchrony between the affiliate (or nucleus) and the stroke is mainly accomplished by the gesture adapting to the structure and timing of running speech. In producing a single chunk, the intonation phrase can therefore be synthesized in advance, potentially augmented with a strong pitch accent for narrow focus. As in previous systems (e.g. BEAT⁴), absolute time information at the phoneme level is then employed to set up timing constraints for co-verbal gestural or facial behaviors. The gesture stroke is either set to precede the affiliate’s onset by a given offset (per default one syllable’s approximate duration of 0.3 s) or to start exactly at the nucleus if a narrow focus has been defined. In any case, the stroke is set to span the whole affiliate before retraction starts. This may be achieved for dynamic strokes with a post-stroke hold or additional repetitions, both strategies observable in humans.⁵
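A minimal sketch of how such timing constraints could be derived from the absolute word timings delivered by the text-to-speech stage; the 0.3 s default offset is taken from the text, while the names and signatures are assumed.

    # Derive gesture timing constraints for one chunk from absolute speech timing.
    from typing import Optional, Tuple

    SYLLABLE_OFFSET = 0.3  # approximate duration of one syllable (default stroke lead)

    def stroke_constraints(affiliate_onset: float,
                           affiliate_end: float,
                           nucleus_onset: Optional[float] = None) -> Tuple[float, float]:
        """Return (stroke_start, stroke_end_at_least) for the gesture stroke."""
        if nucleus_onset is not None:
            start = nucleus_onset                  # narrow focus: start at the nucleus
        else:
            start = affiliate_onset - SYLLABLE_OFFSET
        end_at_least = affiliate_end               # stroke (plus hold) spans the affiliate
        return start, end_at_least

    def post_stroke_hold(stroke_motion_duration: float,
                         start: float, end_at_least: float) -> float:
        """Extra hold time needed if the dynamic stroke alone ends too early."""
        return max(0.0, end_at_least - (start + stroke_motion_duration))

    s, e = stroke_constraints(affiliate_onset=2.1, affiliate_end=2.7)
    print(s, e, post_stroke_hold(0.4, s, e))   # stroke starts 0.3 s early, held until 2.7 s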

Combining Successive Chunks. Humans appear to anticipate the synchrony of the forthcoming affiliate (or nucleus) and stroke already at the boundary of successive chunks, since it is prepared there in both modalities: the onset of the gesture phrase co-varies with the position of the nucleus, and the onset of the intonation phrase co-varies with the stroke onset.⁵,¹⁶,¹⁷ In consequence, movement between two strokes depends on the timing of the successive strokes and may range from the adoption of intermediate rest positions to direct transitional movements (co-articulation). Likewise, the duration of the silent pause between the intonation phrases may vary according to the required duration of gesture preparation. Simulating these mechanisms is highly context-dependent, for it has to take into account properties of the subsequent stroke (form, location, timing constraints) as well as the current movement conditions when the previous chunk can be relieved, i.e. when its intonation phrase and its gesture stroke are completed.
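The kind of decision involved at a chunk boundary can be sketched as follows; the thresholds and function names are assumptions rather than the actual planning logic.

    # Sketch of the inter-chunk decision described above: retract to a rest pose or
    # transition directly into the next preparation. Thresholds and names are assumptions.
    def plan_transition(prev_stroke_end: float,
                        next_stroke_onset: float,
                        preparation_duration: float,
                        retraction_duration: float) -> str:
        gap = next_stroke_onset - prev_stroke_end
        if gap >= retraction_duration + preparation_duration:
            return "retract to rest position, then prepare"   # time for full retraction
        if gap >= preparation_duration:
            return "partial retraction / interim rest pose"
        return "direct transitional movement (co-articulation)"

    def stretched_pause(planned_pause: float, preparation_duration: float) -> float:
        # The silent pause between intonation phrases may be lengthened so that
        # gesture preparation can finish before the mandatory stroke onset.
        return max(planned_pause, preparation_duration)

    print(plan_transition(3.0, 3.4, 0.5, 0.6))   # -> direct transitional movement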
The Production Process
Our production model combines the aforementioned coordination mechanisms to create, as seamlessly as possible, a natural flow of speech and gesture across successive coherent chunks. To this end, the classical two-phase planning–execution procedure is extended for each chunk by additional phases in which the production processes of subsequent chunks can relieve one another. Each chunk is processed on a separate blackboard running through a series of processing states (see Figure 3); a rough code sketch of this life cycle is given after the list below.

Figure 2. Sample XML specification of a multimodal utterance.
1. InPrep: Separate modules for speech synthesis, high-level gesture planning and facial animation contribute to a chunk’s blackboard during the overall planning process. The text-to-speech system synthesizes the intonation phrase and controls prosodic parameters like speech rate and intonation to create natural pitch accents. Concurrently, the gesture planner defines the expressive gesture phase in terms of movement constraints by selecting a lexicalized gesture template in MURML, allocating body parts, expanding abstract movement constraints and resolving deictic references (as described by Kopp and Wachsmuth⁷). At this stage, connecting effects are created when a subsequent chunk is anticipated: the pitch level in speech is maintained and gesture retraction is planned to lead into an interim rest position. Once timing information about speech has arrived on the blackboard, the face module prepares a lip-synchronous speech animation using simple viseme interpolation, augmented with eyebrow raises on accented syllables and emotional expression.
2. Pending: Once chunk planning has been completed,
the state is set to Pending.
3. Lurking: A global scheduler monitors the production processes of successive chunks. If a chunk can be uttered, i.e. the preceding chunk is Subsiding (see below), the scheduler defines the intra-chunk synchrony as described above and reconciles it with the onsets of the intonation and gesture phrases. In case the affiliate is located early in the intonation phrase, the scheduler lets the gesture’s preparation precede speech. Due to predefined movement velocity, where movement duration is estimated from its amplitude using a logarithmic law (see the section on ‘Gesture Motor Control’), the vocal pause between subsequent intonation phrases may thus be stretched, depending on the time consumption of the preparation phase (see Figure 3). Besides this possible adaptation to gesture, intonation phrases are articulated ballistically as prepared by the text-to-speech system. Finally, the scheduler passes control over to the successive chunk.
At this point, the motor layer is responsible for, first, planning on-the-fly upper-limb animations of the agent that exactly satisfy the given movement and timing constraints. Secondly, gesture animations must be blended autonomously according to the given timing constraints as well as the current movement conditions. For example, a gesture whose form properties require, under current movement conditions, a more extensive preparation has to start earlier to naturally meet the mandatory time of stroke onset. Since at this point the preceding gesture may not have been fully retracted, fluent gesture transitions should emerge depending on the placement of the affiliate within the verbal phrase (see Figure 3). We describe such a motor control layer for Max in the next section.
4–6. InExec, Subsiding, Done: Depending on feedback information from behaviour executions, which is collected on the blackboard, the chunk state then switches to InExec. Eventually, once the intonation phrase, the […]

Figure 3. Incremental production of multimodal chunks.
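A rough code sketch of the chunk life cycle and the scheduler rule described above; the state names follow the text, while the blackboard and scheduler internals are simplified assumptions.

    # Sketch of the chunk life cycle and the global scheduler rule (simplified).
    from enum import Enum, auto
    from typing import List, Optional

    class ChunkState(Enum):
        IN_PREP = auto()
        PENDING = auto()
        LURKING = auto()
        IN_EXEC = auto()
        SUBSIDING = auto()
        DONE = auto()

    class Chunk:
        def __init__(self, name: str):
            self.name = name
            self.state = ChunkState.IN_PREP

    class Scheduler:
        """Lets each chunk start only once its predecessor is Subsiding (or Done)."""
        def __init__(self, chunks: List[Chunk]):
            self.chunks = chunks

        def step(self) -> None:
            for i, chunk in enumerate(self.chunks):
                pred: Optional[Chunk] = self.chunks[i - 1] if i > 0 else None
                if chunk.state is ChunkState.PENDING:
                    chunk.state = ChunkState.LURKING      # planned, waiting to be uttered
                elif chunk.state is ChunkState.LURKING and (
                        pred is None or pred.state in (ChunkState.SUBSIDING, ChunkState.DONE)):
                    # Here the scheduler would fix intra-chunk synchrony and phrase onsets.
                    chunk.state = ChunkState.IN_EXEC

    chunks = [Chunk("c1"), Chunk("c2")]
    for c in chunks:
        c.state = ChunkState.PENDING                      # planning finished
    sched = Scheduler(chunks)
    sched.step(); sched.step()
    print([c.state.name for c in chunks])  # ['IN_EXEC', 'LURKING']: c2 waits for c1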

References

McNeill D. Hand and Mind: What Gestures Reveal about Thought. University of Chicago Press: Chicago, 1992.

Cassell J, Sullivan J, Prevost S, Churchill E (eds). Embodied Conversational Agents. MIT Press: Cambridge, MA, 2000.

Cassell J, Vilhjálmsson H, Bickmore T. BEAT: the Behavior Expression Animation Toolkit. In Proceedings of ACM SIGGRAPH 2001.
Frequently Asked Questions (19)
Q1. What contributions have the authors mentioned in the paper "Synthesizing multimodal utterances for conversational agents" ?

In this paper, the authors focus on the generation of gesture and speech from XML-based descriptions of their overt form. An incremental production model is presented that combines the synthesis of synchronized gestural, verbal, and facial behaviors with mechanisms for linking them in fluent utterances with natural co-articulation and transition effects. In particular, an efficient kinematic approach for animating hand gestures from shape specifications is presented, which provides fine adaptation to temporal constraints that are imposed by cross-modal synchrony.

Concerning future work, it appears natural to further exploit the flexibility and generality of their synthesis model for the automatic planning of multimodal utterances of a wide variety. The authors expect this to yield a coordinated accentuation, e.g. according to an underlying rhythmic pulse, and to include the timing of velocity peaks of single movement phases, which can be taken into account in their approach. Furthermore, the gesture animation model will be further explored with respect to variations of the parameters, e.g. influencing the relationship between trajectory curvature and velocity.

The animation of co-verbal gesture requires a high degree of control and flexibility with respect to shape and time properties while at the same time ensuring naturalness of movement. 

To ensure fluent, at least C1-continuous connection to the given boundary conditions, a kinematic feedforward controller cannot be created until the moment of activation of its LMP. 
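The extract does not spell out how this feedforward controller is built, but the boundary conditions named (position and velocity at both ends of a segment) admit, for example, the standard cubic Hermite construction. The sketch below shows that generic construction only; it is not the authors' controller.

    # Generic illustration of a C1-continuous connection: a cubic Hermite segment that
    # matches position and velocity at both boundaries (not the paper's controller).
    def hermite_segment(p0: float, v0: float, p1: float, v1: float, T: float):
        """Return x(t) on [0, T] with x(0)=p0, x'(0)=v0, x(T)=p1, x'(T)=v1."""
        def x(t: float) -> float:
            s = t / T
            h00 = 2*s**3 - 3*s**2 + 1
            h10 = s**3 - 2*s**2 + s
            h01 = -2*s**3 + 3*s**2
            h11 = s**3 - s**2
            return h00*p0 + h10*T*v0 + h01*p1 + h11*T*v1
        return x

    seg = hermite_segment(p0=0.0, v0=0.2, p1=1.0, v1=0.0, T=0.5)
    print(seg(0.0), seg(0.5))   # 0.0 and 1.0: endpoints met, slopes match v0 and v1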

The resulting synthetic utterances achieve cross-modal synchrony even at the syllable level while reproducing natural co-articulation and transition effects. 

Their approach to forming wrist trajectories relies on the well-known observation that complex arm movements consist of subsequently and ballistically performed segments with the following kinematic regularities of the effector trajectory:²²
* short targeted segments are straight or curvilinear (either C- or S-shaped) and always planar;
* they exhibit a symmetrical bell-shaped velocity profile;
* a quasi-linear relation between amplitude and peak velocity, as well as an approximate logarithmic relation between amplitude and movement duration, holds;
* at any point except points of extreme bending, the movement speed can be estimated from the radius r of the trajectory by the ‘law of 2/3’: v = k · r^(1/3), where k is a constant velocity gain factor for each segment and assumed to be a parameter of motor control.
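In conventional notation, the two quantitative regularities in this list are often written as follows (general textbook form; the constants a, b and k are segment-specific and not taken from the paper):

    \[
      v(t) = k \, r(t)^{1/3}
      \qquad \text{(the ``law of 2/3'': tangential speed from the radius of curvature } r \text{)}
    \]
    \[
      T \approx a + b \log A
      \qquad \text{(movement duration grows roughly logarithmically with amplitude } A \text{)}
    \]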

As described in the previous section, the motor control layer of Max is in charge of autonomously creating context-dependent gesture transitions. 

For the shoulder and wrist joint, the authors apply the approach by Wilhelms and Van Gelder¹⁸ to define the joint limits geometrically in terms of reach cones with varying twist limits.
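As a toy illustration of the reach-cone idea, reduced to a single circular cone (the actual Wilhelms and Van Gelder reach cones are spherical polygons with per-boundary twist limits), a direction test might look like this:

    # Toy reach-cone style joint limit: is the bone direction inside an allowed cone?
    import math

    def normalize(v):
        n = math.sqrt(sum(x*x for x in v))
        return tuple(x / n for x in v)

    def inside_reach_cone(bone_dir, cone_axis, half_angle_deg: float) -> bool:
        b, a = normalize(bone_dir), normalize(cone_axis)
        cos_angle = sum(x*y for x, y in zip(b, a))
        return cos_angle >= math.cos(math.radians(half_angle_deg))

    print(inside_reach_cone((0.2, 1.0, 0.1), (0.0, 1.0, 0.0), 30.0))   # True
    print(inside_reach_cone((1.0, 0.1, 0.0), (0.0, 1.0, 0.0), 30.0))   # False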

To recombine the LMPs for a solution to the overall control problem, LMPs run concurrently and synchronized in an abstract motor control program (MCP) for each limb’s motion (see Figure 4). 
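A rough sketch of this arrangement, with several concurrently active LMPs contributing to one motor control program (MCP) per limb; the interfaces shown are assumptions, not the paper's API.

    # Several LMPs run concurrently inside one MCP per limb, each contributing to its
    # own degrees of freedom while active. Interfaces are illustrative assumptions.
    from typing import Callable, Dict, List

    class LMP:
        def __init__(self, start: float, end: float,
                     generate: Callable[[float], Dict[str, float]]):
            self.start, self.end, self.generate = start, end, generate

        def active(self, t: float) -> bool:
            return self.start <= t <= self.end

    class MCP:
        """One motor control program per limb, combining its concurrent LMPs."""
        def __init__(self, lmps: List[LMP]):
            self.lmps = lmps

        def update(self, t: float) -> Dict[str, float]:
            targets: Dict[str, float] = {}
            for lmp in self.lmps:
                if lmp.active(t):
                    targets.update(lmp.generate(t))   # later LMPs may override shared DOFs
            return targets

    wrist = LMP(0.0, 1.0, lambda t: {"wrist_x": 0.3 * t})
    hand = LMP(0.2, 0.8, lambda t: {"hand_shape": 1.0})
    mcp = MCP([wrist, hand])
    print(mcp.update(0.5))   # {'wrist_x': 0.15, 'hand_shape': 1.0}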
