COMPUTER ANIMATION AND VIRTUAL WORLDS
Comp. Anim. Virtual Worlds 2004; 15: 39–52 (DOI: 10.1002/cav.6)
Synthesizing multimodal utterances for
conversational agents
By Stefan Kopp* and Ipke Wachsmuth
Conversational agents are supposed to combine speech with non-verbal modalities for
intelligible multimodal utterances. In this paper, we focus on the generation of gesture and
speech from XML-based descriptions of their overt form. An incremental production model is
presented that combines the synthesis of synchronized gestural, verbal, and facial behaviors
with mechanisms for linking them in fluent utterances with natural co-articulation and
transition effects. In particular, an efficient kinematic approach for animating hand gestures
from shape specifications is presented, which provides fine adaptation to temporal
constraints that are imposed by cross-modal synchrony. Copyright © 2004 John Wiley & Sons, Ltd.
Received: 7 March 2003; Revised: 30 June 2003
KEY WORDS: multimodal conversational agents; gesture animation; model-based computer
animation; motion control
Introduction
Techniques from artificial intelligence, computer animation, and human–computer interaction are increasingly converging in the field of embodied conversational agents.¹ Such agents are envisioned to have similar properties to humans in face-to-face communication, including the ability to generate simultaneous verbal and non-verbal behaviours. This includes co-verbal gestures that humans frequently produce during speech to emphasize, clarify or even complement the conveyance of central parts of an utterance.
Current conversational agents, e.g. the real estate agent REA,² the pedagogical agent Steve³ or the agents in the BEAT system,⁴ generate their multimodal utterances by, first, planning the communicative acts to be performed and, secondly, synthesizing verbal and non-verbal behaviors. The latter stage involves the generation of appropriate, intelligible verbal and gestural acts per se as well as their combination in a seamless, human-like flow of multimodal behaviour. At the same time, verbal and non-verbal behaviours have to be finely synchronized at distinct points of time to ensure coherence of the resulting utterances. For example, the co-expressive elements in speech and co-verbal gesture appear in semantically and pragmatically coordinated form⁵ and, vitally important, in temporal synchrony even at the level of single syllables. Meeting these demands for synchrony, continuity, and lifelikeness simultaneously poses continuous problems for the automatic synthesis of multimodal utterances in conversational agents.
In our lab, the anthropomorphic agent Max is under development. Max acts as an assembly expert in an immersive 3D virtual environment (see Figure 1 for the overall scenario). The agent demonstrates assembly procedures to the user by combining facial and upper limb gestures with spoken utterances. An important aspect in synthesizing Max’s utterances is the real-time creation of synchronized gestural and verbal behaviors from application-independent descriptions of their outer form. Such descriptions are supposed to be created during high-level utterance planning and to be specified in MURML, an XML-based representation language.⁶ In this paper, we present the utterance production model employed in Max with a focus on the gesture animation process. After discussing related work in the next section, we describe a production model that employs natural mechanisms of cross-modal adaptation to incrementally create fluent and coherent utterances of multiple verbal and gestural parts. The model combines a system for synthesizing accented speech with a hierarchical approach to planning and controlling upper-limb movements of an articulated figure, the high-level planning stages being described by Kopp and Wachsmuth.⁷ Then, we focus on the problem of creating adequate gesture animations in real time and present a kinematic approach that emphasizes the accurate and reliable reproduction of given spatio-temporal gesture features.

*Correspondence to: Stefan Kopp, Artificial Intelligence Group, Faculty of Technology, University of Bielefeld, D-33594 Bielefeld, Germany. E-mail: skopp@techfak.uni-bielefeld.de
Related Work

In current conversational agents, co-verbal gestures are usually created for the rhematic elements in speech by mapping communicative acts onto gestural behaviours drawn from static libraries.²,³ Due to the demand for realism and real-time capability, such behaviours are associated with animations that are either captured from real humans or manually predefined to a large extent, sometimes being parameterizable or combinable to more complex movements. In the Animated Conversation⁸ and REA² systems, as well as in the recent BEAT system,⁴ Cassell et al. succeeded in predicting the timing of gesture animations from synthesized speech such that the expressive phase coincides with the most prominent syllable in speech. Yet, the employed techniques suffer from limited flexibility when it comes to adjusting a gesture’s timing accordingly,²,⁹ or concatenating gestures into continuous motion. Cassell¹⁰ states that the problem of creating gesture animations and synchronizing them with speech has not been solved so far, ‘due in part to the difficulty of reconciling the demands of graphics and speech synthesis software’ (p. 16). This can be ascribed, first, to the lack of sufficient means of modulating, e.g. shrinking or stretching, single gesture phases⁴ and, secondly, to a behavior execution that runs ‘ballistically’, i.e. without the possibility to exert influence, in an animation system whose reliability is sometimes hard to predict.
A fully automatic creation of upper limb movements by means of applying control models was targeted by only a few researchers. Koga et al.¹¹ proposed a purely kinematic model for simulating pre-planned arm movements for grasping and manipulating objects. In particular, this work succeeded in applying findings from neurophysiology to create natural arm postures. Approaches based on control algorithms in dynamic simulations or on optimization criteria provide a high level of control and may lead to physically realistic movements. However, these techniques suffer from difficulties in formulating control schemes for highly articulated figures and from immense computational cost. Matarić et al.¹² stress the problem of determining appropriate control strategies and propose the combined application of different controllers for simulating natural upper limb movements. Gibet et al.¹³ apply generic error-correcting controllers for generating sign language from script-like specifications. Their models succeeded in simulating natural movement characteristics to some extent but did not focus on how to meet various timing constraints as required in co-verbal gesture.

Figure 1. Multimodal interaction with Max.

In summary, the problem of synchronizing gesture animations with spoken utterances has not been solved beyond bringing single points in more or less atomic behaviours to coincidence. The current state is particularly insufficient for virtual agents that shall be able to produce more extensive, coherent multimodal utterances in a smooth and lifelike fashion.
An Incremental Model of Speech–Gesture Production
Our approach to synthesizing multimodal utterances starts from straightforward descriptions of their desired outer form, which are supposed to be generated at higher levels of utterance planning and to be specified in MURML, an XML-based representation language.⁶ Such descriptions contain the verbal utterance, augmented with co-verbal gestures (explicitly stated in terms of form features) by defining only their affiliation to certain linguistic elements. An example is shown in Figure 2. Taking MURML specifications as input, our production model aims at creating synchronized verbal and non-verbal behaviours in a human-like flow of multimodal behaviour.
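Since Figure 2 itself is not reproduced in this extract, the fragment below is only an illustrative, MURML-like specification: the tag and attribute names are assumptions, not the actual MURML schema. The short Python sketch shows how such a description could be read to recover the verbal part, the gesture's affiliation and its form features.

    # Illustrative only: tags and attributes are assumptions, not the real MURML schema.
    import xml.etree.ElementTree as ET

    murml = """
    <utterance>
      <specification>
        This <time id="t1"/>red<time id="t2"/> bar is part of the propeller.
      </specification>
      <behaviorspec id="gesture_1">
        <gesture affiliate="t1 t2">
          <constraints>
            <static slot="HandShape" value="BSflat"/>
            <static slot="PalmOrientation" value="PalmUp"/>
          </constraints>
        </gesture>
      </behaviorspec>
    </utterance>
    """

    root = ET.fromstring(murml)
    words = " ".join(root.find("specification").itertext()).split()
    gesture = root.find(".//gesture")
    print("verbal part:", " ".join(words))
    print("gesture affiliated to time markers:", gesture.get("affiliate"))
    for c in gesture.iter("static"):
        print("form feature:", c.get("slot"), "=", c.get("value"))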
The Segmentation Hypothesis

In order to organize the production of gesture and speech over multiple sequential behaviours, we adopt an empirically suggested assumption⁵ as a segmentation hypothesis: continuous speech and gesture are co-produced in successive segments, each expressing a single idea unit. The inherent segmentation of speech–gesture production in humans is reflected in the hierarchical structures of overt gesture and speech and their cross-modal correspondences.⁵,¹⁴ Kendon¹⁴ defined units of gestural movement to consist of gesture phrases which comprise one or more subsequent movement phases, notably preparation, stroke (the expressive phase), retraction and holds. Similarly, the phonological structure of connected speech in intonation languages such as English and German is organized over intonation phrases.¹⁵ Such phrases are separated by significant pauses, they follow the syntactical phrase structure, and display a meaningful pitch contour with exactly one primary pitch accent (nucleus).
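To make this terminology concrete, the following minimal sketch models the structures named above; the type and field names are illustrative choices, not taken from the paper.

    # Minimal sketch of the gesture/speech structures named above (illustrative names).
    from dataclasses import dataclass
    from enum import Enum
    from typing import List

    class MovementPhase(Enum):
        PREPARATION = "preparation"
        STROKE = "stroke"          # the expressive phase
        HOLD = "hold"
        RETRACTION = "retraction"

    @dataclass
    class GesturePhrase:
        phases: List[MovementPhase]

    @dataclass
    class IntonationPhrase:
        words: List[str]
        nucleus_index: int         # exactly one primary pitch accent per phrase

    gp = GesturePhrase([MovementPhase.PREPARATION, MovementPhase.STROKE,
                        MovementPhase.HOLD, MovementPhase.RETRACTION])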
We define chunks of speech–gesture production to be pairs of an intonation phrase and a co-expressive gesture phrase, i.e. complex utterances with multiple gestures are considered to consist of several chunks. Within each chunk, the prominent concept is concertedly conveyed by a gesture and an affiliated word or sub-phrase (in short, affiliate). The co-expressivity is evidenced by a general temporal synchrony: gestural movements are timed such that the meaning-bearing stroke phase starts before the affiliate and frequently spans it, optionally by inserting dedicated hold phases in the flow of movement. This coupling is refined if one of the affiliated words is prosodically focused, e.g. for emphasizing or contrasting purposes, and hence carries the nucleus of the phrase. In this case, the gesture stroke starts with the nucleus at the latest and is not finished before it.⁵,¹⁶,¹⁷
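The intra-chunk timing rule just stated can be summarized as a simple check. The sketch below is illustrative only (times in seconds; the field names are not the paper's).

    # Sketch of the intra-chunk synchrony rule described above (times in seconds).
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class ChunkTiming:
        affiliate_onset: float
        affiliate_end: float
        stroke_onset: float
        stroke_end: float                        # stroke plus any post-stroke hold
        nucleus_onset: Optional[float] = None    # set only for a narrow (prosodic) focus

    def satisfies_synchrony(c: ChunkTiming) -> bool:
        if c.nucleus_onset is not None:
            # Stroke starts with the nucleus at the latest and is not finished before it.
            return c.stroke_onset <= c.nucleus_onset <= c.stroke_end
        # Default case: stroke starts before the affiliate and spans it.
        return c.stroke_onset < c.affiliate_onset and c.stroke_end >= c.affiliate_end

    print(satisfies_synchrony(ChunkTiming(1.2, 1.8, 0.9, 2.0)))   # True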
Mechanisms of Cross-Modal Coordination
In humans, the synchrony of gesture and speech is
accomplished by means of cross-modal adaptation.
The segmentation hypothesis enables us to treat the
effective mechanisms on different levels of the utterance
and to organize the overall production process in stages.
Producing a Chunk. Within a chunk, the synchrony between the affiliate (or nucleus) and the stroke is mainly accomplished by the gesture adapting to the structure and timing of running speech. In producing a single chunk, the intonation phrase can therefore be synthesized in advance, potentially augmented with a strong pitch accent for narrow focus. As in previous systems (e.g. BEAT⁴), absolute time information at the phoneme level is then employed to set up timing constraints for co-verbal gestural or facial behaviors. The gesture stroke is either set to precede the affiliate’s onset by a given offset (per default one syllable’s approximate duration of 0.3 s) or to start exactly at the nucleus if a narrow focus has been defined. In any case, the stroke is set to span the whole affiliate before retraction starts. This may be achieved for dynamic strokes with a post-stroke hold or additional repetitions, both strategies observable in humans.⁵
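A minimal sketch of how such timing constraints could be derived from the absolute word timings delivered by the text-to-speech stage; the 0.3 s default offset is taken from the text, while the names and signatures are assumed.

    # Derive gesture timing constraints for one chunk from absolute speech timing.
    from typing import Optional, Tuple

    SYLLABLE_OFFSET = 0.3  # approximate duration of one syllable (default stroke lead)

    def stroke_constraints(affiliate_onset: float,
                           affiliate_end: float,
                           nucleus_onset: Optional[float] = None) -> Tuple[float, float]:
        """Return (stroke_start, stroke_end_at_least) for the gesture stroke."""
        if nucleus_onset is not None:
            start = nucleus_onset                  # narrow focus: start at the nucleus
        else:
            start = affiliate_onset - SYLLABLE_OFFSET
        end_at_least = affiliate_end               # stroke (plus hold) spans the affiliate
        return start, end_at_least

    def post_stroke_hold(stroke_motion_duration: float,
                         start: float, end_at_least: float) -> float:
        """Extra hold time needed if the dynamic stroke alone ends too early."""
        return max(0.0, end_at_least - (start + stroke_motion_duration))

    s, e = stroke_constraints(affiliate_onset=2.1, affiliate_end=2.7)
    print(s, e, post_stroke_hold(0.4, s, e))   # stroke starts 0.3 s early, held until 2.7 s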

Combining Successive Chunks. Humans appear to anticipate the synchrony of the forthcoming affiliate (or nucleus) and stroke already at the boundary of successive chunks, since it is prepared there in both modalities: the onset of the gesture phrase co-varies with the position of the nucleus, and the onset of the intonation phrase co-varies with the stroke onset.⁵,¹⁶,¹⁷ In consequence, movement between two strokes depends on the timing of the successive strokes and may range from the adoption of intermediate rest positions to direct transitional movements (co-articulation). Likewise, the duration of the silent pause between the intonation phrases may vary according to the required duration of gesture preparation. Simulating these mechanisms is highly context-dependent, for it has to take into account properties of the subsequent stroke (form, location, timing constraints) as well as the current movement conditions when the previous chunk can be relieved, i.e. when its intonation phrase and its gesture stroke are completed.
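The kind of decision involved at a chunk boundary can be sketched as follows; the thresholds and function names are assumptions rather than the actual planning logic.

    # Sketch of the inter-chunk decision described above: retract to a rest pose or
    # transition directly into the next preparation. Thresholds and names are assumptions.
    def plan_transition(prev_stroke_end: float,
                        next_stroke_onset: float,
                        preparation_duration: float,
                        retraction_duration: float) -> str:
        gap = next_stroke_onset - prev_stroke_end
        if gap >= retraction_duration + preparation_duration:
            return "retract to rest position, then prepare"   # time for full retraction
        if gap >= preparation_duration:
            return "partial retraction / interim rest pose"
        return "direct transitional movement (co-articulation)"

    def stretched_pause(planned_pause: float, preparation_duration: float) -> float:
        # The silent pause between intonation phrases may be lengthened so that
        # gesture preparation can finish before the mandatory stroke onset.
        return max(planned_pause, preparation_duration)

    print(plan_transition(3.0, 3.4, 0.5, 0.6))   # -> direct transitional movement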
The Production Process
Our production model combines the aforementioned coordination mechanisms to create, as seamlessly as possible, a natural flow of speech and gesture across successive coherent chunks. To this end, the classical two-phase planning–execution procedure is extended for each chunk by additional phases in which the production processes of subsequent chunks can relieve one another. Each chunk is processed on a separate blackboard running through a series of processing states (see Figure 3); a rough code sketch of this life cycle is given after the list below.

Figure 2. Sample XML specification of a multimodal utterance.
1. InPrep: Separate modules for speech synthesis, high-level gesture planning and facial animation contribute to a chunk’s blackboard during the overall planning process. The text-to-speech system synthesizes the intonation phrase and controls prosodic parameters like speech rate and intonation to create natural pitch accents. Concurrently, the gesture planner defines the expressive gesture phase in terms of movement constraints by selecting a lexicalized gesture template in MURML, allocating body parts, expanding abstract movement constraints and resolving deictic references (as described by Kopp and Wachsmuth⁷). At this stage, connecting effects are created when a subsequent chunk is anticipated: the pitch level in speech is maintained and gesture retraction is planned to lead into an interim rest position. Once timing information about speech has arrived on the blackboard, the face module prepares a lip-synchronous speech animation using simple viseme interpolation, augmented with eyebrow raises on accented syllables and emotional expression.
2. Pending: Once chunk planning has been completed,
the state is set to Pending.
3. Lurking: A global scheduler monitors the production processes of successive chunks. If a chunk can be uttered, i.e. the preceding chunk is Subsiding (see below), the scheduler defines the intra-chunk synchrony as described above and reconciles it with the onsets of the intonation and gesture phrases. In case the affiliate is located early in the intonation phrase, the scheduler lets the gesture’s preparation precede speech. Due to predefined movement velocity, where movement duration is estimated from its amplitude using a logarithmic law (see the section on ‘Gesture Motor Control’), the vocal pause between subsequent intonation phrases may thus be stretched, depending on the time consumption of the preparation phase (see Figure 3). Besides this possible adaptation to gesture, intonation phrases are articulated ballistically as prepared by the text-to-speech system. Finally, the scheduler passes control over to the successive chunk.
At this point, the motor layer is responsible for, first, planning on-the-fly upper-limb animations of the agent that exactly satisfy the given movement and timing constraints. Secondly, gesture animations must be blended autonomously according to the given timing constraints as well as the current movement conditions. For example, a gesture whose form properties require, under current movement conditions, a more extensive preparation has to start earlier to naturally meet the mandatory time of stroke onset. Since at this point the preceding gesture may not have been fully retracted, fluent gesture transitions should emerge depending on the placement of the affiliate within the verbal phrase (see Figure 3). We describe such a motor control layer for Max in the next section.
4–6. InExec, Subsiding, Done: Depending on feedback information from behaviour executions, which is collected on the blackboard, the chunk state then switches to InExec. Eventually, once the intonation phrase, the […]

Figure 3. Incremental production of multimodal chunks.
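A rough code sketch of the chunk life cycle and the scheduler rule described above; the state names follow the text, while the blackboard and scheduler internals are simplified assumptions.

    # Sketch of the chunk life cycle and the global scheduler rule (simplified).
    from enum import Enum, auto
    from typing import List, Optional

    class ChunkState(Enum):
        IN_PREP = auto()
        PENDING = auto()
        LURKING = auto()
        IN_EXEC = auto()
        SUBSIDING = auto()
        DONE = auto()

    class Chunk:
        def __init__(self, name: str):
            self.name = name
            self.state = ChunkState.IN_PREP

    class Scheduler:
        """Lets each chunk start only once its predecessor is Subsiding (or Done)."""
        def __init__(self, chunks: List[Chunk]):
            self.chunks = chunks

        def step(self) -> None:
            for i, chunk in enumerate(self.chunks):
                pred: Optional[Chunk] = self.chunks[i - 1] if i > 0 else None
                if chunk.state is ChunkState.PENDING:
                    chunk.state = ChunkState.LURKING      # planned, waiting to be uttered
                elif chunk.state is ChunkState.LURKING and (
                        pred is None or pred.state in (ChunkState.SUBSIDING, ChunkState.DONE)):
                    # Here the scheduler would fix intra-chunk synchrony and phrase onsets.
                    chunk.state = ChunkState.IN_EXEC

    chunks = [Chunk("c1"), Chunk("c2")]
    for c in chunks:
        c.state = ChunkState.PENDING                      # planning finished
    sched = Scheduler(chunks)
    sched.step(); sched.step()
    print([c.state.name for c in chunks])  # ['IN_EXEC', 'LURKING']: c2 waits for c1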

References

McNeill D. Hand and Mind: What Gestures Reveal about Thought. University of Chicago Press: Chicago, 1992.

Cassell J, Sullivan J, Prevost S, Churchill E (eds). Embodied Conversational Agents. MIT Press: Cambridge, MA, 2000.

Cassell J, Vilhjálmsson H, Bickmore T. BEAT: the Behavior Expression Animation Toolkit. In Proceedings of ACM SIGGRAPH 2001.
Frequently Asked Questions (19)
Q1. What contributions have the authors mentioned in the paper "Synthesizing multimodal utterances for conversational agents" ?

In this paper, the authors focus on the generation of gesture and speech from XML-based descriptions of their overt form. An incremental production model is presented that combines the synthesis of synchronized gestural, verbal, and facial behaviors with mechanisms for linking them in fluent utterances with natural co-articulation and transition effects. In particular, an efficient kinematic approach for animating hand gestures from shape specifications is presented, which provides fine adaptation to temporal constraints that are imposed by cross-modal synchrony.

Concerning future work, it appears natural to further exploit the flexibility and generality of their synthesis model for the automatic planning of multimodal utterances of a wide variety. The authors expect this to yield a coordinated accentuation, e.g. according to an underlying rhythmic pulse, and to include the timing of velocity peaks of single movement phases, which can be taken into account in their approach. Furthermore, the gesture animation model will be further explored with respect to variations of the parameters, e.g. influencing the relationship between trajectory curvature and velocity.

The animation of co-verbal gesture requires a high degree of control and flexibility with respect to shape and time properties while at the same time ensuring naturalness of movement. 

To ensure fluent, at least C1-continuous connection to the given boundary conditions, a kinematic feedforward controller cannot be created until the moment of activation of its LMP. 
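The extract does not spell out how this feedforward controller is built, but the boundary conditions named (position and velocity at both ends of a segment) admit, for example, the standard cubic Hermite construction. The sketch below shows that generic construction only; it is not the authors' controller.

    # Generic illustration of a C1-continuous connection: a cubic Hermite segment that
    # matches position and velocity at both boundaries (not the paper's controller).
    def hermite_segment(p0: float, v0: float, p1: float, v1: float, T: float):
        """Return x(t) on [0, T] with x(0)=p0, x'(0)=v0, x(T)=p1, x'(T)=v1."""
        def x(t: float) -> float:
            s = t / T
            h00 = 2*s**3 - 3*s**2 + 1
            h10 = s**3 - 2*s**2 + s
            h01 = -2*s**3 + 3*s**2
            h11 = s**3 - s**2
            return h00*p0 + h10*T*v0 + h01*p1 + h11*T*v1
        return x

    seg = hermite_segment(p0=0.0, v0=0.2, p1=1.0, v1=0.0, T=0.5)
    print(seg(0.0), seg(0.5))   # 0.0 and 1.0: endpoints met, slopes match v0 and v1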

The resulting synthetic utterances achieve cross-modal synchrony even at the syllable level while reproducing natural co-articulation and transition effects. 

Their approach to forming wrist trajectories relies on the well-known observation that complex arm movements consist of subsequently and ballistically performed segments with the following kinematic regularities of the effector trajectory:²²
* short targeted segments are straight or curvilinear (either C- or S-shaped) and always planar;
* they exhibit a symmetrical bell-shaped velocity profile;
* a quasi-linear relation between amplitude and peak velocity, as well as an approximate logarithmic relation between amplitude and movement duration, holds;
* at any point except points of extreme bending, the movement speed can be estimated from the radius r of the trajectory by the ‘law of 2/3’: v = k · r^(1/3), where k is a constant velocity gain factor for each segment and assumed to be a parameter of motor control.
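In conventional notation, the two quantitative regularities in this list are often written as follows (general textbook form; the constants a, b and k are segment-specific and not taken from the paper):

    \[
      v(t) = k \, r(t)^{1/3}
      \qquad \text{(the ``law of 2/3'': tangential speed from the radius of curvature } r \text{)}
    \]
    \[
      T \approx a + b \log A
      \qquad \text{(movement duration grows roughly logarithmically with amplitude } A \text{)}
    \]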

As described in the previous section, the motor control layer of Max is in charge of autonomously creating context-dependent gesture transitions. 

For the shoulder and wrist joint, the authors apply the approach by Wilhelms and Van Gelder¹⁸ to define the joint limits geometrically in terms of reach cones with varying twist limits.
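As a toy illustration of the reach-cone idea, reduced to a single circular cone (the actual Wilhelms and Van Gelder reach cones are spherical polygons with per-boundary twist limits), a direction test might look like this:

    # Toy reach-cone style joint limit: is the bone direction inside an allowed cone?
    import math

    def normalize(v):
        n = math.sqrt(sum(x*x for x in v))
        return tuple(x / n for x in v)

    def inside_reach_cone(bone_dir, cone_axis, half_angle_deg: float) -> bool:
        b, a = normalize(bone_dir), normalize(cone_axis)
        cos_angle = sum(x*y for x, y in zip(b, a))
        return cos_angle >= math.cos(math.radians(half_angle_deg))

    print(inside_reach_cone((0.2, 1.0, 0.1), (0.0, 1.0, 0.0), 30.0))   # True
    print(inside_reach_cone((1.0, 0.1, 0.0), (0.0, 1.0, 0.0), 30.0))   # False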

To recombine the LMPs for a solution to the overall control problem, LMPs run concurrently and synchronized in an abstract motor control program (MCP) for each limb’s motion (see Figure 4). 
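A rough sketch of this arrangement, with several concurrently active LMPs contributing to one motor control program (MCP) per limb; the interfaces shown are assumptions, not the paper's API.

    # Several LMPs run concurrently inside one MCP per limb, each contributing to its
    # own degrees of freedom while active. Interfaces are illustrative assumptions.
    from typing import Callable, Dict, List

    class LMP:
        def __init__(self, start: float, end: float,
                     generate: Callable[[float], Dict[str, float]]):
            self.start, self.end, self.generate = start, end, generate

        def active(self, t: float) -> bool:
            return self.start <= t <= self.end

    class MCP:
        """One motor control program per limb, combining its concurrent LMPs."""
        def __init__(self, lmps: List[LMP]):
            self.lmps = lmps

        def update(self, t: float) -> Dict[str, float]:
            targets: Dict[str, float] = {}
            for lmp in self.lmps:
                if lmp.active(t):
                    targets.update(lmp.generate(t))   # later LMPs may override shared DOFs
            return targets

    wrist = LMP(0.0, 1.0, lambda t: {"wrist_x": 0.3 * t})
    hand = LMP(0.2, 0.8, lambda t: {"hand_shape": 1.0})
    mcp = MCP([wrist, hand])
    print(mcp.update(0.5))   # {'wrist_x': 0.15, 'hand_shape': 1.0}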
