Gestures and Phases: The Dynamics of Speech-Hand Communication
Authors: Paul Treffner a; Mira Peter a; Mark Kleidon b
Publication Frequency: 4 issues per year
We investigated how a listener's perceived meaning of a spoken sentence is influenced by the relative timing between a speaker's speech and accompanying hand gestures. Participants viewed a computer-animated character who uttered the phrase, “Put the book there now.” while executing a simple right-handed beat gesture whose location relative to the utterance was precisely controlled in a frame-by-frame fashion. The participant's task consisted of making a judgment about two related aspects of the actor's perceived speech: (a) Which word was emphasized? and (b) How clear was the emphasis? That is, did it make sense? The results revealed that the perceived emphasis was determined by the timing (phasing) of the speaker's hand gesture. Furthermore, the clarity of the perceived emphasis (i.e., meaningfulness) was influenced by the affordances in the immediate environment of the speaker. Discussion addresses the primacy of ostensive specification and gesture in communicative events, the dynamics of speech-hand coordination during both actual and virtual dialogue, and the role of environmental affordances in grounding informative communicative acts in the ecology of organism-environment dynamics.
The question of the extent to which hand gestures (and other movements sometimes referred to as “body language”) might provide a basis for human communication continues to challenge theories of language and language development. The fact that most of us move our hands in spontaneous gesture even if we cannot be seen, for example when talking to a blind person or to someone by telephone (Iverson & Goldin-Meadow, 1997), suggests that a speaker's accompanying hand motion may play a role in helping the listener to directly perceive—or “grasp”—the meaning of an utterance. Why might hand and body movement play such a central role in language? While some researchers believe that manual gestures preceded speech and that a relatively sharp transition from a predominantly gestural to a predominantly vocal form of language occurred as a consequence of a speciation event (Corballis, 2003), others maintain that the process whereby human communication altered from monkey-like actions towards affordances (action-related properties of the environment relevant to action; Gibson, 1979/1986) to a predominantly vocal process was relatively slow and cumulative (Arbib, 2005). However, the observation that coherent, fluent speech is usually accompanied by similarly coherent gestures (Blake, Olshansky, Vitale, & Macdonald, 1997) has led to the proposal that speech and hand gestures may have necessarily co-evolved in a reciprocal manner, with speech and gesture manifesting as complementary aspects of a single production-perception mechanism rather than with gestures preceding speech or vice versa (e.g., McNeill, 2000, 2005).
The reason for the tight temporal coupling between speech and gestures in everyday communication may stem from the common source from which both gestures arose, that is, a general perceptual sensitivity to specific rhythmic patterns (McNeill, 1992). It has been hypothesized that the “ba-ba” babbling of infants (with onset around 7 months) may be a nonverbal motor activity related to the emergent control over mouth and jaw, or a linguistic activity reflecting early sensitivity to phonetic-syllabic patterns (Pettito, Holowka, Sergio, Levy, & Ostry, 2004). Drawing on the classic work by Kendon (1972), McNeill has long argued that there is an extremely close synchrony between gesture and speech such that the two operate as an inseparable, coherent unit that embodies the language production process itself rather than just reflecting different outputs from it (McNeill, 1992, 2000; see also Furuyama, 2002). Thus, words and gestures are not just expressions of thought but instead such acts constitute the thinking process itself. Evidence of this tight synchrony includes the fact that disrupting speech also disrupts gestures and vice versa and that stutterers modify their gestures to maintain synchrony with speech (Mayberry & Jaques, 2000; Mayberry & Shenker, 1997; see also van Lieshout, 2004, and van Lieshout, Hulstijn, & Peters, 1996). At the more macroscopic level of collectives or groups of gestures and their relation to explanatory speech, appropriate coordination is also crucial. The “mismatch” of information expressed in speech and that expressed via gesture can, surprisingly, facilitate learning. For example, children (especially those who were in a transitional phase of comprehending) learned more effectively when a teacher attempted to explain the mathematical concept of “equivalence” through simultaneously offering complementary but not identical information in speech and hand gesture (Goldin-Meadow, 1999; Singer & Goldin-Meadow, 2005). Likewise, research on language acquisition indicates that gesture plays a key role in a child's learning, for example, in learning to count (Alibali & DiRusso, 1999; Mayberry & Nicholadis, 2000).
From the perspective of ecological psychology, in order to satisfy the most basic theoretical demands of logical consistency and empirical relevance, communicative behavior and its corresponding semantics must be “bound to” or “grounded within” an organism-relevant, action-oriented environmental context (Shaw, Turvey, & Mace, 1981; Turvey, Shaw, Reed, & Mace, 1981). Ostensive specification (e.g., pointing with the finger) is the most basic act of making another human aware of one's intentions via a communicative technique that makes direct epistemic contact with the environment—encountering without “physically” touching—but resulting in knowledge of comparable reliability (Reed, 1996; Shaw, 2001, 2003; Shaw et al., 1981). Ostensive specification in the form of pointing typically has a standardized form within a given culture (McNeill, 2000). Although the Western standard form for pointing is with the index finger extended and the other fingers curled in, some cultures use two fingers or the entire hand; such forms of pointing are still recognized in Western cultures. As a form of ostensive specification, pointing (and its meaning) can be fully perceived without accompanying speech. Pointing can provide the somatic basis for a perceiver to apprehend (literally, “grasp”) a speaker's intended meaning. That is, a listener may directly perceive (understand) what a speaker means largely because the listener has an inherent haptic awareness of his or her own limbs and extremities, specifically their position and orientation (Pagano & Turvey, 1995), as well as, it is important to note, their dynamics and phase relations (Wilson, Bingham, & Craig, 2003). At stake is an appreciation of the possible role that gesture plays in grounding communication by affirming the primacy of direct perception—not just perception of an object, environment, or situation—but perception of an object, environment, or situation as indicated to a perceiver by a speaker during a communicative speech act (Shaw et al., 1981). Ostensive specification can turn a communicative act from semantically opaque (i.e., meaningless) into a semantically transparent and clearly understandable utterance. In such specifying acts, an object or event, or more precisely, a “complex particular,” is specified by the speaker and consists of the coordination of agent, situation, and occasion as a self-presenting (i.e., represents itself) fact rather than as a symbolic representation (i.e., represents something other than itself). Said differently, it is the force of existence (knowledge from acquaintance) rather than the force of argument (knowledge from description) that underlies even the possibility (and certainly the felicity) of communicative acts between a speaker-actor and a perceiver-listener (Shaw et al., 1981; see especially pp.199-209). Indeed, the power of ostensive specification is such that even nonhuman species apparently understand (or can directly perceive) the meaning of a human's gestural command, such as when a dog's owner gestures to “go fetch”—not only to go fetch an object but even to go to a particular location if several fetchables are available.
However, pointing is by no means the only form of manual gesture that humans exhibit. Pointing is one component of a wider class of gestures known as gesticulation, which can be subdivided into iconic, metaphoric, deictic, and beat gestures (Kendon, 1974). Both iconic and metaphoric gestures are considered “pictorial” gestures in that the speaker uses the gesture to help convey awareness of the shape or style of the object or abstract concept being referred to (e.g., the shape of a spiral staircase or the idea of openness). Pointing gestures are also known as deictic gestures and often clearly specify a referent in the environment or context of the discourse. A beat gesture or simply, beat, is a short, rhythmic movement or series of movements of the hand or hands that can gauge and influence both the overall rate at which a discourse occurs, the transitions between sections and topics of the discourse, as well as more subtle aspects of emphasis and semantics. In this research we are primarily concerned with beats and the extent to which they may also play a dual “deictic role” that is dependent on both their temporal relation to speech and on the environmental context within which they occur.
Numerous findings support the idea that manual gesture and speech execution are tightly coupled, both in fluent and in stuttered speech (Mayberry & Jaques, 2000). Additionally, it has been observed that the majority of gesture strokes (the most meaningful and effortful part of the gesture) occur just before or during the speaker's articulation of the most contrastively stressed syllable (Kendon, 1974; McNeill, 1985, 1992), whereas the preparation phase often anticipates speech (McNeill, 1992). Similarly, Nobe (2000) showed that most representational gestures had their onset during speech articulation. However, it was observed that in 10% of cases speech and the accompanying hand gesture were actually asynchronous—the onset of the hand gesture preceded the co-expressive speech by at least 250 ms, which roughly corresponds to the duration of a syllable (Nobe, 2000). Similarly, it was shown that gesture onset and speech production were synchronized within a few milliseconds of one another although the hand gestures slightly preceded speech (Morrel-Samuels & Krauss, 1992).
It has been suggested that beat gestures are used to emphasize a particular word or phrase and usually correspond temporally to a single syllable. It has also been proposed that they are used as a rhythmic marker that can guide organization of prosodic phrases (McNeill, 1992). However, variability has been observed in the timing between stressed syllables and accompanying beat gesture strokes. While the synchronization between deictic gestures and speech has been shown to be clear, it has been argued that gestural beats and verbal stress are not synchronized in such a strict rhythmic manner (McClave, 1994). As beat and deictic gestures may be considered the most primitive type of communicative movements, and because beats are relatively conspicuous and easy to analyze (e.g., a simple up-down or left-right movement of the hand or forearm), it is instructive to focus on this form of linguistic action before addressing more complex phenomena such as the iconic or metaphoric gestures found in everyday conversation.
Although beat gestures usually occur naturally during speech, on some occasions there may be little or no perceptible movement of the hands during conversation, such as when the hands are physically restricted when employed for another activity or when the hands of the speaker are not visible such as during a phone conversation. How then does this affect clarity of linguistic perception during communication? In driving a car, it has been shown that when the driver's conversation is directed outside of the immediate environment of the vehicle—as when using a hands-free mobile phone—coordination and control of driving seriously deteriorates (Treffner & Barrett, 2004). Data on the stability and skill of driving suggest that under such conditions a driver's perception and attention are compromised (Treffner, Barrett, & Petersen, 2002). This addresses the oft-noted observation that it is easier (less demanding of linguistic attention) for a driver to converse with a front-seat passenger than for a driver to speak to someone at a distance using a mobile phone. This may be because the driver can (peripherally) view various informative gestures of the adjacent passenger such as hand motion, head nodding, and facial expressions while simultaneously maintaining awareness of the immediate driving environment. The driver who uses a mobile phone has no such nonverbal information available and so deciphering the other person's speech is made all the more demanding of attention. Given the key role of attention in maintaining dynamic stability during critical acts of motor coordination (e.g., Treffner & Kelso, 1999), having nonverbal information available facilitates a more immediate and direct apprehension of linguistic content, which in turn can release attentional resources for the critical task of safely controlling the vehicle. Examples such as these demonstrate the prevalence and importance of nonverbal gestures in everyday communication.
Regarding hand preference during gestural coordination, Kimura (1973) noted that the right hand of right-handers was chosen for free movements that accompanied speech. However, no preference was found when displaying iconic gestures while speaking (Lausberg & Kita, 2003) and that the left and right hands were used equally often for beat gestures (Sousa-Poza, Rohrberg, & Mercure, 1979). In the experiment reported herein, hand gestures of the animated character were performed unimanually and exclusively using the right hand. Although not essential to the design of our experiment, this is consistent with our previous data on both left- and right-handers suggesting that speech-hand coordination dynamics are supported by a common timing system primarily involving the left cerebral hemisphere, regardless of handedness (Treffner & Peter, 2002; Treffner & Turvey, 1995, 1996).
In order to investigate the effects of speech-hand timing on sentence perception, it is necessary to strictly control the relative temporal positions of the manual gesture and its lexical affiliate. This is impossible in a normal biological system as the constraints, both neuromuscular and dynamical, are adamant of the requirement that normal functioning (goal-directed coordination for conveying meaning) should not be violated. For this reason we must create a situation that is not so artificial that it cannot be treated as potentially realistic but not so natural that it cannot be parametrically controlled. We therefore turned to computer animation and the ability to create a reasonably lifelike character or “avatar” that exhibits gestures during accompanying speech (Stanney, 2002).
Our prior research into speech-hand coordination focused on the production side of the phenomenon and how the synchrony or asynchrony (i.e., phasing) of speech-hand coordination evolves as a function of handedness, hand, and rate of production (Treffner & Peter, 2002). In the current research we focus on the perceptual consequences of viewing differential phasing phenomena. More precisely, we investigate how a listener's perception of the meaning of a speaker's utterance is influenced by visual information that specifies the relative timing of beat gestures and associated speech. In particular, regarding a sentence's semantics, we investigated whether the listener's perceived focus of a sentence depends on the temporal coordination of a speaker's gesture and speech and also whether the perceived clarity of a sentence depends on environmental context.
A total of 14 individuals (5 females and 9 males) ranging from 15 to 48 years of age, volunteered to participate. Participants did not know that gestures were the topic of interest. Advertisements and instructions simply described the experiment as “about communication.”
A full-color 3D computer-generated animated character was created as a means to synchronize a hand gesture to precise locations along the acoustic speech signal. Animations were created on a computer using key framing techniques using a combination of the animation editing programs Poser, 3D Studio Max, and Premiere. The phrase “Put the book there now.” was chosen as an utterance and the speaker's mouth, cheeks, and chin were rendered to reproduce, as faithfully as possible (given available desktop PC resources) the movements that would accompany such a phrase. An inter-frame resolution of 33.33 ms yielded an animation of 58 frames in length and a total animation sequence of 1.91 s in duration (Figure 1).1
The times between the end of one word and the beginning of the following were adjusted such that the gaps between words sounded equal. Almost all intonation (e.g., relative pitch and tone differences) was removed from the speech signal in an attempt to eliminate acoustic information about the “focus” of the utterance and, also important, removed a possible gaze bias toward hand gesture stemming from prosody (McClave, 1997). Although somewhat monotonic, the utterance was a processed version of actual continuous human utterances and was still reasonably natural; it certainly sounded more natural than synthesized “computer-speech.”
Because we are interested in the relation between relative timing of gestural movement and meaning, we chose as simple a hand movement as possible while still retaining its “gestural” properties. We chose a beat gesture that involved a simple “out-hold-in” hand motion beginning with the right hand with closed fist postured in front of the chest, extension at the wrist with extension of fingers (pre-stroke, 3 frames in duration), a brief hold of this posture (mid-stroke, 4 frames), then flexion of wrist and fingers (post-stroke, 3 frames) to return to a posture with closed fist positioned in front of the chest (Figure 2). The complete gesture lasted 333 ms or a “window” of 10 frames in size. In order to create animations for the experimental conditions, the gesture was incorporated into the animation such that the midpoint of the 10-frame gesture window was subsequently centered on each of 23 positions corresponding to frames 27 through 49 (inclusive) within the subsegment of the utterance corresponding to “… book … there …” (Figure 1). Because in each animation the middle of the 10-frame gesture window was centered on one of 23 individual frames from the “book-there” segment, the initiation of the gesture would begin earlier. For example, for a gesture 10 frames in length centered on frame 32 (beginning of “book”), the initiation of the gesture would occur in frame 27 (i.e., 32 - 5 = 27), which corresponds to the end of “the” (Figure 1). In sum, the animated speaking articulatory movements and audio speech signal remained constant for all animations. The main difference between the animations involved where the hand gesture was synchronized (i.e., which section of the speech signal). Since a “gesture” is not a momentary occurrence but rather a spatiotemporal event lasting 10 frames, synchronizing the gesture with different portions of the speech event involved sliding a gestural window along the speech signal; gestural location is then defined as the frame of the speech signal on which the gestural window is centered.
As we wished to investigate the role of environmental context in the perception of speech-hand gestural communication, half of the animations included a table in the background of the animated speaker while in the other half the table was absent. For control condition purposes, two animations were created whereby the speaker made no gesture (with and without a table). In these cases, the speaker's arms and hands remained motionless by his sides (Figure 3).
Either a large back-projection video screen (2.5 m x 2 m) or a computer monitor was used to present stimuli to participants. Participants were positioned at a distance from the large screen such that the visual angle subtended was similar to that when using the smaller computer monitor. Audio volume presented via loudspeakers behind either the large screen or the smaller monitor was adjusted to provide similar volumes. Video segments were played to participants using Media Player.
Each participant was seated in front of the display and a series of animations was shown whereby an animated male character uttered the phrase “Put the book there now.” The participant was required to view each animation and record responses on paper at the end of each individual 1.5-s animation. Participants were asked to address two specific issues:
Response sheets consisted of three columns headed “Trial number,” “Focus,” and “Rating.” In the “Focus” column, a participant was asked to circle one of the five words printed to indicate which word he or she thought was the speaker's intended focus of the sentence (i.e., “categorization”). Similarly, for each trial a participant had to indicate on a 5-point scale his or her level of agreement with the statement “The focus of the sentence was emphasized clearly” (i.e., “clarity”). The participant was asked to record his or her level of agreement by circling one of a list of five numbers (1 = strongly agree, 2 = agree, 3 = neither agree nor disagree, 4 = disagree, 5 = strongly disagree).
Several reasons pertain to designing a dependent measure of “clarity” for perceptual experiments. It provides an additional insight into the participant's perception. It urges the participant to attend carefully to what is perceived by requiring a “qualitative” more reflective second-order evaluation (how clear was the perception?) as well as the first-order “quantitative” judgement (what was perceived?). This avoids potentially less conscientious pencil-circling indications of one of the five words heard. It provides a way to evaluate the degree of “meaningfulness” of perception—to what extent the perceived word “makes sense.” It is one thing to “hear” a certain word uttered (e.g., consider a computerized parser or “speech perception” program); it is quite another to perceive that the word perceived makes sense given the accompanying spatiotemporal context of body movements and environment. The latter reason is perhaps the most forceful for incorporation of a measure of clarity of perception. The main thrust of our experimental design was to disassociate the normal tight synchrony between body and vocal languages in order to reveal the consequent effect on meaningfulness (or “clarity” or “certainty” or “sensibleness”) of judgment.
Participants were not constrained as to where in the animation they looked. Thus, participants were, as is normal, free to focus on the head, torso, face, mouth, hand, or any other combination of body or environmental features. However, in order to complete the experimental requirements of circling which word was perceived and how clearly it was perceived to be emphasized, it must be assumed that participants were aware of both hand motion and words uttered.
In the experiment there were 144 trials consisting of three repetitions of each of the 48 conditions The trials were completely randomized and each participant was presented with a different random order of 144 trials. Conditions included (a) 23 animations with gesture and a table, (b) 23 animations with gesture but no table, (c) one animation with no gesture and a table, (d) one animation with no gesture and no table. Conditions “c” and “d” were used as control conditions (Figure 3).
Mean percentages were calculated for the categorization and ratings data for each frame (on which the gesture was centered) by averaging across the three repetitions of each condition and across participants. The paired sample t test (p < .05) was used for both categorization and ratings data to compare table-present and no-table conditions as well as pairs of frames within a table-present or no-table condition. In accord with standard psychophysics conventions, the response threshold was set at 50% to determine the region in which speech-hand synchrony could be considered to influence the perceived focus of the sentence (i.e., categorization). To determine the region where the relative timing of gesture and speech appears to yield important information for perception and comprehension, we introduce the “perceived synchrony region (PSR)” for each of the five words in the utterance “Put the book there now.” The PSR is defined, using the results data, as the region bounded by the earliest and latest frames in which a particular word (e.g., “book”) is perceived (chosen) by the participant as the focus of the sentence 50% or more of the time when the gesture in the animation was synchronized with (centered on) that frame.
RESULTS AND DISCUSSION
Figure 4 shows the results for perceived focus and word categorization when the gesture was centered on (synchronized to) a particular frame of the animation and utterance. As there was no intonation information available in the speech of the animation, any perceived focus exhibited by participants must be due to factors other than tonal emphasis (e.g., gestural position, context, etc.). For the word book the PSR extended from Frame 28 to Frame 37 (69.05% and 64.28%, respectively) when the table was present and from Frame 29 to Frame 37 (64.28% and 73.81%, respectively) when the table was absent. For the word there the PSR extended from Frame 38 to Frame 48 (52.38% and 52.38%, respectively) for both table-present and table-absent conditions. Clearly, when a gesture is coordinated with a speaker's utterance, the gesture can dramatically influence a listener's perception of which word is emphasized. Further, perceived emphasis can be influenced by a gesture that precedes the start of the acoustic signal of the perceived word. Thus, in Figure 4 the “book” PSR (table present) begins at Frame 28 but the corresponding acoustic signal does not begin until Frame 31. At Frame 28, 69% of responses indicated “book” as the focus even though at this point the acoustic speech for “book” has not yet commenced. The middle of the gesture precedes the start of the speech signal for “book” by about 133 ms (4 frames). It should be noted the start of the gesture is a further 5 frames before the midpoint so in this example the onset of the associated gesture actually began at Frame 23 (= 28 - 5), which is in the middle of the speech for “the” and approximately 300 ms before speech onset for “book.” The “priming” effect of gesture—either fledgling pre-stroke or fully-fledged mid-stroke—on speech, may surreptitiously provide “heads-up” information to the listener that allows the listener's attention to be directed to currently unfolding (nonacoustic) information about the current communicative event, which can help a listener attune to and anticipate what the speaker's intended emphasis should be perceived to be.
Categorization of "book," table
Within the “book” PSR (Frame 28 to Frame 37), “book” was selected significantly more often when the gesture was centered on Frame 33 than on Frame 34 (85.71% vs. 73.81%, respectively; t(13) = 2.69, p < .05) (Figure 4). No other significant differences were present within these frames.
A significant difference was noted between Frame 27 at the end of the region for “the” and Frame 28 at the beginning of the “book” PSR (45.24% vs. 69.05%, respectively; t(13) = 2.22, p = .05). Also notable is the difference between the last Frame 37 in the “book” PSR and Frame 38, the first frame of the “there” PSR (64.28% vs. 38.09%, respectively; t(13) = 2.35, p < .05).
Categorization of "book," no table
Within the PSR for “book” (Frame 28 to Frame 37), the word book was selected significantly more often with the gesture centered on Frame 33 than on Frame 31 (88.09% vs. 71.43%, respectively; t(13) =2.18, p = .05). No other significant differences were present within the PSR for “book.”
No significant differences were noted between Frames 27 and 28 (38.09% vs. 50%, respectively). However, the difference noted between Frame 37 at the end of the “book” PSR and Frame 38 at the beginning of the “there” PSR was significant (73.81% vs. 42.86%, respectively; t(13) = 2.88, p < .05).
Categorization of "there," table
Within the “there” PSR (Frames 38 to 48), no significant differences between frames were found. A difference was found between Frame 37 at the end of the “book” PSR and Frame 38 at the beginning of the “there” PSR (28.57% vs. 52.38%, respectively; t(13) = 2.22, p = .05). Also notable is the difference between the last frame (48) in the “there” PSR and the first frame (49) in the “now” PSR (52.38% vs. 33.33%, respectively; t(13) = 2.51, p < .05).
Categorization of "there," no table
Within the “there” PSR (Frames 38 to 48), “there” was selected significantly more often in Frame 43 than in Frame 41 (88.09% vs. 76.19%, respectively; t(13) =2.18, p = .05). Also, “there” was selected significantly more often in Frame 43 than in Frame 44 (88.09% vs. 78.57%, respectively; t(13) =2.18, p < .05) and more often in Frame 46 than in Frame 47 (83.33% vs. 61.90%, respectively; t(13) = 2.59, p < .05). No other significant differences were found within this PSR.
A significant difference was noted between Frame 37 at the end of the “book” PSR and Frame 38 at the beginning of the “there” PSR (21.43% vs. 52.38%, respectively; t(13) = 3.04, p < .01). However, the difference between the last frame (48) in the “there” PSR and the first frame (49) in the “now” PSR was nearly significant (52.38% vs. 35.71%, respectively; t(13) = 1.99, p = .07).
No-gesture, table versus no table
Regarding the two control conditions (i.e., utterance alone, no synchronous gesture, arms by sides), with the table present, “there” was chosen most often as the focus of the sentence (19.05%), although the magnitude was not significantly greater than for any other word categorized. It is important to note that with the table absent, “there” was not selected at all and “book” was perceived significantly more often than “there” to be the focus of the sentence (26.19% vs. 0%, respectively; t(13) = 2.78, p < .05). However, “book” was not selected more often than “put” (7.14%), “the” (16.67%), or “now” (11.90%), and “now” was chosen significantly more often than “there” (t(13) = 2.11, p = .05).
Discussion of categorization results
The PSRs spanned between 9 and 11 frames (e.g., the PSR for “there” spanned Frames 38 through 48, inclusive). Within the PSR of a given word, few frames exhibited a selection rate that was significantly different from any other frame in the region. This result indicates that a synchronous hand gesture has a relatively uniform influence on the perceived focus of a spoken sentence regardless of position or timing of the gesture.
The effect of a beat gesture was evaluated by synchronizing the mid-stroke of the gesture to different positions along the speech signal of the animation. The participant's response to (perception of) this synchrony showed that the perceived focus of the speaker's utterance was based on a temporal region (the PSR) that begins well before the onset of the perceived word's speech signal and a region that ends fairly abruptly during or after the speech signal's offset. The PSR for “book” begins at Frame 28 before the speech signal that begins at about Frame 31, whereas both the PSR and the speech signal end at about Frame 37 (Figure 4). Likewise, the PSR for “there” begins at Frame 38 but the speech begins later, at around Frame 40; the PSR ends at Frame 48, as does the speech signal. This finding shows that a gesture that slightly precedes or is phase-advanced relative to its lexical affiliate will have greater influence on perceived emphasis than a gesture synchronized later or at the end of its associated speech-generated acoustic event. A gesture “perfectly synchronized” or in-phase relative to the accompanying acoustic event has the greatest chance of influencing a listener's perception of intended emphasis. However, given the phenomenon of coarticulation, it may be that phase-advanced gestural events help provide the important anticipatory information necessary to specify the intended content of speech acts that might otherwise become lost in and muddled by the astounding unsegmented continuity of the human acoustic speech signal.
The paradigm also allowed us to investigate the influence of environmental context and whether perception of the focus of a sentence was dependent on the presence of a relevant object (a table) in the environment that was potentially related to the topic of the sentence. Results indicated that there was no real difference in categorization likelihood between the table-present and table-absent conditions when gesture was involved. This held except for “book” near the start of its PSR (Frame 29), where “book” was perceived as the focus more often with the table than without the table (80.95% vs. 64.28%, respectively; t(13) = 2.46, p < .05). However, recall that for the no-gesture condition there was a definite effect of the table such that “there” was perceived as the focus significantly more often when the table was present than when absent.
Thus, when a gesture accompanies (monotonic) speech, the influence of environmental context appears diminished in comparison with the salient visual information provided by a synchronous hand gesture. However, the results suggest that environmental context can influence the perceived focus of a sentence in the absence of disambiguating information from gestures. An explanation for this finding would be that the table is perceived to afford a surface upon which to put a book and, moreover, to put it in a definite location (i.e., there). We take this as evidence that when apprehending the meaning of communicative utterances, perceivers take account of the affordances of the immediate environment and that the latter can contribute to the (direct) perception of the meaning of a speaker's utterance (Fowler, 1986; Gibson 1979/1986).
Transition region: "book" versus "there."
In Figure 4 a clear transition can be seen between the PSRs for “book” and “there.” Response data on the two PSRs were compared around the crossover region (Frames 36, 37, 38, 39, and 40). In the table-present condition, for Frame 36, “book” was chosen significantly more often than “there” (80.95% vs. 9.52%, respectively; t(13) = 6.20, p < .001). For Frames 37, 38, and 39, no significant differences were found between rates for choosing “book” and “there.” For Frame 40, “there” was chosen in preference to “book” (71.43% vs. 21.43%, respectively; t(13) = 3, p = .01). Thus, a 5-frame book-there crossover region exists for Frames 36 through 40, where uncertainty between perceiving “book” and “there” exists for the central 3 frames (37, 38, and 39). Preceding this uncertainty region with a gesture synchronized on Frame 36, participants overwhelmingly chose “book” over “there.” Likewise, following this region with a gesture synchronized on Frame 40, participants overwhelmingly chose “there.”
In the no-table condition, for Frame 36, “book” was chosen more often than “there” (76.19% vs. 19.05%, respectively; t(13) = 3.92, p < .005). Interestingly, for Frame 37, in contrast to the table-condition, “book” was chosen more often than “there” (73.81% vs. 21.43%, respectively; t(13) = 3.14, p < .01). For Frames 38 and 39, no differences were found between responses of “book” and “there”. For Frame 40, “there” was chosen more often than “book” (69.05% vs. 26.19%, respectively; t(13) = 2.59, p < .05). Hence, in the absence of a table, the crossover region and region of uncertainty was compressed compared with when a table was present and encompassed only 4 frames instead of 5 (37, 38, 39, and 40) with the central frames 38 and 39 providing a region of uncertainty between choosing “book” and “there.” Therefore, with a table present, a gesture centered on Frame 37 yielded uncertainty as to whether “book” or “there” had been emphasized (i.e., perhaps the observer thinks, “Maybe the word 'there' instead of 'book' was emphasized—that would make sense because there's a table in the background!”). This is all the more telling because at Frame 37 the acoustics speech signal for “book” has not finished and the speech signal for “there” has yet to commence.
In sum, the preceding results show that the absence of a table increases the disposition for a participant to perceive “book” as the focus when a synchronous gesture is performed, whereas the presence of an environmental object (a table) increases the disposition to perceive the focus of the sentence as a word that can also meaningfully relate to the contextual environment (i.e., “there”). This subtle but definite effect of the location of a synchronous gesture demonstrates that the environment of a speaker, if relevant to the speaker's utterance (i.e., it has action-related affordance properties), does influence the semantic content of perceived speech.
Clarity of "book," table
In the following sections, we report the extent to which participants agreed or disagreed that their perceived focus was emphasized clearly. Considering the table-present condition and those participants who perceived that the focus of the sentence was “book” (cf. the PSR for “book,” Frames 28-37), more participants “strongly agreed” that the emphasis was clear when the gesture was centered on Frame 29 (nearer the beginning of the speech signal for “book”) than when the gesture was synchronized with Frame 28 (21.43% vs. 9.52%, respectively; t(13) = 2.11, p = .05) (Figure 5). Also, more “strongly agreed” that the emphasis on “book” was clear for a gesture synchronized with Frame 34 (farther from the end of the speech signal for “book”) than when synchronized with Frame 35 (21.43% vs. 9.52%, respectively; t(13) = 2.11, p = .05).
For the “agree” rating, participants judged the emphasis on “book” to be clear more often when the gesture was centered on Frame 30 than on Frame 31 (52.38% vs. 30.95%, respectively; t(13) = 2.59, p < .05). Also, participants “agreed” the emphasis was clear more often for Frame 33 than for Frame 31 (45.24% vs. 30.95%, respectively; t(13) = 2.12, p = .05). More participants “agreed” the emphasis was clear for Frame 36 than for Frame 37 (47.62% vs. 23.81%, respectively; t(13) = 2.11, p = .05). No other significant differences were found for any of the other clarity ratings within the PSR for “book.” Also, no other differences were found for any of the ratings between Frame 27 at the end of the PSR for “the” and Frame 28 at the beginning of the “book” PSR. There was also no significant difference found between Frame 37 at the end of the “book” PSR and Frame 38 at the beginning of the “there” PSR.
Clarity of "book," no table
Without the table present, within the “book” PSR, more participants “strongly agreed” the emphasis on “book” was clear when the gesture was centered on Frame 33 than on Frame 31 (38.09% vs. 19.05%, respectively; t(13) = 2.51, p < .05) (Figure 5). Participants also exhibited ambivalence regarding clarity of their perceived focus such that more selected the rating “neither agree nor disagree” for gestures centered on Frame 35 (the relatively silent closure region toward the end of the speech for “book”) than for Frame 36 (corresponding to the voiceless stop, “k,” at the end of “book”) (21.43% vs. 4.76%, respectively; t(13) = 2.46, p < .05). No other significant differences were noted for adjacent frames of this PSR.
More participants “disagreed” that the emphasis on “book” was clear when the gesture was centered on Frame 27 (end of “the” PSR) than on Frame 28 (beginning of “book” PSR) (2.38% vs. 21.43%, respectively; t(13) = 2.51, p < .05). Correspondingly, more participants “agreed” that the emphasis on “book” was clear for Frame 37 (end of “book” PSR) than for Frame 38 (beginning of “there” PSR) (40.48% vs. 11.9%, respectively; t(13) = 2.28, p < .05).
Clarity of "there," table present
When the table was present, within the PSR for “there” (Frames 38-48) more participants “strongly agreed” that the emphasis on “there” was clear when the gesture was centered on Frame 39 closer to the start of the speech for “there” than farther from it on Frame 38 (19.05% vs. 4.76%, respectively; t(13) = 3.12, p < .01) (Figure 6). Similarly, more participants “agreed” the emphasis was clear for Frame 40 than for Frame 39 (42.86% vs. 26.19%, respectively; t(13) = 2.19, p = .05). Surprisingly, more participants “agreed” the emphasis was clear when the gesture was centered on Frame 43 than on Frame 42 (42.86% vs. 26.19%, respectively; t(13) = 2.46, p < .05) and, as might be expected, for Frame 46 compared with 47 (toward the end of the speech) (42.86% vs. 23.81%, respectively; t(13) = 2.28, p < .05). No other significant differences were noted between frames within the “there” PSR for the other clarity ratings.
Although no significant differences were found for any ratings between Frame 37 at end of “book” PSR and Frame 38 at the beginning of “there” PSR, more participants “agreed” the emphasis on “there” was clear for gestures centered on Frame 48 at the end of “there” PSR than on Frame 49 at the beginning of “now” PSR (16.67% vs. 2.38%, respectively; t(13) = 2.12, p = .05).
Clarity of "there," table absent
Without the presence of a table, and for “there” PSR, more participants “strongly agreed” the emphasis on “there” was clear for Frame 41 (which is close to the center of the speech signal) compared with Frame 40 (which is close to speech onset) (26.19% vs. 11.9% respectively; t(13) = 3.12, p < .01) (Figure 6).
For the frames flanking “there” PSR, the only significant difference noted was for the rating “agree” for gestures centered on Frame 37 at the end of “book” and on Frame 38 at the beginning of “there” (9.52% vs. 28.57%, respectively; t(13) = 2.51, p < .05). Also, more participants were ambivalent and “neither agreed nor disagreed” that the emphasis was clear for Frame 43 than for Frame 42 (16.67% vs. 4.76%, respectively, t(13) = 2.11, p = .05).
Clarity of "the" and "now"
As reported earlier, significant differences in clarity ratings were found most commonly within the PSRs for “book” and “there.” In contrast, no differences were found for “put,” and few differences were revealed for the words “the” and “now.” The only significant difference for clarity ratings for “the” involved the rating “agree” and was found in the no-table condition between Frame 27 at the end of the speech for “the” and Frame 28 at the start of “book” PSR (28.57% vs. 11.9%, respectively; t(13) = 2.88, p = .01).
For categorization responses of “now,” the only differences involved the ambivalent response, “neither agree nor disagree.” With table present, the ambivalent response was chosen less for Frame 48 at the end of “there” PSR than for Frame 49 at the beginning of “now” (2.38% vs. 14.28%, respectively; t(13) = 2.11, p = .05). Similarly, with table absent, the ambivalent response was selected less for Frame 47 near the end of “there” PSR than for Frame 48 near the beginning of “now” (0% vs. 9.52%, respectively; t(13) = 2.28, p < .05). Ambivalence may reflect indefinite perception. Since no conditions involved instances where a gesture was synchronized with the words “put,” “the,” or “now,” and PSRs were only found for those words where a synchronous gesture existed (“book” and “there”), the paucity of significant differences in clarity ratings for the other words reflects the lack of perceiving these words as the focus.
Clarity ratings, gesture: Table versus no-table
With gesture present, for the clarity rating “strongly agree,” no significant differences were found between table-present and table-absent conditions for any frame for any of the words. For the rating “agree,” a significant effect of the table was noted for the words “book,” “there,” and “the.” The number of participants who “agreed” that the emphasis on “book” was clear when the gesture was centered on Frame 28 (at the start of “book” PSR) was significantly greater when the table was present compared with when it was absent (23.81% vs. 11.9%, respectively; t(13) = 2.11, p = .05) (Figure 5). Oddly, for Frame 31, significantly more participants “agreed” that the emphasis on “book” was clear when the table was absent compared with when it was present (40.48% vs. 30.95%, t(13) = 2.28, p < .05). However, for Frame 35 (as with Frame 28), more participants “agreed” the emphasis on “book” was clear when the table was present compared with when it was absent (47.62% vs. 33.33%, respectively; t(13) = 2.12, p = .05.). Overall, it seemed the presence of a table increased clarity of perception that the focus was on “book.”
For perceptions of “there” as the focus of the sentence, when the gesture was centered on Frame 46, significantly more participants “agreed” that the emphasis on “there” was clear when the table was present compared with when it was absent (42.86% vs. 28.57%, respectively; t(13) = 2.48, p < .05) (Figure 6). Because of the close relation between the affordance properties of a table and the semantics of “there,” the preceding result can be well motivated.
For perceptions of “the” as the focus of the sentence, when the gesture was centered on Frame 27 more participants “agreed” that a focus on “the” was clear when the table was absent compared with when it was present (28.57% vs. 14.28%, respectively; t(13) = 2.12, p = .05). This can be understood by realizing that without the table, there is less bias from “book” and “there” (table-related concepts) that could reduce the participant's certainty and clarity that “the” was the focus of the sentence. For the ambivalent clarity rating of “neither agree nor disagree,” no significant differences were noted between table-present and no-table conditions in any of the frames for any word.
For the clarity rating “disagree,” significant differences were found for the words “book,” “there,” and “the.” In Fame 27 (immediately preceding “book” PSR), more participants “disagreed” that the perceived emphasis on “book” was clear when there was a table compared with when it was absent (21.43% vs. 2.38%, respectively; t(13) = 2.51, p < .05) (Figure 5). Thus, although the gesture occurred immediately prior to the PSR for “book,” some still perceived “book” as the emphasis of the sentence. In such cases, however, a participant thought his or her perception of “book” was more unclear when there was a table in the background than when there was no table.
It is important that in Frame 46 near the end of “there” PSR, more participants “disagreed” that the perceived emphasis on “there” was clear when there was no table compared with when it was present (23.81% vs. 9.52%, respectively; t(13) = 2.12, p = .05) (Figure 6). That is, without a table to help justify the speaker's emphasis on “there,” the listener thought his or her perception of “there” was unclear. This finding provides the complement for the previous result from the “agree” ratings for “there” at Frame 46, where it was shown that the presence of a table increased the clarity of perceiving “there” compared with when there was no table.
For those who perceived “the” as the emphasis, when the gesture was positioned at Frame 28 (after “the” PSR and beginning of “book” PSR), more “disagreed” that the perception of “the” was clear when there was no table compared with when it was present (11.9% vs. 0%, respectively; t(13) = 2.11, p = .05). That is, with the mid-stroke of the gesture located to follow “the” PSR, the participant was more unclear regarding his or her perception of “the” when there was no table than when there was a table. Indeed, when there was a table, no participant who perceived “the” thought that it was unclear. This result reflects the semantics of the verb “put” and the definite article “the.” The action word, “put,” entails a location in which to place something. A table in the speaker's background would be consistent with the first word spoken, “put.” For those participants who perceived and chose “the” as the emphasis, and although “the” was perceived unclearly (not surprising given the gesture's location), it was perceived more unclearly when there was no appropriate affordance (e.g., a table) onto which an article could be put (e.g., the definite article referred to by “the” in the sentence). Finally, for the clarity rating “strongly disagree,” no significant differences were found between table and no-table conditions for any of the words.
Clarity ratings, no gesture: Table versus no-table
When there was no co-occurrent gesture (i.e., the control condition), the only significant effect of the table was for the clarity rating “neither agree nor disagree” for the word “there.” More participants were uncertain whether the emphasis on “there” was clear when the table was present compared with when it was absent (11.0% vs. 0%, respectively; t(13) = 2.11, p = .05).
From the preceding sections on clarity of perception we can conclude that environmental context plays a subtle but definite role in affecting a listener's perceived meaning. Participants who perceived (categorized) the focus of the sentence to be “there” (due to gestural location) thought that such an emphasis was clearer (i.e., “made more sense”) when there was a table present than when there was no suitable referent for the word “there.” Furthermore, for those who perceived (categorized) “there” as the emphasis, more disagreed it was clear when there was no table compared with when there was a table. These results support our hypothesis that gestural emphasis in concert with relevant environmental context contributes significantly to the meaningfulness of perceived (categorized) speech.
The clarity ratings indicated that the perceived focus of the sentence was emphasized more clearly when the gesture was centered well within the PSR. If clarity ratings for “there” are placed in order of preference, most participants “agreed” that the focus of the sentence was clear, followed by “strongly agree” and then “neither agree nor disagree.” Provided the gesture was within the PSR, very few participants “disagreed” or “strongly disagreed” that the focus of the sentence was clear, especially near the center point of “there” PSR (Frame 43) (Figure 6). Subsequently there was a steady increase in responses that “disagreed” that the emphasis on “there” was clear as the gesture was moved toward the end of “there” PSR and was most dramatic for Frame 48 before falling on Frame 49 as perception (categorization) of “now” commenced. Overall, the ratings for clarity generally followed the profile of the categorization results. Participants thought that the emphasis perceived was also clear well before the onset of the speech signal and clarity of the word perceived decreased dramatically toward the end of the word's PSR.
The results lend support to the first hypothesis that the perceived focus of a sentence does depend on the timely coordination of a speaker's gestures and speech. The categorization results clearly show that the perceived semantic focus of a sentence alters as the gestural location shifts along the speech signal. Concomitantly, the clarity ratings reinforce the conclusion that the certainty of what speech is perceived increases with appropriate synchronization between body movement and vocalization. The second hypothesis, that the perceived semantic clarity of a sentence depends on environmental context, also found considerable support. Throughout the clarity ratings data, the presence of a table led to significantly different ratings of clarity compared with when there was no table. For example, more agreed that the emphasis on “book” was clearer when the table was absent compared with when it was present. Symmetrically, more agreed that the emphasis on “there” was clearer when a table was present than when absent. Further, in the absence of any biasing gesture, participants still selected “there” as the focus of the sentence significantly more when there was a table than without a table. Concomitantly, more participants who perceived “there” as the focus of the sentence disagreed that the emphasis was clear when there was no table compared with when there was a table. In sum, our data on the perception of gestured speech indicates that the clarity of perception, that is, how much of that which is perceived “makes sense,” significantly depends on relevant environmental context.
A variety of implications follow from these results in terms of understanding how humans naturally communicate, how best to conceive of the dynamical and neural bases for communication, and how technology such as computer animation might facilitate communication. Regarding the latter, our results emphasize that if animators wish to add emphasis, then the mid-stroke of a gesture should be synchronous or even precede the acoustic body of the word uttered. The addition of appropriate speech-hand coordination could add realism and accuracy and hence increase the effectiveness of computer-animated characters and avatars for human computer interfaces and virtual environments (Gullberg & Holmquist, 1999, 2002; Rogers, 1978; Stanney, 2002).
Our results add to (and suggest gestural versions of) recent demonstrations of the surprising flexibility of synchrony of speech articulators and the associated acoustics as seen in classic McGurk effect experiments, where the acoustics can lag visual stimuli by as much as 180 ms (Munhall, Gribble, Sacco, & Ward, 1996; Munhall &Tohkura, 1998), and investigations of asynchrony between facial motion and acoustics (Abry, Lallouache, & Cathiard, 1996; Santi, Servos, Vatikiotis-Bateson, Kuratate, & Munhall, 2003; Yehia, Kuratate, & Vatikiotis-Bateson, 2002; Yehia, Rubin, & Vatikiotis-Bateson, 1998). The studies of orofacial motion in particular have emphatically demonstrated that the speech acoustics can be better estimated by the 3D dynamics of the face than by the midsaggital motion of the anterior vocal tract (i.e., lips, tongue, and jaw) (Yehia, Rubin, & Vatikiotis-Bateson, 1998).
Our results expand the phenomenon of phase flexibility to include speech-hand and speech-gesture coordination. A recent study has also shown that nonverbal gestural phenomena such as head movement (visual prosody) have a marked effect on perceived speech, lexical segmentation, and comprehension. It was found that speech amplitude and pitch correlated strongly with natural head movement (Munhall, Jones, Callan, Kuratate, & Vatikiotis-Bateson, 2004). Further, when using an animated speaker such that the normal synchrony of head motion was disassociated from speech, perceivers had greater difficulty in perceiving relevant aspects of speech. Such findings attest to the importance of nonverbal gestures for understanding face-to-face communication.
With regard to understanding specifically interpersonal speech-hand communication, previous research has shown that in the production of speech, co-occurring hand gestures tended to be initiated either prior to or simultaneously with the initiation of the relevant word spoken (Krauss, 1988). This was thought to reflect a process whereby the gesture enhances retrieval of the word to be spoken (i.e., the “lexical affiliate”) from memory. This fits in with the current results that show, symmetrically, that the perception of the intended focus of a sentence is strongly influenced by a gesture provided that the gesture is produced prior to or simultaneous with the utterance. Further research could explore possibilities of helping individuals with speech impediments such as stuttering by focusing more on gestural techniques to increase communicative skills (e.g., Mayberry & Nicholadis, 2000; Mayberry & Shenker, 1997; van Lieshout, 2004).
What mechanisms might provide a basis for speech-hand coordination? It has been argued that language evolved from simple primate gestures rather than from vocal origins (Corballis, 2002, 2003). The articulatory phonology approach pioneered by Browman and Goldstein (e.g., 1992), as well as the dynamical systems-based modeling approach known as task dynamics (Saltzman & Byrd, 2000), proposes that phonological units, “articulatory gestures,” serve a dual role as units of language production and language perception and correspond to the constriction actions of distinct vocal organs. The constrictions can be modeled as states of a dynamical system such that the motion event of articulator movement constitutes an articulatory gestural unit. This view of articulatory gestures as dynamical systems provides a foundation for conceiving how produced and perceived language forms might be complementary for a language user.
The notion of gestural articulatory dynamics as a basis for speech production and perception has been extended to include co-occurrent simple manual activity (Treffner & Peter, 2002). It was proposed that the coordination dynamics governing the timing patterns of the hands and vocal tract also provides a basis for the direct perception of a speaker's linguistic intentions during communication. Experiments were conducted using a phase transition paradigm to examine the coordination of speech-hand gestures in both left-handed and right-handed individuals. In those experiments it was shown that patterns of speech-hand synchrony are determined by definite coordination dynamics that have been extensively investigated in a variety of biological coordination tasks over the last 2 decades (i.e., the Haken-Kelso-Bunz (HKB) model; e.g., Kelso, 1995). Results showed that in a simple monosyllabic speech-finger tap synchronization task that required either in-phase (e.g., /ba/ + tap…/ba/ + tap…, etc.) or anti-phase (e.g., /ba/…tap…/ba/ …tap…, etc.) coordination with a pacing metrononome, the asynchrony of the manual tap over the speech utterance (i.e., tap preceded speech) decreased as rate of production increased. That is, the kinematics of finger and jaw became increasingly coincident as performance rate increased. This behavior can be understood as a lawful consequence of the generic dynamics that account for such coordination behavior, in this case, a particular parameterization of the extended asymmetric HKB model (Treffner & Peter, 2002; Treffner & Turvey, 1995, 1996). Specifically, when a constant phase offset of 50░ between finger and jaw was introduced into the model (together with inclusion of “attentional” factors, the c-term and d-term of the extended HKB equation), then the empirical data was reproduced—both its quantitative form and qualitative evolution under parametric change. We considered the constant “perceptual offset” of 50░ to be consistent with evidence from speech perception experiments whereby individuals synchronized an external acoustic or motor event with the perceived centre of a syllable (i.e., the word's “p-center” or perceptual center). The perceptual offset of 50░ helps explain the large corpus of data on perceptual synchrony since, in our experiment, subjective, perceived synchrony between tap and speech (i.e., the in-phase task requirement) was intentionally achieved even though there was approximately a 50░ asynchrony between tap and speech kinematics (Treffner & Peter, 2002). The gesture (finger tap) tended to precede speech (jaw opening), and the degree of anticipation of gesture over speech decreased as the rate of production increased. That is, synchrony increased as rate of coordination increased.
What then is the relation of an intrinsic perceptual phase offset in the nonlinear dynamics underlying the production of speech-hand coordination (Treffner & Peter, 2002) to the current results on relative synchrony during perception of speech-hand coordination? We believe the relation lies in how a varying rate of production alters speech-hand synchrony and the consequences of such synchrony on the accuracy of perception, that is, in appropriately comprehending a speaker's utterances. If speech-hand coordination patterns follow lawful regularities determined and predicted by the dynamics of rhythmic oscillatory systems (e.g., Treffner & Peter, 2002), and social coordination relies primarily upon physical dynamics rather than logical inferential mechanisms (Marsh, Richardson, Barron, & Schmidt, 2006), then it is reasonable to propose that speech-gesture coordination co-evolved as a contemporaneous mechanism to speech alone and that it could be harnessed by communicative dyads for purposes of emphasis and clarity. Reciprocally, the fact that individuals in face-to-face conversation often misunderstand one another follows from the negative effect that speech-hand asynchrony has on clarity of meaning. However, mechanisms alone cannot explain phenomenal meaning. But a coordination mechanism based on dynamics can entail semantic properties precisely because it is situated within an ecological context, the biologically relevant environment in which the perception-action cycles supporting communicative intentions evolved.
Communication would be more accurately approached as a biological or, more generally, an ecological phenomenon (Millikan, 1984; Treffner, 1999b) that is based on mechanisms not unlike the phase transition or resonance dynamics of between-person entrainment (Schmidt & Turvey, 1994; Treffner, 1999a; Treffner & Turvey, 1993). Taking a step further, social psychologists could emphasize more subtle qualities and constraints of the ecologically real social environment, so-called values such as “caring”; these must be satisfied by those engaged in successful communication and should be recognized and incorporated into a full-fledged pragmatic theory of language (Hodges, 2007). Communication thus understood instantiates a mechanism (of resonance dynamics—the “syntax”) that entails awareness (of affordances—the “semantics”). This model replaces more conventional assumptions of the transmission and reception of semantics-free symbols that require (somehow) interpretation. Such semantic embellishment and integration into meaningful “mental concepts” is, however, theoretically untenable. Such a rebuff of information-processing approaches that purport to explain language perception is not new (Reed, 1996; Shaw et al., 1981; Turvey et al., 1981), but it remains a daunting challenge for ecological scientists to provide sufficiently convincing empirical evidence to dislodge the currently accepted dogmas of seductive representationalism and symbolic computational wizardry.
But why did we then choose to use computer animation to address issues of natural communication? By definition, every experimental manipulation results in a reduction in richness and, unfortunately, sometimes a distortion of the ecological information available in natural situations. The goal is always to minimize the impact of such intervention, minimize (or avoid) the distortion, and yet distill and describe the natural dynamics at play (Kelso, 1995). From a Gibsonian standpoint, we support the emphasis on “ecological validity” in experimental design that has, in part, emerged from the ecological psychology of James Gibson and his associates (e.g., Shaw et al., 1981). We have attempted to create an experimental design in this vein (i.e., keep and manipulate the essential information and dispense with the nonessential information). The current work is an attempt to investigate natural language communication by altering the stimulus display in relevant ways rather than impoverishing it. The phenomenon of natural language involves more than vocal, body, and environmental components. It is somehow the spatiotemporal mixture of all these facets that permits successful and meaningful exchange between organisms of their perceptions, viewpoints, thoughts, and intentions.
What relation, if any, might exist between the concept of speech-hand coordination as based in dynamics, and contemporary research into the neural correlates of linguistic ability? Recently, there has been growing evidence that neural systems involved in speech perception are tightly coupled with those involved in speech production, or more generally, that the observation and execution of action share a common neural substrate. It has been reported that both listening to speech and observing speech-related lip movements increased the excitability of motor units underlying speech production (Watkins, Strafella, & Paus, 2003), with increased excitability pronounced in the left hemisphere. Similarly, a significant increase in motor evoked potentials was recorded from the tongue in response to transcranial magnetic stimulation of the left primary motor cortex when listening to speech containing consonants for whose production tongue movements were required (Fadiga, Craighero, Buccino, & Rizzolatti, 2002). These findings are consistent with those of Haueisen and Knosche (2001), who demonstrated involuntary neural activity above the primary motor hand area in pianists listening to music. Observations such as those of Floel, Ellger, Breitenstein, and Knecht (2003), which show that linguistic tasks such as speech production and perception activate hand motor cortex, and those of Meister et al. (2003), who showed that during reading aloud there is an increase in excitability of hand motor cortex of the language-dominant hemisphere, add to a body of evidence that indicates a functional connection between the motor area for manual activity of the language-dominant hemisphere and regions subserving language comprehension. Indeed, it was observed that hand gestures influenced acoustical encoding of co-produced speech. This suggests that hand gestures are integrated with speech (at least in a speaker) at an early stage of language processing (Kelly, Kravitz, & Hopkins, 2004).
All these observations support theories that favor a strong evolutionary and functional link between manual gestures and the development of speech and language. According to such theories, the evolution of language is based on a neural motor system that specializes in action recognition. Manual gestures, as a form of goal-directed action, are well disposed for conveying meaning to such an action-attuned perceptual system. The discoveries of Ferrari, Gallese, Rizzolatti, and Fogassi (2003) of new categories of “mirror” neurons in F5 in monkey cortex included mouth mirror neurons and a population of hand mirror neurons named audiovisual mirror neurons that became active when the monkey performed a hand action or when it only heard the action-related sound. Indeed, Taira et al. (1990) argued that cells of parietal cortex are responsive to the affordances for grasping (their term) available in the visual system and in turn transmit this activity to F5. A laudable attempt to create a neural network framework incorporating action, mirror neurons, and perception (of affordances) is provided by Arbib and colleagues (Arbib, 2005; see also commentaries). Further understanding of how a transition from a system for basic action recognition might have become equipped for language is provided by Kohler et al. (2002). The fact that audiovisual mirror neurons are a subcategory of hand mirror neurons offers an insight into language evolution and the strong coupling between manual gestures, sound, and ultimately, articulatory gestures. However, the evidence for an action-based perceptual system for language that is provided by data on mirror neurons is not unique to monkey. In an fMRI study in humans, it was shown that a participant's observation of actions performed with the hand, mouth, or foot led to the activation of distinctly different parts of Broca's area and offers further support for the existence of neural correlates of a speech-gesture coordination system in humans (Buccino, Binkfski, & Riggio, 2004).
Although we take such results on the neural mechanisms of speech-hand coordination as confirmation of a deeply seated relation between perception and action, such neural mechanisms are not by themselves explanatory. We hope that the present results help foster mutual progress in both relatively macroscopic (ecological/information-based) and relatively microscopic (neural/material-based) approaches to an emerging multidisciplinary science of information-based behavior. Furthermore, communication is necessarily a social phenomenon that occurs between persons and thus involves interpersonal information; neural measurements cannot delineate the specificational information about affordances at the ecological scale to which communicative acts often refer. Similarly, symbolic descriptions at the cognitive (representational) scale are inadequate for capturing the subtleties of perception-action behavior (Borghi, 2004; Shaw, 2001; Shaw et al., 1981; Turvey et al., 1981).
We hope that our work will further understanding of the complex temporal relations between gesture and speech. As one of the pioneers of gesture research, Adam Kendon, has implored, “Further studies of how gesture contributes to understanding in interaction are very much needed” (Kendon, 2004, p.94). However, in order to understand how gestures facilitate meaningful communicative events, a spotlight should also be shone on the affordance structure of the environment and how the coordination dynamics and associated rhythms of perceiver-actors interface with such environments. Anything less runs the risk of assumed mechanisms with little regard for context and binding relations. Nonverbal communication has the advantage of extending beyond the ephemeral temporal domain and into the domain of contextually situated, spatiotemporal, visible events. “Events are perceivable but time is not” (Gibson, 1975). Given the temporally extended dynamics of behavioral interaction (Schmidt, 2007; Treffner & Kelso, 1999), the challenge, it seems, is to harness technology in such a way that it does not obscure or distort the fundamental event-based nature of ecological social phenomena. If the mechanism of spatiotemporal event perception is dynamics-based and not memory-based (Treffner & Kelso, 1999), then of particular interest is how the precise timing relations we have shown might be harnessed for the benefit of emerging human interface technologies that exploit computer animation, social interaction in virtual environments, and animation-based provision of information such as via animated characters (avatars). The current research shows that clarity of exposition can be enhanced by the appropriate phasing of body gestures and speech, either with the accompanying gesture phase-advanced or approximately synchronous with speech. Instructional learning via displays of avatars or virtual instructors would benefit by increasing the prominence of gesture in actual or virtual instructors since it has been shown that accompanying gesture can significantly improve the acquisition of concepts in children (Singer & Goldin-Meadow, 2005). Conversely, gesturing with the requisite phase lag between speech and associated gesture is likely to result in lesser clarity. Interestingly, if intentionally created, such phase-lagged gestures could be exploited to a speaker's advantage by increasing the ambiguity or “open-endedness” of an utterance, or even to intentionally obfuscate, confuse, and confound a listener, or to conceal a speaker's true intentions. Whether avatars will remain as explicitly programmed animations produced by the rather old-fashioned explicit key framing method (as were the present simulations), or whether they will evolve (self-organize) into surprisingly naturalistic renditions of biological motion, perhaps using the wealth of research on biological coordination dynamics that has been produced by dynamics-inspired ecological psychologists, remains to be seen. The latter is surely inevitable.
The current results and those of our previous work on simultaneous tapping and babbling (Treffner & Peter, 2002) show that the inherent phase-lag in speech-hand dynamics can be entrained by an intentional communicative system to the mutual advantage of those involved in a dialogue (see also Pettito et al., 2004). We believe the perceived synchrony of gestures and words is based on a flexible spatiotemporal dynamics around which the binding of speech and movement, and its grounding in the environment, can occur in a meaningful or ecologically relevant way. This binding and grounding constitutes the speaker's intentions such that they can be presented to (not represented in) an appropriately attuned listener who can then effectively reach out, “grasp,” and understand the intended message. Under such ecological conditions, a listener can experience the direct perception of meaning.
Results were first presented at Symmetry and Duality: Principles for an Ecological Psychology, a Festschrift in honor of Bob Shaw and his contributions to experimental psychology, which occurred at the University of Connecticut, June 2004. Thanks to Eric Vatikiotis-Bateson, Pascal van Lieshout, and an anonymous reviewer for their helpful comments on the manuscript.
1Example animations from the experiment can be viewed and downloaded on the Internet at http://gesturesandphases.tripod.com and http://www.trincoll.edu/depts/ecopsyc/gestures.htm. They can also be requested from Paul Treffner at firstname.lastname@example.org.
List of Figures
FAQs in: English . Franšais . Espa˝ol .
ę 2009 Informa plc