Co-speech Gesture Video Generation with 3D Human Meshes

1 Carnegie Mellon University    2 University of Science and Technology of China   
3 Ping An Technology    4 PAII Inc.

ECCV 2024


Gallery --- Our Results (Audio to Video)

We introduce a method to generate co-speech gesture videos of an actor from audio input. Generating realistic hands is especially challenging when only a relatively small amount of video of the actor is available for training. Please play the audio in each video to listen to the input speech.

Input Mesh

Textured Mesh

Generated Video

Oliver

Conan

Seth*

Chemistry*


* The hands of "Seth" and "Chemistry" in the videos generated from audio input contain cloudy artifacts because (1) the number of training frames for these examples is extremely small (< 7K frames), and (2) the SMPL-X mesh sequence generated from audio at inference can differ from the training mesh distribution. We observe that more than 25K training frames are needed for the GAN model to be robust to out-of-domain mesh sequences at inference (as in the case of "Oliver").