Co-speech Gesture Video Generation with 3D Human Meshes

1 Carnegie Mellon University    2 University of Science and Technology of China   
3 Ping An Technology    4 PAII Inc.

ECCV 2024


Ablation Study (Audio to Video)

Given an audio input, we study the role of (1) depth and normal maps and (2) textured meshes for generating videos. We compare our full method, which uses depth maps, normal maps, and textured renderings of the meshes as input to generate video, against (1) Ours (w/o depth and normal maps), which does not use the additional depth and normal map conditioning, and (2) Ours (w/o textured mesh), which uses only renderings of untextured meshes as input. Please play the audio in each video to listen to the input speech.
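For illustration, below is a minimal sketch (not the authors' code) of how the three conditioning signals compared in this ablation could be assembled channel-wise as input to an image-to-image GAN generator. The tensor shapes, the placeholder Generator module, and the variable names are assumptions made only to clarify the ablation settings.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Placeholder image-to-image generator; the actual model is a video GAN."""
    def __init__(self, in_channels: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 3, kernel_size=3, padding=1),  # output RGB frame
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Per-frame renderings of the SMPL-X mesh (batch of 1; 256x256 resolution assumed).
textured_rgb = torch.rand(1, 3, 256, 256)  # textured mesh rendering
depth_map    = torch.rand(1, 1, 256, 256)  # rendered depth map
normal_map   = torch.rand(1, 3, 256, 256)  # rendered surface normals

# Full method: concatenate textured rendering, depth, and normals channel-wise.
cond_full = torch.cat([textured_rgb, depth_map, normal_map], dim=1)   # 7 channels

# Ablation "w/o depth and normal maps": textured rendering only.
cond_no_depth_normal = textured_rgb                                   # 3 channels

# Ablation "w/o textured mesh": untextured (gray-shaded) rendering plus depth and normals.
untextured = torch.rand(1, 1, 256, 256).repeat(1, 3, 1, 1)
cond_no_texture = torch.cat([untextured, depth_map, normal_map], dim=1)  # 7 channels

frame = Generator(in_channels=cond_full.shape[1])(cond_full)
print(frame.shape)  # torch.Size([1, 3, 256, 256])
```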

Each row below shows one subject (Oliver, Conan, Seth*, Chemistry*) with four videos: Input Mesh, Ours (w/o Depth + Normal), Ours (w/o Textured Mesh), and Ours.

Oliver

Conan

Seth*

Chemistry*

* The hands of "Seth" and "Chemistry" in the videos generated from audio input contain cloudy artifacts because (1) the number of training frames for these examples is extremely small (< 7K frames), and (2) the SMPL-X mesh sequence generated from audio at inference can differ from the training mesh distribution. We observe that more than 25K training frames are needed for the GAN model to be robust to out-of-domain mesh sequences at inference (as in the case of "Oliver").