Co-speech Gesture Video Generation with 3D Human Meshes

1 Carnegie Mellon University    2 University of Science and Technology of China   
3 Ping An Technology    4 PAII Inc.

ECCV 2024


Ablation Study (Mesh to Video)

Given a sequence of input untextured meshes, we study the role of (1) depth and normal maps and (2) textured meshes in generating videos. We compare our full method, which takes depth maps, normal maps, and textured renderings of the meshes as input to generate video, against (1) Ours (w/o depth and normal maps), which drops the additional depth and normal map conditioning, and (2) Ours (w/o textured mesh), which uses only renderings of untextured meshes as input. We also provide the corresponding ground-truth video for comparison.
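For concreteness, here is a minimal sketch of how the three variants could differ purely in the conditioning tensor fed to the mesh-to-video generator. The function name, tensor shapes, and channel-concatenation scheme are illustrative assumptions for this page, not the paper's actual interface.

```python
import torch

def build_conditioning(textured_rgb, untextured_rgb, depth, normal,
                       use_textured=True, use_depth_normal=True):
    """Assemble a per-frame conditioning tensor for a video generator.

    textured_rgb / untextured_rgb: (T, 3, H, W) mesh renderings
    depth:  (T, 1, H, W) rendered depth maps
    normal: (T, 3, H, W) rendered normal maps
    """
    # Appearance channels: textured rendering in the full model,
    # plain untextured rendering in the "w/o textured mesh" ablation.
    maps = [textured_rgb if use_textured else untextured_rgb]
    # Geometry channels, dropped in the "w/o depth + normal" ablation.
    if use_depth_normal:
        maps += [depth, normal]
    return torch.cat(maps, dim=1)  # (T, C, H, W) with C = 3 or 7

# Full model: textured RGB + depth + normal -> 7 conditioning channels.
T, H, W = 16, 256, 256
cond_full = build_conditioning(torch.rand(T, 3, H, W), torch.rand(T, 3, H, W),
                               torch.rand(T, 1, H, W), torch.rand(T, 3, H, W))
assert cond_full.shape == (T, 7, H, W)
```

Under this (assumed) scheme, each ablation is a strict subset of the full model's conditioning channels, so differences in output quality can be attributed to the dropped inputs.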

Each comparison below shows, left to right: Input Mesh · Ours (w/o Depth + Normal) · Ours (w/o Textured Mesh) · Ours · Ground Truth Video.

[Video comparison: Oliver]

[Video comparison: Conan]

[Video comparison: Seth]

[Video comparison: Chemistry]