Co-speech Gesture Video Generation with 3D Human Meshes

1 Carnegie Mellon University    2 University of Science and Technology of China   
3 Ping An Technology    4 PAII Inc.

ECCV 2024


Ablation Study (Mesh to Video)

Given a sequence of input untextured meshes, we study the role of (1) depth and normal maps and (2) textured meshes in generating videos. We compare our full method, which takes depth maps, normal maps, and textured renderings of the meshes as input to generate video, against (1) Ours (w/o depth and normal maps), which drops the additional depth and normal map conditioning, and (2) Ours (w/o textured mesh), which uses only renderings of untextured meshes as input. We also provide the corresponding ground-truth video for comparison.
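For concreteness, here is a minimal sketch of how the three variants could differ purely in the conditioning tensor fed to the mesh-to-video generator. The function name, tensor shapes, and channel-concatenation scheme are illustrative assumptions for this page, not the paper's actual interface.

```python
import torch

def build_conditioning(textured_rgb, untextured_rgb, depth, normal,
                       use_textured=True, use_depth_normal=True):
    """Assemble a per-frame conditioning tensor for a video generator.

    textured_rgb / untextured_rgb: (T, 3, H, W) mesh renderings
    depth:  (T, 1, H, W) rendered depth maps
    normal: (T, 3, H, W) rendered normal maps
    """
    # Appearance channels: textured rendering in the full model,
    # plain untextured rendering in the "w/o textured mesh" ablation.
    maps = [textured_rgb if use_textured else untextured_rgb]
    # Geometry channels, dropped in the "w/o depth + normal" ablation.
    if use_depth_normal:
        maps += [depth, normal]
    return torch.cat(maps, dim=1)  # (T, C, H, W) with C = 3 or 7

# Full model: textured RGB + depth + normal -> 7 conditioning channels.
T, H, W = 16, 256, 256
cond_full = build_conditioning(torch.rand(T, 3, H, W), torch.rand(T, 3, H, W),
                               torch.rand(T, 1, H, W), torch.rand(T, 3, H, W))
assert cond_full.shape == (T, 7, H, W)
```

Under this (assumed) scheme, each ablation is a strict subset of the full model's conditioning channels, so differences in output quality can be attributed to the dropped inputs.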

Each comparison below shows, left to right: Input Mesh · Ours (w/o Depth + Normal) · Ours (w/o Textured Mesh) · Ours · Ground Truth Video.

[Video comparison: Oliver]

[Video comparison: Conan]

[Video comparison: Seth]

[Video comparison: Chemistry]