Co-speech Gesture Video Generation with 3D Human Meshes

Baseline Comparison (Mesh to Video)

We compare videos generated by our method, which uses intermediate renderings of 3D meshes as conditioning, to a baseline method that uses 2D keypoints as the intermediate representation. The 2D keypoints are extracted from the ground-truth video using MediaPipe. We also provide the corresponding ground-truth video for comparison.

Video panels, left to right: Keypoint Maps (MediaPipe), 2D Baseline, Input Mesh, Ours, Ground Truth Video.

Speakers shown (one comparison each): Oliver, Conan, Seth, Chemistry.
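
To make the baseline's intermediate representation concrete, here is a minimal sketch of turning 2D keypoints into a keypoint map. It assumes the keypoints arrive as (x, y) pairs normalized to [0, 1] by frame width and height, as MediaPipe reports landmarks. The page does not say how the baseline's maps are drawn, so the function name `rasterize_keypoints`, the disc radius, and the 64x64 resolution are all hypothetical choices for illustration.

```python
import numpy as np


def rasterize_keypoints(norm_xy, height, width, radius=2):
    """Draw normalized (x, y) keypoints as filled discs on a blank map.

    Hypothetical helper: an illustration, not the baseline's actual renderer.
    """
    canvas = np.zeros((height, width), dtype=np.uint8)
    ys, xs = np.mgrid[0:height, 0:width]  # per-pixel row and column indices
    for x, y in norm_xy:
        # Skip keypoints that fall outside the frame.
        if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
            continue
        # Convert normalized coordinates to pixel indices, clamped to the frame.
        cx = min(int(x * width), width - 1)
        cy = min(int(y * height), height - 1)
        # Fill every pixel within `radius` of the keypoint centre.
        canvas[(xs - cx) ** 2 + (ys - cy) ** 2 <= radius ** 2] = 255
    return canvas


# Made-up keypoints for one frame; the third lies outside the frame.
keypoints = np.array([[0.5, 0.5], [0.25, 0.75], [1.2, 0.1]])
keypoint_map = rasterize_keypoints(keypoints, height=64, width=64)
```

A map like this keeps only sparse 2D positions per frame, whereas a rendered 3D mesh also carries surface and depth cues, which is the difference in conditioning that the comparison above contrasts.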

Aniruddha Mahapatra* 1

Richa Mishra* 1

Renda Li 2,3

Ziyi Chen 4

Boyang Ding 2,3

Shoulei Wang 2,3

Jun-Yan Zhu 1

Peng Chang 4

Mei Han 4

Jing Xiao 3

1 Carnegie Mellon University 2 University of Science and Technology of China 3 Ping An Technology 4 PAII Inc.

ECCV 2024
