Co-speech gesture video generation is an enabling technique for many digital human applications. While substantial progress has been made in creating high-quality talking head videos, existing hand gesture video generation methods are largely constrained by the widely adopted 2D skeleton-based gesture representation and still struggle to generate realistic hands. We introduce a co-speech video generation pipeline that synthesizes human speech videos from a human mesh-based representation. Building on a 3D human mesh-based gesture representation, we present a mesh-grounded video generator that consists of a mesh texture map optimization step followed by a conditional GAN network, and outputs photorealistic gesture videos with realistic hands. Experiments on the TalkSHOW dataset demonstrate the effectiveness of our method over 2D skeleton-based baselines.
Given an input audio \( \boldsymbol{A_{t}} \), we first train a network to predict plausible face, hand, and body motion, denoted as an SMPL-X mesh \( \boldsymbol{M_t} \) for each frame, where \( \boldsymbol{t} \) is the frame index. We then optimize a UV texture map \( \boldsymbol{U_{tex}} \) for each character using differentiable rendering so that all video frames can be reconstructed from the texture map. Finally, we design a conditional GAN-based video generator \( \boldsymbol{G_{frame}} \) that synthesizes the final video from the current 2D rendered image \( \boldsymbol{I_{t}} \), the normal map \( \boldsymbol{N_t} \), the depth map \( \boldsymbol{D_t} \), and 2D rendered images from nearby frames. The video generator is trained with a combination of reconstruction loss and GAN loss.
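To make the texture optimization step concrete, the snippet below is a minimal sketch of fitting \( \boldsymbol{U_{tex}} \) with a differentiable renderer. It assumes PyTorch3D as the rendering backend, and the tensors verts_t, faces, verts_uvs, faces_uvs, and gt_frames are placeholders for the per-frame SMPL-X geometry, the shared UV layout, and the training frames; the actual renderer settings, losses, and schedule in our implementation may differ.

```python
import torch
from pytorch3d.structures import Meshes
from pytorch3d.renderer import (
    FoVPerspectiveCameras, RasterizationSettings, MeshRasterizer,
    MeshRenderer, SoftPhongShader, PointLights, TexturesUV,
)

device = torch.device("cuda")

# Placeholder inputs (assumptions):
#   verts_t:   list of (V, 3) per-frame SMPL-X vertices
#   faces:     (F, 3) shared mesh topology
#   verts_uvs: (V', 2) UV coordinates, faces_uvs: (F, 3) UV face indices
#   gt_frames: (T, 512, 512, 3) ground-truth frames in [0, 1], resized to the render size

# Learnable UV texture map U_tex, optimized so rendered frames match the video.
tex_map = torch.full((1, 1024, 1024, 3), 0.5, device=device, requires_grad=True)

cameras = FoVPerspectiveCameras(device=device)
renderer = MeshRenderer(
    rasterizer=MeshRasterizer(
        cameras=cameras,
        raster_settings=RasterizationSettings(image_size=512, faces_per_pixel=1),
    ),
    shader=SoftPhongShader(device=device, cameras=cameras,
                           lights=PointLights(device=device)),
)

optimizer = torch.optim.Adam([tex_map], lr=1e-2)
num_steps = 2000  # illustrative
for step in range(num_steps):
    t = torch.randint(len(verts_t), (1,)).item()           # sample a training frame
    textures = TexturesUV(maps=tex_map.clamp(0, 1),
                          faces_uvs=[faces_uvs], verts_uvs=[verts_uvs])
    mesh = Meshes(verts=[verts_t[t]], faces=[faces], textures=textures)
    rendered = renderer(mesh)[..., :3]                      # (1, 512, 512, 3)
    loss = torch.nn.functional.l1_loss(rendered[0], gt_frames[t])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Once optimized, the same texture map is reused for every frame of the character, so the rendered images \( \boldsymbol{I_t} \) can serve as per-frame conditioning for \( \boldsymbol{G_{frame}} \).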
We introduce a method to generate co-speech gesture videos of an actor from audio input, an
especially challenging task for producing realistic hands when only a relatively small amount of video of the actor is
available during training.
Please play the audio in each video to listen to the input speech.
To view the results for all speakers, please refer to our Gallery
--- Our Results (Audio to Video) page.
Note: The background of each actor in the training sequence is not constant (it moves because the
camera is not completely still across the entire training video). This leads to slight changes in the background of the generated
results.
Input Mesh
Textured Mesh
Generated Video
We show the texture map and video generated by our method from a given untextured mesh sequence.
We also provide the corresponding ground-truth video for comparison.
To view the results for all speakers, please refer to our Gallery
--- Our Results (Mesh to Video) page.
Note: The background of each actor in the training sequence is not constant (it moves because the
camera is not completely still across the entire training video). This leads to slight changes in the background of the generated
results.
Input Mesh
Textured Mesh
Generated Video
Ground Truth Video
We compare video generated by our method, which uses intermediate renderings of 3D meshes as conditioning, to a baseline
method that uses 2D keypoints as the intermediate representation. The 2D keypoints are extracted with MediaPipe from the
ground-truth video. We also provide the corresponding ground-truth video for comparison.
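As a reference for how the baseline's conditioning is obtained, the snippet below is a minimal sketch of extracting 2D keypoint maps with MediaPipe Holistic from a ground-truth video; the exact set of landmarks and the drawing style used for the baseline conditioning are assumptions.

```python
import cv2
import numpy as np
import mediapipe as mp

mp_holistic = mp.solutions.holistic
mp_drawing = mp.solutions.drawing_utils

cap = cv2.VideoCapture("ground_truth.mp4")  # hypothetical path
keypoint_maps = []
with mp_holistic.Holistic(static_image_mode=False, model_complexity=2) as holistic:
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        # Rasterize the detected body, hand, and face landmarks onto a blank
        # canvas; this image serves as the 2D conditioning map for the baseline.
        canvas = np.zeros_like(frame)
        for landmarks, connections in [
            (results.pose_landmarks, mp_holistic.POSE_CONNECTIONS),
            (results.left_hand_landmarks, mp_holistic.HAND_CONNECTIONS),
            (results.right_hand_landmarks, mp_holistic.HAND_CONNECTIONS),
            (results.face_landmarks, mp_holistic.FACEMESH_CONTOURS),
        ]:
            if landmarks is not None:
                mp_drawing.draw_landmarks(canvas, landmarks, connections)
        keypoint_maps.append(canvas)
cap.release()
```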
To view the results for all speakers, please refer to our Gallery
--- Baseline Comparison (Mesh to Video) page.
Note: The background of each actor in the training sequence is not constant (it moves because the
camera is not completely still across the entire training video). This leads to slight changes in the background of the generated
results.
Keypoint Maps
(MediaPipe)
2D Baseline
Input Mesh
Ours
Ground Truth Video
Given a sequence of input untextured meshes, we study the role of (1) depth and normal maps and (2)
textured meshes for generating videos. We compare our full method, which uses depth maps, normal maps, and
textured renderings of the meshes as input to generate video, to (1) Ours (w/o depth and normal maps),
which does not use the additional depth and normal map conditioning as input, and (2) Ours (w/o textured mesh),
which only uses renderings of untextured meshes as input (see the conditioning sketch below). We also provide the corresponding ground-truth
video for comparison.
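To make the three variants concrete, the snippet below is a minimal PyTorch sketch of assembling the per-frame conditioning tensor under each ablation setting; the helper name build_condition, the channel layout, and the argument names are illustrative assumptions rather than our exact implementation.

```python
import torch

def build_condition(textured_rgb, untextured_rgb, normal_map, depth_map,
                    neighbor_rgb, use_depth_normal=True, use_textured_mesh=True):
    """Stack the conditioning channels fed to the video generator G_frame.

    textured_rgb:   (3, H, W) rendering of the textured mesh I_t
    untextured_rgb: (3, H, W) rendering of the untextured (gray) mesh
    normal_map:     (3, H, W) normal map N_t
    depth_map:      (1, H, W) depth map D_t
    neighbor_rgb:   (3 * K, H, W) renderings of K nearby frames
    """
    base = textured_rgb if use_textured_mesh else untextured_rgb
    channels = [base, neighbor_rgb]
    if use_depth_normal:
        channels += [normal_map, depth_map]
    return torch.cat(channels, dim=0)

# Full method and the two ablated variants studied above:
# cond_full   = build_condition(..., use_depth_normal=True,  use_textured_mesh=True)
# cond_no_dn  = build_condition(..., use_depth_normal=False, use_textured_mesh=True)
# cond_no_tex = build_condition(..., use_depth_normal=True,  use_textured_mesh=False)
```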
To view the results for all speakers, please refer to our Gallery
--- Ablation Study (Mesh to Video) page.
Note: The background of each actor in the training sequence is not constant (it moves because the
camera is not completely still across the entire training video). This leads to slight changes in the background of the generated
results.
Input Mesh
Ours
(w/o
Depth + Normal)
Ours
(w/o
Textured Mesh)
Ours
Ground Truth Video
Given an audio input, we study the role of (1) depth and normal maps and (2) textured meshes for generating videos.
We compare our full method, which uses depth maps, normal maps, and textured renderings of the meshes as input to generate video,
to (1) Ours (w/o depth and normal maps), which does not use the additional depth and normal map conditioning as input, and
(2) Ours (w/o textured mesh), which only uses renderings of untextured meshes as input. Please play the audio in each
video to listen to the input speech.
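For context, the snippet below is a minimal sketch of how the full audio-to-video inference chains the three stages described above (audio-to-motion, mesh rendering, frame generation); the callables audio_to_motion, render_maps, and frame_generator and their interfaces are hypothetical placeholders for the trained components.

```python
import torch

# Hypothetical, pre-trained components (names and interfaces are assumptions):
#   audio_to_motion(audio)   -> list of per-frame SMPL-X meshes M_t
#   render_maps(mesh, U_tex) -> textured RGB I_t, normal N_t, depth D_t
#   frame_generator(cond)    -> generated video frame

@torch.no_grad()
def audio_to_video(audio, audio_to_motion, render_maps, frame_generator,
                   tex_map, num_neighbors=2):
    meshes = audio_to_motion(audio)                  # stage 1: audio -> SMPL-X motion
    rgb, normal, depth = zip(*[render_maps(m, tex_map) for m in meshes])  # stage 2
    frames = []
    for t in range(len(meshes)):                     # stage 3: conditional GAN generator
        nbrs = [rgb[max(0, min(len(meshes) - 1, t + d))]
                for d in range(-num_neighbors, num_neighbors + 1) if d != 0]
        cond = torch.cat([rgb[t], normal[t], depth[t], *nbrs], dim=0)
        frames.append(frame_generator(cond))
    return torch.stack(frames)
```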
To view the results for all speakers, please refer to our Gallery
--- Ablation Study (Audio to Video) page.
Note: The background of each actor in the training sequence is not constant (it moves because the
camera is not completely still across the entire training video). This leads to slight changes in the background of the generated
results.
Input Mesh
Ours
(w/o
Depth + Normal)
Ours
(w/o
Textured Mesh)
Ours
We thank Kangle Deng, Yufei Ye, and Shubham Tulsiani for their helpful discussions. This project was partly supported by Ping An Research.
The website template is taken from Custom
Diffusion (which was built on
DreamFusion's project page).