We introduce a method to generate co-speech gesture video of an actor from audio input - an especially challenging task, particularly for generating realistic hands, when only a relatively small amount of video of the actor is available during training. Please play the audio in each video to listen to the input speech.
| Speaker | Input Mesh | Textured Mesh | Generated Video |
| --- | --- | --- | --- |
| Oliver | (video) | (video) | (video) |
| Conan | (video) | (video) | (video) |
| Seth* | (video) | (video) | (video) |
| Chemistry* | (video) | (video) | (video) |
* The hands of "Seth" and "Chemistry" in the videos generated from audio input contain cloudy artifacts because (1) the number of training frames for these examples is extremely small (< 7K frames), and (2) the SMPL-X mesh sequence generated from audio at inference can differ from the training mesh distribution. We observe that more than 25K training frames are needed for the GAN model to be robust to out-of-domain mesh sequences at inference (as in the case of "Oliver").
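For readers who want a concrete picture of the audio-to-video flow implied by the columns above (Input Mesh, Textured Mesh, Generated Video), the sketch below outlines the assumed three-stage pipeline. All function names, tensor shapes, and parameters here are hypothetical placeholders for illustration only; they are not the released API or the authors' implementation.

```python
# Minimal sketch of the assumed inference pipeline: audio -> SMPL-X mesh
# sequence -> textured render -> GAN-refined video. Placeholder stubs only.
import numpy as np


def speech_to_mesh(audio: np.ndarray, fps: int = 25) -> np.ndarray:
    """Placeholder: predict per-frame SMPL-X pose parameters from speech audio."""
    num_frames = int(len(audio) / 16000 * fps)  # assumes 16 kHz input audio
    return np.zeros((num_frames, 165), dtype=np.float32)  # illustrative pose dim


def render_textured_mesh(mesh_params: np.ndarray) -> np.ndarray:
    """Placeholder: rasterize the textured SMPL-X mesh for each frame."""
    num_frames = mesh_params.shape[0]
    return np.zeros((num_frames, 512, 512, 3), dtype=np.uint8)


def gan_refine(renders: np.ndarray) -> np.ndarray:
    """Placeholder: actor-specific GAN that maps coarse textured renders to
    photorealistic frames; per the footnote above, it is brittle when trained
    on fewer than ~25K frames and the inference meshes are out of distribution."""
    return renders


def audio_to_gesture_video(audio: np.ndarray) -> np.ndarray:
    """End-to-end: raw audio in, per-frame video tensor out."""
    mesh_seq = speech_to_mesh(audio)          # Input Mesh column
    coarse = render_textured_mesh(mesh_seq)   # Textured Mesh column
    return gan_refine(coarse)                 # Generated Video column


if __name__ == "__main__":
    dummy_audio = np.zeros(16000 * 3, dtype=np.float32)  # 3 s of silence
    video = audio_to_gesture_video(dummy_audio)
    print(video.shape)  # (75, 512, 512, 3) at 25 fps
```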