Given an audio input, we study the role of (1) depth and normal maps and (2) textured meshes in generating videos. We compare our full method, which uses depth maps, normal maps, and textured renderings of the meshes as input to generate video, against (1) Ours (w/o depth and normal maps), which drops the additional depth and normal map conditioning, and (2) Ours (w/o textured mesh), which uses renderings of untextured meshes as input. Please play the audio in each video to listen to the input speech.
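To make the three conditioning variants concrete, below is a minimal sketch of how the per-frame generator input could be assembled, assuming the conditioning maps are simply concatenated channel-wise; the function name, argument shapes, and channel layout are illustrative assumptions, not the exact implementation.

```python
import torch

def build_generator_input(mesh_render, depth=None, normals=None):
    """Assemble one frame of conditioning for the video generator (illustrative).

    mesh_render: (B, 3, H, W) rendering of the driving SMPL-X mesh
                 (textured for the full model, untextured for the
                 "Ours (w/o textured mesh)" ablation)
    depth:       (B, 1, H, W) depth map, omitted in the
                 "Ours (w/o depth and normal maps)" ablation
    normals:     (B, 3, H, W) normal map, omitted likewise
    """
    maps = [mesh_render]
    if depth is not None:
        maps.append(depth)
    if normals is not None:
        maps.append(normals)
    # Channel-wise concatenation: 3, 4, or 7 channels depending on the variant.
    return torch.cat(maps, dim=1)
```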
[Video comparison grid: for each subject (Oliver, Conan, Seth*, Chemistry*), the columns show the Input Mesh rendering alongside the three "Ours" variants described above.]
* The hands of "Seth" and "Chemistry" in the videos generated from audio contain cloudy artifacts because (1) the number of training frames for these examples is very small (< 7K frames), and (2) the SMPL-X mesh sequence generated from audio at inference can differ from the training mesh distribution. We observe that more than 25K training frames are needed for the GAN model to be robust to out-of-domain mesh sequences at inference (as in the case of "Oliver").