VOCA and synthesized audio

VOCA (Voice Operated Character Animation) is a speech-driven animation framework that recreates facial animation from audio and outputs the result both as a mesh sequence and as a rendered video of it. The publicly available implementation already provides a pre-trained model. It runs Mozilla DeepSpeech on the audio, and the resulting DeepSpeech speech features are what the network is trained on.
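To make that pipeline a bit more concrete, here is a rough, runnable sketch of how a speech-driven setup like this wires together. This is not VOCA's actual code: the window size, feature dimension, and the random stand-in "network" are invented purely for illustration.

```python
# Rough sketch of a VOCA-style inference loop -- NOT the project's actual code.
# The window size and the random "network" weights below are invented for illustration.
import numpy as np

NUM_VERTICES = 5023   # the FLAME head topology has 5023 vertices
FEATURE_DIM = 29      # DeepSpeech emits ~29 character logits per audio frame
WINDOW = 16           # audio feature frames fed to the net per animation frame (assumed)

rng = np.random.default_rng(0)
weights = rng.normal(size=(WINDOW * FEATURE_DIM, NUM_VERTICES * 3)).astype(np.float32) * 1e-4

def regress_offsets(window_feats):
    """Stand-in for the trained network: feature window -> per-vertex offsets."""
    return (window_feats.ravel() @ weights).reshape(NUM_VERTICES, 3)

def animate(template_vertices, audio_features):
    """Produce one displaced copy of the template mesh per animation frame."""
    frames = []
    for t in range(len(audio_features) - WINDOW):
        frames.append(template_vertices + regress_offsets(audio_features[t:t + WINDOW]))
    return frames

# Stand-ins for the FLAME template mesh and the per-frame DeepSpeech features.
template = np.zeros((NUM_VERTICES, 3), dtype=np.float32)
features = rng.normal(size=(200, FEATURE_DIM)).astype(np.float32)

mesh_sequence = animate(template, features)
print(len(mesh_sequence), "animation frames, each with", mesh_sequence[0].shape[0], "vertices")
```

The real model of course learns the mapping from speech features to vertex offsets; the point here is just the shape of the data flow: audio features in, a displaced copy of the template mesh out for every frame.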

Yennefer: A pot of tea helps to pass the evening. (DCTTS)

Geralt: Not going back, but not about to waste any time either; gonna use every minute wisely. (Game audio)

There are options to use a different base 3D mesh, which can even be generated from a single photo using another project they created. With the FLAME framework it is also possible to add natural eye and head movements. The PyRender camera can be adjusted as well, as demonstrated in the samples above.
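Adjusting the camera only takes a few lines of PyRender. This is a minimal sketch; the mesh file name and the camera pose are placeholders rather than the exact values used for the samples above.

```python
# Minimal PyRender sketch for repositioning the camera while rendering one mesh
# of the sequence. The file name and pose are placeholders, not VOCA's settings.
import numpy as np
import trimesh
import pyrender

mesh = pyrender.Mesh.from_trimesh(trimesh.load("frame_0001.ply"))
scene = pyrender.Scene(bg_color=[1.0, 1.0, 1.0])
scene.add(mesh)

# Move the camera back along +Z and nudge it up slightly.
camera_pose = np.eye(4)
camera_pose[:3, 3] = [0.0, 0.05, 0.6]
camera = pyrender.PerspectiveCamera(yfov=np.pi / 6.0)
scene.add(camera, pose=camera_pose)
scene.add(pyrender.DirectionalLight(intensity=3.0), pose=camera_pose)

renderer = pyrender.OffscreenRenderer(viewport_width=800, viewport_height=800)
color, _ = renderer.render(scene)   # color is an (800, 800, 3) image array
renderer.delete()
```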

Since it relies on DeepSpeech, my experience of using DeepSpeech itself to transcribe audio applies here too: it's essential to add some silence (about 0.3 s, after testing) before the speech actually starts, otherwise the words at the very beginning may not be recognized (not sure why). Adding the silence indeed gives a better result, just as I expected.
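Here is a small sketch of how the padding can be done with pydub; the file names are placeholders and 300 ms is simply the value that seemed to work in my testing.

```python
# Prepend ~0.3 s of silence so DeepSpeech doesn't drop the first word(s).
# File names are placeholders; 300 ms is just the value that worked for me.
from pydub import AudioSegment

audio = AudioSegment.from_wav("yennefer_line.wav")
padded = AudioSegment.silent(duration=300, frame_rate=audio.frame_rate) + audio
padded.export("yennefer_line_padded.wav", format="wav")
```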

Overall, VOCA definitely gives a more precise result, but it's not as natural as sda (speech-driven animation with a temporal GAN). And since I'm not really familiar with 3D rendering in Python, sda remains the better choice for me. VOCA is perhaps more applicable to video game development, but it was still interesting to discover this awesome project.

This is the result of using that weird-looking white mesh video as the driving video for the first-order motion model:

Gigantic mouth.