Given a single portrait image, a voice audio clip, and an optional set of other control signals, our approach produces high-quality, lifelike talking-face videos at 512 × 512 resolution and up to 40 FPS. The method is versatile and robust, and the generated talking faces faithfully imitate human facial expressions and head movements, reaching a high level of realism and liveliness. (All photorealistic portrait images published in this paper are virtual, non-existent identities.) Credit: arXiv (2024). DOI: 10.48550/arxiv.2404.10667
A team of AI researchers at Microsoft Research Asia has developed an AI application that converts still images of people and audio tracks into animations that accurately depict individuals speaking or singing the audio tracks with appropriate facial expressions.
The team published a paper explaining how they created the app on the arXiv preprint server. Video samples are available on the research project page.
The research team set out to create animations in which a still image appears to speak or sing along with a provided audio track while displaying lifelike facial expressions. The result is VASA-1, an AI system that transforms still images, whether captured with a camera, drawn, or painted, into what the team describes as "exquisitely synchronized" animation.
The group demonstrated the effectiveness of the system by posting short video clips of their test results. In one, a cartoon version of the Mona Lisa performs a rap song; in another, a photograph of a woman is turned into a singing performance; in a third, a picture of a man delivers a speech.
In each animation, the facial expressions shift with the words to emphasize what is being said. The researchers note that although the videos appear real, closer inspection can reveal flaws and other evidence of artificial generation.
The research team achieved this result by training the system on thousands of images showing a wide variety of facial expressions. They note that it currently produces 512 × 512 pixel video at 45 frames per second in offline batch mode (40 FPS when streaming online), and that generating a clip took an average of two minutes on a desktop-grade Nvidia RTX 4090 GPU.
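To make those numbers concrete, here is a minimal, entirely hypothetical Python sketch of the input/output contract such a system exposes: one portrait image and an audio clip in, a stream of 512 × 512 frames out, with a rough throughput check against the reported 45 FPS figure. The function names and the placeholder no-op generator are assumptions for illustration only, not Microsoft's actual VASA-1 API.

```python
# Hypothetical sketch (not VASA-1's actual API): one portrait plus an
# audio clip in, a stream of 512x512 video frames out.
import time
import numpy as np

FRAME_SHAPE = (512, 512, 3)  # resolution reported by the researchers
TARGET_FPS = 45              # offline generation rate reported

def generate_frames(portrait: np.ndarray, audio: np.ndarray, n_frames: int):
    """Stand-in generator: a real model would condition each frame on the
    portrait identity and the audio window around that frame's timestamp."""
    for _ in range(n_frames):
        # Placeholder frame; a real system would run the generator here.
        yield np.zeros(FRAME_SHAPE, dtype=np.uint8)

portrait = np.zeros(FRAME_SHAPE, dtype=np.uint8)  # dummy portrait image
audio = np.zeros(16_000 * 10, dtype=np.float32)   # 10 s of 16 kHz audio
n_frames = TARGET_FPS * 10                        # frames for a 10 s clip

start = time.perf_counter()
frames = list(generate_frames(portrait, audio, n_frames))
elapsed = time.perf_counter() - start
print(f"{len(frames)} frames in {elapsed:.3f}s "
      f"({len(frames) / elapsed:.0f} FPS with a no-op generator)")
```

With a real generator in the loop, hitting the reported rate would mean producing each 512 × 512 frame in roughly 22 ms, which is the constraint the two-minute-per-video figure reflects.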
The researchers suggest that VASA-1 could be used to generate highly lifelike avatars for games and simulations. At the same time, they acknowledge its potential for abuse and are therefore not making the system available for general use.
More information:
Sicheng Xu et al, VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time, arXiv (2024). DOI: 10.48550/arxiv.2404.10667
Project page: www.microsoft.com/en-us/research/project/vasa-1/
Journal information: arXiv
© 2024 Science X Network