
On Tuesday, Microsoft Research Asia announced VASA-1, an AI model that can create a synchronized animated video of a person talking or singing from a single photo and an existing audio track. In the future, it could power virtual avatars that render locally without requiring a video feed, or allow anyone with similar tools to take a photo of a person found online and make them appear to say whatever they want.
The abstract of the accompanying research paper, titled "VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time," claims that it "paves the way for real-time engagements with lifelike avatars that emulate human conversational behaviors." It is the work of Sicheng Xu, Guojun Chen, Yu-Xiao Guo, Jiaolong Yang, Chong Li, Zhenyu Zang, Yizhong Zhang, Xin Tong, and Baining Guo.
The VASA framework (short for "Visual Affective Skills Animator") uses machine learning to analyze a static image along with a speech audio clip. It can then generate a realistic video with precise facial expressions, head motion, and lip-syncing to the audio. It does not clone or simulate voices (as other Microsoft research does) but relies on an existing audio input that could be specially recorded or spoken for a particular purpose.
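As a rough illustration of the kind of pipeline described above (a single image plus recorded speech in, synchronized video frames out), here is a minimal Python sketch. It is not Microsoft's code: the encoder, audio-to-motion, and renderer functions are hypothetical placeholders standing in for learned models, included only to show the shape of the data flow.

```python
# Conceptual sketch only: placeholders mimic the stages of an
# audio-driven talking-face pipeline (identity encoding, audio-to-motion,
# frame rendering). None of this reflects VASA-1's actual architecture.
import numpy as np

def encode_identity(image: np.ndarray) -> np.ndarray:
    """Stand-in for a face/appearance encoder: reduce the photo to a feature vector."""
    return image.mean(axis=(0, 1))  # placeholder features

def audio_to_motion(audio: np.ndarray, sample_rate: int, fps: int = 40) -> np.ndarray:
    """Stand-in for a learned audio-to-motion model: map per-frame audio
    energy to a crude 'mouth openness' value, one per video frame."""
    samples_per_frame = sample_rate // fps
    n_frames = len(audio) // samples_per_frame
    frames = audio[: n_frames * samples_per_frame].reshape(n_frames, samples_per_frame)
    return np.abs(frames).mean(axis=1)  # louder audio -> wider mouth, roughly

def render_frames(identity: np.ndarray, motion: np.ndarray) -> list[np.ndarray]:
    """Stand-in for the generator/renderer: one 512x512 image per motion value."""
    return [np.full((512, 512, 3), min(255, int(m * 255)), dtype=np.uint8) for m in motion]

# Example: a blank "photo" and two seconds of synthetic 16 kHz audio
photo = np.zeros((512, 512, 3), dtype=np.uint8)
speech = np.random.randn(2 * 16000).astype(np.float32)
video = render_frames(encode_identity(photo), audio_to_motion(speech, 16000))
print(len(video), "frames at 40 fps ->", len(video) / 40, "seconds of video")
```

In a real system each placeholder would be a trained neural network, and the motion signal would describe full facial dynamics and head pose rather than a single value per frame.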
Microsoft claims the model significantly outperforms previous speech animation methods in terms of realism, expressiveness, and efficiency. To our eyes, it does look like an improvement over earlier single-image animation models.
AI research efforts to animate a single photo of a person or character stretch back at least a few years, but more recently researchers have been working on automatically syncing the generated video to an audio track. In February, an AI model called EMO: Emote Portrait Alive, developed by Alibaba's Institute for Intelligent Computing research group, attracted a lot of attention with an approach similar to VASA-1's that can automatically sync an animated photo to a provided audio track (it calls this "Audio2Video").
Training with YouTube clips
Microsoft researchers trained VASA-1 on the VoxCeleb2 dataset, created in 2018 by three researchers at the University of Oxford. According to the VoxCeleb2 website, the dataset contains "over 1 million utterances for 6,112 celebrities" extracted from videos uploaded to YouTube. VASA-1 can reportedly generate video at 512×512 pixel resolution and up to 40 frames per second with minimal latency, which means it could potentially be used for real-time applications such as video conferencing.
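Those reported numbers allow a quick back-of-envelope check of the real-time claim. In the sketch below, only the 512×512 resolution and 40 fps figures come from the report; the network delay is an illustrative assumption.

```python
# Per-frame time budget implied by the reported 40 fps output,
# plus pixel throughput at the reported 512x512 resolution.
target_fps = 40
width = height = 512

frame_budget_ms = 1000 / target_fps          # 25 ms to generate each frame
pixels_per_second = width * height * target_fps

# Assumed (not reported) network round trip for a video call.
assumed_network_ms = 100
end_to_end_ms = frame_budget_ms + assumed_network_ms

print(f"Generation budget per frame: {frame_budget_ms:.0f} ms")
print(f"Pixel throughput: {pixels_per_second / 1e6:.1f} megapixels/s")
print(f"End-to-end with assumed network delay: {end_to_end_ms:.0f} ms")
```

Anything in the low hundreds of milliseconds end to end is generally workable for interactive video calls, which is why the minimal-latency claim matters for the video-conferencing scenario.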
To show off the model, Microsoft created a VASA-1 research page featuring many sample videos of the tool in action, including people singing and speaking in sync with pre-recorded audio tracks. They show how the model can be controlled to express different moods or change its eye gaze. The examples also include some more fanciful generations, such as Mona Lisa rapping to an audio track of Anne Hathaway performing a "Paparazzi" song on Conan O'Brien.
For privacy reasons, the researchers say, each example photo on the page (aside from the Mona Lisa) was AI-generated by StyleGAN2 or DALL-E 3. But it's obvious that the technique could equally apply to photos of real people, though it will likely work better if a person appears similar to a celebrity present in the training dataset. Still, the researchers say that deepfaking real humans isn't their intention.
"We are exploring visual affective skill generation for virtual, interactive characters [sic], NOT impersonating any person in the real world. This is only a research demonstration and there's no product or API release plan," reads the site.
While the Microsoft researchers tout potential positive applications such as enhancing educational equity, improving accessibility, and providing therapeutic companionship, the technology could also easily be misused. For example, it could allow people to fake video chats, make real humans appear to say things they never actually said (especially when paired with a cloned voice track), or enable harassment based on a single social media photo.
At the moment, the generated video still looks imperfect in some ways, but it could be fairly convincing for people who don't know to expect an AI-generated animation. The researchers say they are aware of this, which is why they are not publicly releasing the code that powers the model.
"We are opposed to any behavior to create misleading or harmful contents of real persons, and are interested in applying our technique for advancing forgery detection," write the researchers. "Currently, the videos generated by this method still contain identifiable artifacts, and the numerical analysis shows that there's still a gap to achieve the authenticity of real videos."
And although VASA-1 is only a research demonstration, Microsoft is far from the only group developing similar technology. If the recent history of generative AI is any guide, it's likely only a matter of time before similar technology becomes open source and freely available, and it will very likely continue to improve in realism over time.