Not so long ago, some apps let you bring your photos to life with GIF-like movements. Now we have AI systems that can make photos dance and sing. A team of AI researchers at Microsoft Research Asia has created an application that can turn a static image of a person and an audio track into an animation. And it is more than a simple animation: the output reportedly shows the person in the image speaking or singing along to the audio track, complete with appropriate facial expressions.
VASA, the team’s new framework, generates lifelike talking faces of virtual characters with appealing visual affective skills (VAS) from a single still image and an audio clip. “Our premier model, VASA-1, not only produces lip movements that are exquisitely synchronized with the audio, but also captures a large spectrum of facial nuances and natural head motions that contribute to the perception of authenticity and liveliness,” the researchers wrote in a paper describing the framework.
According to the team, the core innovations are a holistic model that generates facial dynamics and head motion in a face latent space, and the construction of such an expressive, disentangled face latent space from videos. Through extensive experiments and evaluation on a set of new metrics, the research team said its method significantly outperforms previous methods along various dimensions.
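To make the idea of a disentangled face latent space more concrete, here is a minimal PyTorch sketch of the general technique rather than the authors’ actual architecture: a shared encoder splits a face image into separate codes for identity, head pose, and facial dynamics, and a decoder recombines them, so new motion codes can be generated per frame while the identity code stays fixed. All layer sizes, names, and the toy decoder resolution are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Illustrative sketch only -- not the VASA-1 code. It mimics the idea of
# encoding a face image into separate, disentangled latent codes (identity,
# head pose, facial dynamics) and decoding a frame by recombining them, so
# motion can be edited independently of who the person is.
class DisentangledFaceAutoencoder(nn.Module):
    def __init__(self, latent_dim=128):
        super().__init__()
        # Shared convolutional backbone over a 512x512 RGB face crop.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=4), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Separate heads produce the disentangled codes.
        self.to_identity = nn.Linear(128, latent_dim)   # who the person is
        self.to_pose = nn.Linear(128, 6)                # head rotation + translation
        self.to_dynamics = nn.Linear(128, latent_dim)   # expression / lip motion
        # Toy decoder: recombined codes -> a low-resolution frame.
        self.decode = nn.Sequential(
            nn.Linear(latent_dim + 6 + latent_dim, 64 * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (64, 8, 8)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def encode(self, face):
        h = self.backbone(face)
        return self.to_identity(h), self.to_pose(h), self.to_dynamics(h)


if __name__ == "__main__":
    model = DisentangledFaceAutoencoder()
    still = torch.rand(1, 3, 512, 512)              # a single still face image
    identity, pose, dynamics = model.encode(still)
    # Animating means keeping `identity` fixed while a generator supplies new
    # pose/dynamics codes per frame; here random codes stand in for that.
    frame = model.decode(torch.cat(
        [identity, torch.randn_like(pose), torch.randn_like(dynamics)], dim=-1))
    print(frame.shape)  # torch.Size([1, 3, 32, 32])
```

The reported appeal of this kind of separation is that head motion and facial dynamics can be generated jointly in the latent space while the person’s appearance comes entirely from the single input photo.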
“Our method not only provides high-quality videos with realistic facial and head dynamics, but also supports the online generation of 512×512 videos at up to 40 FPS with negligible start-up delay. This paves the way for real-time engagement with lifelike avatars that emulate human conversational behavior,” the researchers wrote.
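Online generation with a negligible start-up delay essentially means frames are produced while the audio is still streaming in, rather than after the whole clip has been processed. The sketch below illustrates that chunked loop in plain Python; the chunk size, the stand-in functions, and the simulated 20 ms per-frame cost are assumptions for illustration, not VASA-1’s actual interface or measured performance.

```python
import time

TARGET_FPS = 40          # online frame rate cited by the researchers
CHUNK_SECONDS = 0.1      # audio consumed per step (assumed value)

def generate_motion_chunk(audio_chunk):
    # Placeholder for the model mapping a short audio chunk to motion latents.
    return [object()] * int(TARGET_FPS * CHUNK_SECONDS)

def decode_frame(motion_latent):
    # Pretend rendering one 512x512 frame takes ~20 ms (made-up figure).
    time.sleep(0.02)
    return "frame"

def stream(audio_chunks):
    frames, start = 0, time.perf_counter()
    for chunk in audio_chunks:                       # audio arrives piecewise
        for latent in generate_motion_chunk(chunk):  # latents for this chunk
            decode_frame(latent)                     # emit frames immediately
            frames += 1
    elapsed = time.perf_counter() - start
    print(f"{frames} frames in {elapsed:.2f}s -> {frames / elapsed:.1f} fps")

# Two seconds of dummy audio delivered as twenty 0.1-second chunks.
stream([b"\x00" * 1600] * 20)
```

Because the first frames can be shown as soon as the first chunk is processed, a streaming setup like this keeps the start-up delay low.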
What is VASA-1?
Microsoft researchers claim their new method does more than just generate lip movements synchronized with speech; it can also produce a wide range of expressive facial nuances and natural head movements. “It can process audio of any length and consistently output seamless talking face videos,” the researchers add.
Researchers working on VASA-1 set out on the ambitious task of bringing static images to life, making them speak, sing, and express emotions in perfect sync with an audio track. VASA-1 is the result of their effort to enable AI systems to transform static visuals such as photographs, drawings, and paintings into animations synchronized with audio. As for control, the researchers note that the diffusion model can accept optional conditioning signals such as main eye gaze direction, head distance, and emotion offset.
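To give a sense of how optional conditioning signals of this kind might be wired in, here is a small PyTorch sketch; it is an assumption-laden illustration, not the paper’s actual interface. Each signal is projected into a shared embedding space, and a learned “null” embedding stands in for anything the caller omits, so the generator can be steered by any subset of gaze, distance, and emotion.

```python
import torch
import torch.nn as nn

class ConditionEmbedder(nn.Module):
    """Hypothetical embedder for optional control signals (not VASA-1 code)."""

    def __init__(self, dim=128):
        super().__init__()
        self.gaze = nn.Linear(2, dim)      # gaze direction as (yaw, pitch)
        self.distance = nn.Linear(1, dim)  # head-to-camera distance, a scalar
        self.emotion = nn.Linear(8, dim)   # emotion offset vector (assumed size)
        # Learned "null" embeddings used when a signal is not provided, so the
        # model runs with any subset of conditions.
        self.null = nn.ParameterDict({
            name: nn.Parameter(torch.zeros(dim))
            for name in ("gaze", "distance", "emotion")
        })

    def forward(self, gaze=None, distance=None, emotion=None):
        parts = []
        for name, value, proj in (("gaze", gaze, self.gaze),
                                  ("distance", distance, self.distance),
                                  ("emotion", emotion, self.emotion)):
            parts.append(proj(value) if value is not None else self.null[name])
        return torch.stack(parts).sum(dim=0)  # one combined conditioning vector


embedder = ConditionEmbedder()
# Steer only the gaze; distance and emotion fall back to their null embeddings.
cond = embedder(gaze=torch.tensor([0.3, -0.1]))
print(cond.shape)  # torch.Size([128])
```

A vector like `cond` would then be fed to the motion generator alongside the audio features, which is how an otherwise automatic animation can be nudged to look in a particular direction or carry a particular mood.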
Based on the research paper, the team showcased the capabilities of the VASA-1 system through a number of video clips. In one of them, a cartoon version of the Mona Lisa comes to life and breaks into a rap song, her facial expressions and lip movements matching the lyrics perfectly. Another example shows a photo of a woman transformed into a singing performer, while yet another shows a drawn portrait of a man giving a speech, his facial expression changing naturally to emphasize the words being spoken.
How was VASA-1 made?
According to the research paper, the VASA-1 breakthrough came through an extensive training process that involved exposing the AI system to thousands of images depicting different facial expressions. This vast dataset reportedly allows the system to learn and accurately reproduce the nuances of human emotions along with speech patterns. The current iteration of VASA-1 produces smooth, high-resolution visuals of 512×512 pixels at a frame rate of 45 fps. These realistic animations reportedly take an average of two minutes to render, using the computing power of a desktop-grade Nvidia RTX 4090 GPU.
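For a rough sense of what those figures imply, the snippet below does the arithmetic; the one-minute clip length is a hypothetical example, not a number from the paper.

```python
# Frame rates are the ones reported in the article; the clip length is made up.
fps_offline, fps_online = 45, 40
clip_seconds = 60

print(f"A {clip_seconds}s clip at {fps_offline} fps is "
      f"{fps_offline * clip_seconds} frames.")           # 2700 frames
print(f"To sustain {fps_online} fps online, each 512x512 frame must be "
      f"produced in under {1000 / fps_online:.0f} ms.")  # 25 ms per frame
```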
Although the research paper does not explicitly mention a release date, it states that VASA-1 brings us closer to a future in which AI avatars can interact naturally, suggesting that it is currently a research prototype. While VASA-1’s potential use cases are wide-ranging, the researchers acknowledge its potential for abuse. As a precaution, the team has reportedly decided to withhold public access to VASA-1, acknowledging the need to manage such advanced technologies responsibly in order to mitigate unintended consequences and misuse.
These animations seamlessly combine visuals and audio, giving them a realistic appeal, but the researchers say that on closer inspection you may notice subtle flaws and telltale signs typical of AI-generated content. Nevertheless, the shared examples demonstrate the technical excellence of the team behind VASA-1.