Microsoft Research Asia has unveiled an AI model that can generate horrifyingly realistic deepfake videos from a single still image and audio track. How can we trust what we see and hear online in the future?
Artificial intelligence systems have been blowing past major benchmarks for the past few years, and many people are already worried that they'll be put out to pasture prematurely, replaced by algorithms.
We've recently watched fairly limited smart gadgets transform into powerful everyday assistants and essential productivity tools. There are also models that can generate realistic sound effects for silent video clips, or create stunning footage from text prompts. Microsoft's VASA-1 framework looks to be another big step forward.
After training the model on footage of nearly 6,000 real-world talking faces from the VoxCeleb2 dataset, the technology can produce frighteningly lifelike videos in which newly animated subjects not only lip-sync accurately to a provided voice audio track, but also display a range of facial expressions and natural head movements, all from a single still headshot.
It's similar in many ways to the Audio2Video diffusion model from Alibaba's Institute for Intelligent Computing that appeared a few months ago, but even more photorealistic and accurate. VASA-1 can reportedly produce synced videos at 512×512 pixels and 40 frames per second with "negligible starting latency."
All of the reference photos used to demonstrate the project were AI-generated by StyleGAN2 or DALL-E, though one notable example shows off the framework's ability to step outside its training set: a rapping Mona Lisa!
The project page features many examples of talking and singing videos generated from still images and matched with audio tracks, but the tool also includes optional controls for setting facial dynamics such as emotion, as well as head pose, gaze direction, and distance from a virtual video camera. Powerful stuff.
"The emergence of AI-generated talking faces offers a window into a future where technology amplifies the richness of human-human and human-AI interactions," reads the introduction to a paper detailing the project. "Such technology holds the promise of enriching digital communication, increasing accessibility for those with communicative impairments, transforming education methods with interactive AI tutoring, and providing therapeutic support and social interaction in healthcare."
While all very commendable, the researchers also acknowledge the potential for misuse. Wading through mountains of online news every day already feels like an impossible battle to separate fact from outright fabrication; now imagine having tools at your disposal that can make it look like practically anyone is saying whatever you want them to.
The possibilities for realistic, convincing deception are many: playing a harmless prank on a relative with a FaceTime call from their favorite Hollywood actor or pop star, implicating an innocent person in a serious crime by posting an online confession, scamming someone out of money by impersonating their beloved grandchild in trouble, or having a prominent politician voice support for a controversial topic.
However, content produced by the VASA-1 model "still contains identifiable artifacts," and the researchers have no plans to make the platform publicly available "until we are certain that the technology will be used responsibly and in accordance with proper regulations."
A paper detailing the project is available on the arXiv preprint server.
Source: Microsoft Research