I Ryu/Visual China Group/Getty Images
A Microsoft sign is seen at its headquarters in Redmond, Washington, on March 19, 2023.
New York (CNN) —
Thanks to new artificial intelligence technology from Microsoft, the Mona Lisa can now do more than just smile.
Last week, Microsoft researchers detailed a new AI model they developed that can take a still image of a face and an audio clip of a person speaking and automatically generate a realistic video of that face speaking. The videos can be created from photorealistic faces as well as cartoons and artwork, complete with convincing lip sync and natural facial and head movements.
In one demo video, researchers showed an animated Mona Lisa reciting a comedic rap by actor Anne Hathaway.
The output of the AI model, called VASA-1, is both fascinating and a bit jarring in its realism. Microsoft said the technology could be used in education, for “improving accessibility for individuals with communication difficulties,” or even to create virtual companions for humans. But it is also easy to see how the tool could be abused to impersonate real people.
This is a concern that extends beyond Microsoft. As more tools emerge to create convincing AI-generated images, videos and audio, experts worry that their misuse could lead to new forms of misinformation. Others fear the technology could further disrupt creative industries, from film to advertising.
Microsoft said it has no plans to release the VASA-1 model to the public any time soon. The move is similar to how Microsoft partner OpenAI is handling concerns around its own AI-generated video tool, Sora: OpenAI teased Sora in February, but so far it has made the tool available only to select professional users and cybersecurity professionals for testing purposes.
“We oppose any activity that creates misleading or harmful content about real people,” Microsoft researchers said in a blog post, adding that the company “does not plan to release” the product publicly “until we are confident that the technology will be used responsibly and in accordance with appropriate regulations.”
Microsoft’s new AI model was trained on a large number of videos of people’s faces talking, and the researchers said it was designed to recognize natural facial and head movements, including “lip movements, (non-lip) facial expressions, gaze, and blinking, among other things.” The result is more lifelike video when VASA-1 animates a still photo.
For example, in one demo video set to an audio clip of someone sounding excited while apparently playing a video game, the speaking face has furrowed brows and pursed lips.
The tool can also be directed to generate videos in which the subject looks in a certain direction or expresses a certain emotion.
If you look closely, you can still see signs that the videos are machine-generated, such as infrequent blinking and exaggerated eyebrow movements. But Microsoft said it believes its model “significantly outperforms” other, similar tools and “paves the way for real-time engagements with lifelike avatars that emulate human conversational behaviors.”