Microsoft Research Asia scientists have come up with a way to generate "lifelike audio-driven talking faces" in real time, using just a single portrait photo and a speech audio track.
Called VASA-1, the first iteration of the "visual affective skills" framework can produce very lifelike avatars - some would call them deepfakes - that emulate human conversational behaviours.
The researchers can generate 512 by 512 pixel videos at up to 40 frames per second, with a startup latency of just 170 milliseconds.
The source image can be a likeness of a real human, an artificial intelligence-generated portrait, or even a painting, such as Leonardo da Vinci's Mona Lisa.
VASA-1 is built on diffusion AI model technology, which generates images by progressively removing noise. The researchers have also made it possible to add controls to the talking heads, so as to give them human-like emotional behaviours, as well as different poses, viewing angles, and expressions.
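To make the idea of "removing noise" concrete, here is a minimal, generic sketch of a conditional diffusion sampling loop in Python. This is not VASA-1's code, which has not been released; the function names, the conditioning format, and the dummy noise-prediction network are all assumptions used purely for illustration of how a diffusion model can be steered by audio features and control signals.

import numpy as np

# Hypothetical illustration only: a generic DDPM-style reverse-diffusion loop,
# conditioned on audio features and control signals (gaze, head pose, emotion).
# All names below are assumptions, not VASA-1's actual interface.

T = 50                                  # number of denoising steps
betas = np.linspace(1e-4, 0.02, T)      # noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def predict_noise(x, t, audio_feat, controls):
    """Stand-in for a learned network that predicts the noise in x,
    given the timestep and the conditioning signals."""
    return np.zeros_like(x)             # dummy: a real model would be trained

def sample_motion_latent(audio_feat, controls, dim=256, seed=0):
    """Generate one facial-dynamics latent by iteratively removing noise."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(dim)        # start from pure Gaussian noise
    for t in reversed(range(T)):
        eps = predict_noise(x, t, audio_feat, controls)
        # Standard DDPM update: strip the predicted noise at step t.
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps) / np.sqrt(alphas[t])
        if t > 0:
            x = x + np.sqrt(betas[t]) * rng.standard_normal(dim)
    return x                            # a decoder would turn this into a frame

# Example conditioning: audio features plus explicit pose/emotion controls.
audio_feat = np.zeros(128)
controls = {"gaze": (0.1, -0.2), "head_distance": 0.9, "emotion": "happy"}
latent = sample_motion_latent(audio_feat, controls)

In this kind of setup, the explicit control dictionary is what lets a user nudge the generated head towards a particular gaze direction, camera distance, or emotional expression while the audio still drives the lip movement.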
VASA-1 can also handle singing, and speech in languages other than English.
While the VASA-1 talking heads are not yet good enough to be easily mistaken for real people, the researchers are concerned about the possibility that the technology could be abused. They write:
Our research focuses on generating visual affective skills for virtual AI avatars, aiming for positive applications. It is not intended to create content that is used to mislead or deceive. However, like other related content generation techniques, it could still potentially be misused for impersonating humans. We are opposed to any behaviour to create misleading or harmful contents of real persons, and are interested in applying our technique for advancing forgery detection. Currently, the videos generated by this method still contain identifiable artifacts, and the numerical analysis shows that there's still a gap to achieve the authenticity of real videos.
While acknowledging the possibility of misuse, it's imperative to recognise the substantial positive potential of our technique. The benefits – such as enhancing educational equity, improving accessibility for individuals with communication challenges, offering companionship or therapeutic support to those in need, among many others – underscore the importance of our research and other related explorations. We are dedicated to developing AI responsibly, with the goal of advancing human well-being.
Due to the risk of abuse, the researchers say they will not release an online demo, an application programming interface (API), a product, or any similar implementation until they are certain the technology will be used responsibly, "and in accordance with proper regulations".
Elsewhere, services that can clone voices quickly have already been released, such as ElevenLabs, which handles multiple languages and accents.
Comments
It's literally here already. An example was posted a few weeks ago of voice sampling by OpenAI called 'Voice Engine' and it's flawless. The only impediment would be the ability of a scammer to type fast enough to convert in real time to audio down the phone line, and to know enough about both the person being called and the supposed caller to not trigger suspicion. Of course there would only be suspicion in a world where the average person knows this is possible, which isn't the case right now. But it will be very soon. A lot of people are going to have to have conversations with their elderly relatives sometime in the next couple of years about AI impersonation, and setting up a verbal password, etc.
This reminds me of Alfred Nobel. Nobel invented dynamite for industrial purposes - it had so much more power than gunpowder.
He was also a pacifist who did not want his inventions used for warfare, but for construction.
It's the classic ethical question: should a person think about how their research could be misused before they embark on it?