Google DeepMind researchers just created an AI that can lip read better than humans

By Koh Wanzi - 25 Nov 2016

Artificial intelligence is on a tear. Researchers at Google’s DeepMind AI division, the same division that produced the Go-playing AlphaGo, have just created what they say is the most accurate lip-reading software to date.

The software uses an artificial neural network that learned to lip read by watching more than 5,000 hours of BBC TV footage. The researchers trained it to annotate video with an impressive 46.8 per cent accuracy.

That may seem a bit lackluster at first, but a professional human lip-reader was only able to get the right word 12.4 per cent of the time when given the same footage.

DeepMind’s software, dubbed “Watch, Listen, Attend, and Spell”, is particularly impressive because it was tested on natural, unscripted conversations from BBC politics shows like Question Time, Newsnight, and The World Today.

In comparison, similar work published this month by a separate group at the University of Oxford used specially recorded footage of volunteers speaking sentences with fixed structures. Their program, LipNet, achieved 93.4 per cent accuracy, compared with 52.3 per cent for humans.

However, the video DeepMind researchers used included 118,000 different sentences and around 17,500 unique words, while LipNet’s test database had just 51 unique words.
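To put the two sets of figures side by side, here is a rough back-of-the-envelope sketch in Python. It uses only the numbers quoted above. The dictionary layout and the function name `advantage_over_human` are made up for illustration, and it assumes the accuracy figures are broadly comparable, which may not hold since the two studies used different footage and different human baselines.

```python
# Figures as quoted in this article (accuracy in per cent).
# The vocabulary sizes are the approximate unique-word counts given above.
results = {
    "DeepMind (BBC footage)": {"machine": 46.8, "human": 12.4, "vocabulary": 17500},
    "Oxford LipNet (scripted)": {"machine": 93.4, "human": 52.3, "vocabulary": 51},
}


def advantage_over_human(machine_pct, human_pct):
    """Return how many times more accurate the software was than the human baseline."""
    return machine_pct / human_pct


for name, r in results.items():
    ratio = advantage_over_human(r["machine"], r["human"])
    print(f"{name}: {ratio:.1f}x human accuracy, about {r['vocabulary']} unique words")
```

On these numbers, DeepMind’s software beats its human baseline by a wider margin than LipNet does, despite working with a vocabulary several hundred times larger.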

Obvious applications for the software include helping hearing-impaired people understand conversations, or annotating films and video. That said, these developments may also raise the specter of public surveillance, where an AI program could be used to tease out what people were saying on security footage.

And while researchers say there is still a long way to go between transcribing well-lit, high-resolution TV footage and grainy, low-frame-rate CCTV video, that doesn’t change the fact that such surveillance could one day be possible. Google also recently announced a switch to a new Google Neural Machine Translation (GNMT) system that significantly improved the quality of its language translation, yet another testament to the growing ability of AI to parse our communications.

Source: arXiv via New Scientist
