The AI system annotated TV footage with 46.8 percent accuracy
Researchers from Google’s AI division DeepMind and the University of Oxford have used artificial intelligence to create the most accurate lip-reading software ever.
Using thousands of hours of TV footage from the BBC, scientists trained
a neural network to annotate video footage with 46.8 percent accuracy.
That might not seem that impressive at first — especially compared to AI
accuracy rates when transcribing audio — but tested on the same
footage, a professional human lip-reader was only able to get the right
word 12.4 percent of the time.
The research follows similar work published by a separate group at the University of Oxford earlier this month.
Using related techniques, these scientists were able to create a
lip-reading program called LipNet that achieved 93.4 percent accuracy in
tests, compared to 52.3 percent human accuracy.
However, LipNet was
only tested on specially-recorded footage that used volunteers speaking
formulaic sentences. By comparison, DeepMind’s software — known as
“Watch, Listen, Attend, and Spell” — was tested on far more challenging
footage, transcribing natural, unscripted conversations from BBC
politics shows.
More than 5,000 hours of footage from TV shows including Newsnight, Question Time, and the World Today
were used to train DeepMind’s “Watch, Listen, Attend, and Spell”
program. The videos included 118,000 different sentences and some
17,500 unique words, compared to LipNet’s test database, whose videos
contained just 51 unique words.
DeepMind’s researchers suggest that the program could
have a host of applications, including helping hearing-impaired people
understand conversations. It could also be used to annotate silent
films, or allow you to control digital assistants like Siri or Alexa by
just mouthing words to a camera (handy if you’re using the program in
public).
But when most people learn that an AI program has learned
how to lip-read, their first thought is how it might be used for
surveillance. Researchers say that there’s still a big difference between
transcribing brightly lit, high-resolution TV footage and grainy CCTV
video with a low frame rate, but you can’t ignore the fact that
artificial intelligence seems to be closing this gap.