Google researchers have developed a deep-learning audio-visual model that can isolate one speaker's voice in a cacophony of noise.
The 'cocktail party effect' -- the ability to tune out the din of a crowd and focus on a single person's voice -- comes easily to humans but not machines.
It's also an obstacle to a Google Glass application I'd personally like to see developed one day: a real-time speech-recognition and live-transcription system to support hearing-aid wearers.
Hearing aids help my wife hear better, but often with so much static and crackling that she's mostly lip-reading. At the age of 15 she could hear perfectly. Now her hearing worsens each year and will almost certainly disappear at some point in the future. Only stem-cell magic could reverse the situation.
I thought my Glass idea was a great fallback until I wondered how it would pick the right voice out of a crowd -- the situation in which she finds it hardest to hear -- and live-transcribe the target speaker.
Voice separation turns out to be a hard nut to crack, but Google's AI researchers may have part of the answer to my Glass dream in the form of a deep-learning audio-visual model that can isolate one person's speech from a mixture of sounds.
The scenario they present is two speakers standing side by side, jabbering simultaneously. The technique hasn't been proven in a real-world crowd, but it does work on a video of two speakers sharing a single audio track.
Video: Google's research combines the auditory and visual signals to separate speakers.
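Separation systems of this kind typically work by predicting a time-frequency mask for the target speaker and applying it to the mixture's spectrogram. As a toy illustration (not Google's actual method), the sketch below computes an "ideal" mask from two known signals standing in for speakers -- the part a trained network, guided by the visual signal, would have to predict -- and uses it to recover one signal from the mix:

```python
import numpy as np
from scipy.signal import stft, istft

# Toy sketch of mask-based source separation. The two sine tones are
# stand-ins for two speakers; in a real audio-visual system a neural
# network would predict the mask from the mixture plus video of the
# speaker's face. Here we compute an "ideal ratio mask" from the known
# sources purely to show the masking mechanism.

fs = 8000                          # sample rate in Hz
t = np.arange(fs) / fs             # one second of samples
s1 = np.sin(2 * np.pi * 440 * t)   # "speaker 1": 440 Hz tone
s2 = np.sin(2 * np.pi * 1200 * t)  # "speaker 2": 1200 Hz tone
mix = s1 + s2                      # single-track mixture

# Short-time Fourier transforms of the sources and the mixture
_, _, Z1 = stft(s1, fs=fs, nperseg=256)
_, _, Z2 = stft(s2, fs=fs, nperseg=256)
_, _, Zm = stft(mix, fs=fs, nperseg=256)

# Ideal ratio mask for speaker 1: fraction of each time-frequency
# bin's energy that belongs to that speaker
mask = np.abs(Z1) / (np.abs(Z1) + np.abs(Z2) + 1e-8)

# Apply the mask to the mixture's spectrogram and invert to a waveform
_, est1 = istft(mask * Zm, fs=fs)
est1 = est1[: len(s1)]

corr = np.corrcoef(est1, s1)[0, 1]
print(f"correlation of estimate with target speaker: {corr:.3f}")
```

Because the two tones occupy different frequency bins, the mask recovers "speaker 1" almost perfectly; real overlapping speech is far messier, which is where the visual cue earns its keep.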