Speech2Face is an advanced neural network developed by MIT scientists and trained to recognize certain facial features and reconstruct people’s faces just by listening to the sound of their voices.
You’ve probably already heard about AI-powered cameras that can recognize people by analyzing their facial features, but what if artificial intelligence could figure out what you look like from the sound of your voice alone, without ever comparing it to a database? That’s exactly what a team of scientists at MIT has been working on, and the results of their work are impressive, if imperfect. While their neural network, named Speech2Face, can’t yet reconstruct a person’s exact facial features from their voice, it gets plenty of details right.
“Our model is designed to reveal statistical correlations that exist between facial features and voices of speakers in the training data,” the creators of Speech2Face said. “The training data we use is a collection of educational videos from YouTube, and does not represent equally the entire world population. Therefore, the model—as is the case with any machine learning model—is affected by this uneven distribution of data.”
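At a high level, the paper describes a voice encoder that turns a speech spectrogram into a face feature vector, which a pre-trained face decoder then renders as a canonical, front-facing portrait. The sketch below illustrates that idea; it is not the authors’ code, and the layer sizes, input shape, and the `pretrained_face_decoder` placeholder are illustrative assumptions.

```python
# Hedged sketch of a Speech2Face-style pipeline (illustrative, not the authors' code).
# A voice encoder maps a speech spectrogram to a face feature vector; a frozen,
# pre-trained face decoder (not implemented here) would turn that vector into an image.
import torch
import torch.nn as nn

class VoiceEncoder(nn.Module):
    """Maps a complex spectrogram of speech to a face feature vector."""
    def __init__(self, feature_dim: int = 4096):
        super().__init__()
        # Convolutional stack over the (frequency x time) spectrogram;
        # two input channels hold the real and imaginary parts.
        self.conv = nn.Sequential(
            nn.Conv2d(2, 64, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(128, 256, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # pool over remaining frequency/time extent
        )
        self.fc = nn.Linear(256, feature_dim)

    def forward(self, spectrogram: torch.Tensor) -> torch.Tensor:
        # spectrogram: (batch, 2, freq_bins, time_frames)
        x = self.conv(spectrogram).flatten(1)
        return self.fc(x)

# During training, the voice-derived feature is regressed toward the face feature
# that a pre-trained face recognition network extracts from a frame of the same
# video; at inference, a frozen face decoder turns the feature into a portrait.
encoder = VoiceEncoder()
dummy_spec = torch.randn(1, 2, 257, 600)   # roughly a few seconds of audio; shape is an assumption
face_feature = encoder(dummy_spec)          # (1, 4096) face feature vector
# face_image = pretrained_face_decoder(face_feature)  # hypothetical frozen decoder
```

Because only the voice encoder is trained against features from face frames of the same speakers, the output is a statistical guess at an average-looking face consistent with the voice, not a reconstruction of the specific person.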
You can tell a lot about a person from the way they speak alone. For example, you can most likely tell whether someone is male or female, or young or old, but Speech2Face goes beyond that. It can determine fairly accurately the shape of someone’s nose, cheekbones, or jaw from their voice alone, because the structure of the nose and the other bones in our faces shapes the way we sound.
Ethnicity is also something Speech2Face can often pinpoint after listening to someone’s voice for just a few seconds, as people from the same groups tend to share similar attributes. The AI takes a variety of factors into account, and it sometimes produces impressive results, but it’s still a work in progress.
In some cases, the AI had difficulty determining what the speaker might look like. Factors such as accent, spoken language, and voice pitch caused gross speech-to-face mismatches in which gender, age, or ethnicity were completely wrong. For example, men with particularly high-pitched voices were often rendered as female, while women with deep voices were rendered as male. Asian speakers using fluent English were also reconstructed as looking less Asian than when they spoke their native language.
“In a way, the system is a bit like your racist uncle. He feels like he can always tell a person’s race or ethnicity based on the way they speak, but he’s often wrong,” photographer Thomas Smith said about Speech2Face.
Still, despite its limitations, Speech2Face offers a glimpse into a future of artificial intelligence that is as impressive as it is unsettling. Imagine a world where a few seconds of recorded voice are enough for a neural network to put together an accurate portrait. Sure, it could help identify criminals, but what’s to stop bad actors from using the same technology for nefarious purposes?