The Origins of Video-Chat Voice

We sound strange on Zoom, Skype, and FaceTime. Why? Photograph from Shutterstock.

One evening, my three roommates and I were scattered across our common area when one of our friends FaceTimed in. I wanted to hear how he was holding up, and my roommate put him on speakerphone—but, as his voice wobbled through, its muddiness irked me. The sound coming through the phone didn’t have his voice’s normal shape or elasticity. I felt deprived. We accept odd failures from our technology; I would never expect a real-time video to overcome the uncanniness inherent in turning three dimensions into two. But since our voices are invisible—just air—something in me figured they should travel better.

Days later, overhearing one of my roommates talking with his co-workers on a Google Hangouts call, I wondered again about video-chat voice—the slightly nasal approximation of ourselves that we hear whenever we’re filtered through videoconferencing technology. It’s how our Internet interlocutors sound whether they’re across the world or, these days, across town. It’s become the sound of social distancing. What accounts for its particular character?

To find out, I arranged a FaceTime call with Chris Kyriakakis, an electrical- and computer-engineering professor at the University of Southern California and a chief audio scientist for Syng, a loudspeaker company. Kyriakakis is an expert in the re-creation and perception of sound; he has worked on a multi-university research team aiming to digitally replicate the acoustics of Byzantine-era churches. When discussing the less exalted subject of video-chat voice, he explained that, to maximize the clarity of the sound that my computer sends, I should pay attention to two factors. One is the distance between my mouth and the microphone. If we met up in person, he said, and I sat six feet away from him, his brain would focus on my voice and filter out any background noise. But microphones do not hear as human ears do, and are not so discerning; they merely pick up on what’s loudest, and when a talker is far away, other sounds compete with that person’s voice. The key is to shorten the distance. Kyriakakis did so by wearing headphones and a mic; as a result, his transmitted voice had a little more warmth than mine, since I was using my laptop’s microphone and—not wanting to crowd the camera—sitting back in my chair.
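
A rough way to see why the distance matters, sketched in Python with illustrative, assumed sound levels rather than anything measured on our call: the direct sound of a voice drops about six decibels each time its distance from the microphone doubles, while the room's ambient noise stays roughly constant, so a closer mic buys a better ratio of voice to everything else.

```python
import math

def direct_level_db(level_at_1m_db: float, distance_m: float) -> float:
    """Free-field level of the direct sound at a given distance.

    Follows the inverse-distance law: roughly -6 dB per doubling of distance.
    """
    return level_at_1m_db - 20 * math.log10(distance_m / 1.0)

# Illustrative, assumed numbers: a voice measuring 60 dB SPL at one meter,
# in a room with a steady 40 dB SPL of ambient noise.
VOICE_AT_1M_DB = 60.0
ROOM_NOISE_DB = 40.0

for distance in (0.05, 0.15, 0.60, 1.80):   # headset mic ... laptop across the desk
    voice = direct_level_db(VOICE_AT_1M_DB, distance)
    margin = voice - ROOM_NOISE_DB          # how far the voice sits above the noise
    print(f"{distance:>5.2f} m: voice {voice:5.1f} dB, {margin:+5.1f} dB above the noise")
```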

The second factor is the room’s reverberation, which is determined by its physical volume and how absorptive its contents are. Your voice is never just your voice; even during an in-person conversation, one’s words take on the character of one’s environment. “The sound coming in my ears when you talk to me also comes from thousands of other directions as it bounces around the room,” Kyriakakis explained. “Our brain is kind of constantly analyzing that to get a sense of, Okay, we’re in my son’s bedroom versus the racquetball court.” A room with a lot of reflective surfaces, he said, would make me sound as if I were in the shower. Rugs, curtains, blankets, my sweatshirt—anything plush—would help to decrease how much my voice was rattling around, improving the fidelity of my transmission. Even people, being vessels of water, can absorb reverberations. This, Kyriakakis said, is why some orchestras offer cheap tickets to rehearsals: by filling up the room, they give the musicians a more accurate sense of what the auditorium will sound like on opening night. Concert-hall acousticians generally design with the assumption that an audience will be attending, though many concert halls’ padded seats are absorptive enough to make the sonic difference between a full house and a poorly attended performance negligible.
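
The trade-off between a room's volume and its absorption can be put in rough numbers with Sabine's century-old reverberation formula, which Kyriakakis didn't invoke but which underlies this kind of reasoning: the time sound takes to decay by sixty decibels is proportional to the room's volume divided by its total absorption. A small Python sketch, with made-up surface areas and absorption coefficients for a bedroom, shows how much a few soft furnishings shorten the ring:

```python
def rt60_sabine(volume_m3: float, surfaces: list[tuple[float, float]]) -> float:
    """Sabine's reverberation time: RT60 ≈ 0.161 * V / A,
    where A is the sum of (surface area * absorption coefficient)."""
    total_absorption = sum(area * alpha for area, alpha in surfaces)
    return 0.161 * volume_m3 / total_absorption

# Illustrative numbers for a roughly 3 m x 4 m x 2.5 m bedroom (30 cubic meters).
bare_room = [(47.0, 0.02),   # painted walls and ceiling
             (12.0, 0.05)]   # hardwood floor
furnished = bare_room + [(6.0, 0.6),   # rug
                         (4.0, 0.5),   # heavy curtains
                         (3.0, 0.7)]   # bed and blankets

print(f"bare room:       RT60 ≈ {rt60_sabine(30.0, bare_room):.2f} s")
print(f"with soft stuff: RT60 ≈ {rt60_sabine(30.0, furnished):.2f} s")
```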

At home, though, living in a padded room or crowding your roommates around you to improve the sound of a Skype date may not be ideal. Instead, Kyriakakis has fashioned his living room into an environment where audio can be its best self by using perforated, multi-tiered wall panels that resemble a cluster of skyscrapers seen from above; they precisely absorb and diffuse sound. “I have an understanding wife, and I can put things on the walls that are artistic but have properties that kill all reflections and reverberation,” he said. I would have loved to see this arrangement for myself; unfortunately, on the day that we FaceTimed, Kyriakakis’s dog and dishwasher had sent him searching for quiet in his son’s room. It turned out, though, that this may have made our conversation sound more realistic. His son’s room was likely closer in size and reverberation to my bedroom, where I’d called from. According to Kyriakakis, creating a sense of sonic intimacy online is a matter not just of clarity but of similarity. To maximize the feeling of being in the same room, callers should speak from similarly reverberant spaces.

To reach a microphone, your voice need travel only a few feet (or, preferably, inches). The longer, transformative journey across the Internet is still to come—a voyage across an ever-changing terrain either smoothed or roughened by network bandwidth. Stephen Casner, one of the early pioneers of audio and video transmission on Internet-like networks, told me that, to make the trip, your voice must be shrunk and chopped up, by what’s called a codec, into packets of sound. Each packet contains about twenty milliseconds of compressed audio—oohs, ahs, “s” sounds. It’s as if, instead of sending someone a written letter, you sent them a collection of sequenced postcards with individual syllables. These packets then zip to your conversational partners’ computers, where another codec remakes the sound just before it exits the speakers.
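
In spirit, the chopping-up is simple. The sketch below is a toy illustration in Python rather than any real codec: it slices raw audio at forty-eight thousand samples per second into twenty-millisecond frames and stamps each one with a sequence number, so that the far end can put the postcards back in order.

```python
from dataclasses import dataclass

SAMPLE_RATE = 48_000                                   # samples per second
FRAME_MS = 20                                          # one packet holds ~20 ms of audio
SAMPLES_PER_FRAME = SAMPLE_RATE * FRAME_MS // 1000     # 960 samples

@dataclass
class Packet:
    sequence: int         # lets the receiver reorder packets and spot gaps
    samples: list[float]  # in reality this payload would be compressed

def packetize(samples: list[float]) -> list[Packet]:
    """Slice a stream of audio samples into 20 ms packets."""
    packets = []
    for i in range(0, len(samples), SAMPLES_PER_FRAME):
        packets.append(Packet(sequence=i // SAMPLES_PER_FRAME,
                              samples=samples[i:i + SAMPLES_PER_FRAME]))
    return packets

# One second of silence becomes fifty 20 ms packets.
print(len(packetize([0.0] * SAMPLE_RATE)))   # -> 50
```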

Sometimes, packets go missing. If you’re streaming a movie, software can prepare for this eventuality by “buffering,” creating a cushion of a few seconds that allows for retransmission of the missing components. The movie may not continue to play until every necessary droplet of the stream has arrived. But to avoid inserting awkward delays into our conversations, real-time audio software must press on with only the slightest pause to check for straggling packets. If they don’t arrive, so be it.
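
The difference between a streaming player and a live call comes down to how long the receiver is willing to wait. Here is a minimal sketch of the real-time side, with invented names and a made-up deadline: the playout loop gives a late packet a few tens of milliseconds, then declares it lost and moves on.

```python
import time
from typing import Optional

PLAYOUT_DEADLINE_S = 0.04   # assumed: wait at most ~40 ms for a straggling packet

def next_frame(received: dict, sequence: int) -> Optional[bytes]:
    """Return the packet with this sequence number, or None if it misses its slot.

    A movie player would stall and wait for a retransmission; a call cannot,
    because the pause itself would become part of the conversation.
    """
    deadline = time.monotonic() + PLAYOUT_DEADLINE_S
    while time.monotonic() < deadline:
        if sequence in received:
            return received.pop(sequence)   # arrived in time: hand it to the decoder
        time.sleep(0.002)                   # check again in a couple of milliseconds
    return None                             # too late: treat it as lost
```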

The question, then, is what is to be done to fill in the gaps. Videoconferencing technologies use voice codecs, which Timothy Terriberry, an engineer at Mozilla, told me are designed specifically to replicate the human vocal tract. (These human-centric algorithms are why instrumental music played through video chat can sound so terrible, Terriberry said.) We spoke through a voice-only chat on Zoom, which sometimes uses the protocol for audio codecs that Terriberry helped create. If a voice codec encounters what it thinks is a missing vowel, it may read what came before and after for clues. It then extends or inserts a flat tone, its best guess about how a human voice would fill the space. This can lead to an Auto-Tune-like effect, turning us into temporary T-Pains. Fricatives—the exhaled consonants in “fridge” and “thunk”—are especially challenging for voice codecs. They’re shorter and less repetitive, and therefore more likely to be lost. They’re also very difficult for computers to impersonate, which is one reason that a person’s video-chat voice often sounds like it’s briefly hissing or chirping.
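
Real concealment algorithms model the vocal tract; the toy Python version below, with assumed names and frame sizes, simply stretches the previous frame across the hole and fades it down so the guess doesn't ring on. The repeat is a passable stand-in for a sustained vowel and a poor one for a hiss, which is roughly the asymmetry Terriberry described.

```python
def conceal(previous_frame: list[float], fade: float = 0.8) -> list[float]:
    """Fill a missing 20 ms frame by repeating the last one, a little quieter.

    A sustained vowel is nearly periodic, so the repeat is a plausible guess;
    a fricative's hiss is not, which is why those gaps are harder to hide.
    """
    return [sample * fade for sample in previous_frame]

def reassemble(frames: list) -> list[float]:
    """Stitch received frames together, concealing any that went missing (None)."""
    output, last_good = [], [0.0] * 960   # 960 samples = 20 ms at 48 kHz
    for frame in frames:
        if frame is None:                 # the packet never arrived
            frame = conceal(last_good)
        else:
            last_good = frame
        output.extend(frame)
    return output
```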

After talking with Terriberry, I started listening for these quirks. When my two brothers and I caught up, and their voices occasionally cut out or sounded a touch robotic, I appreciated what the system had attempted. A listening device less precise than my ear had picked up a sound that had then got lost; to fill the void, a chorus of technologies had hustled to conjure imitations of my brothers. In a way, knowing about all that hidden effort made the imperfections more relatable than frustrating; the software’s little mistakes almost seemed like the sort that, if I’d made them, my brothers would have playfully mocked. Their video-chat voices sounded a little more human as they flowed into my room, bit by bit.