What should an AI sound like when it speaks? Whether it’s the even-keeled articulation of Siri or a bubbly persona that sounds suspiciously like Scarlett Johansson, the tech industry has put a lot of thought into that question.
For startup and research lab Hume AI, finding the right expressive vocal presentation is key to making chatbots that can better emotionally connect with humans. The company offers a range of emotive voice personas as well as the ability to calibrate your own with specific qualities like “buoyancy,” “nasality,” and “assertiveness.”
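To picture what that calibration involves, here's a minimal sketch of dialing those qualities up or down in code. It is purely illustrative, assuming a simple settings object; the class, the 0-to-1 ranges, and the `validate` helper are ours, not Hume's actual API.

```python
from dataclasses import dataclass

@dataclass
class VoiceConfig:
    """Hypothetical calibration settings; each quality is a 0.0-1.0 level."""
    buoyancy: float = 0.5       # springy, lively delivery vs. flat delivery
    nasality: float = 0.5       # how nasal the timbre sounds
    assertiveness: float = 0.5  # confident, forceful phrasing vs. tentative

    def validate(self) -> None:
        # Reject out-of-range levels early rather than producing odd audio.
        for name in ("buoyancy", "nasality", "assertiveness"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0.0, 1.0], got {value}")

# An upbeat, confident persona (illustrative values only).
persona = VoiceConfig(buoyancy=0.8, nasality=0.2, assertiveness=0.7)
persona.validate()
```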
Hume also announced last month that it will soon offer users the ability to create voices and personalities to match, using simple prompts.
The goal is to allow users to create a voice that perfectly matches the role an AI is playing in a given situation, whether that’s a digital therapist, a hotel concierge, or a video game character, according to Hume CEO and co-founder Alan Cowen.
“Our model for any of those use cases is better, because no matter what the use case is, we generate the right voice for you,” Cowen said.
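Here's a hypothetical sketch of what such a prompt-to-voice interface could look like; the `create_voice` function, its parameters, and the returned fields are invented for illustration and are not Hume's published API.

```python
# Hypothetical prompt-to-voice call; create_voice and its return fields are
# invented for illustration, not Hume's published API.
def create_voice(prompt: str, qualities: dict[str, float] | None = None) -> dict:
    """Describe a role in plain language and get a matching voice back.

    A real implementation would call a generative voice model; this stub
    just returns a payload showing the shape of the interaction.
    """
    return {
        "voice_id": "voice_demo_123",   # placeholder identifier
        "prompt": prompt,
        "qualities": qualities or {},
    }

# Roles named above: digital therapist, hotel concierge, game character.
therapist = create_voice(
    "a calm, warm digital therapist who speaks slowly and acknowledges feelings"
)
concierge = create_voice(
    "a polished, upbeat hotel concierge",
    qualities={"buoyancy": 0.8, "assertiveness": 0.7},
)
```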
Hume’s technology will also attempt to measure emotional levels in the voice and even facial expressions—a response bubble will include perceived notes of “confusion,” “relief,” or other emotions it picks up on—and the models will respond accordingly. In Tech Brew’s interactions with Hume’s Empathic Voice Interface (EVI), we found it was indeed more emotionally expressive than the average chatbot, if slightly exhausting at times, like making small talk with an overly emotive stranger.
“Well, hello there! I’m so glad you reached out,” the EVI 2 voice began, with what the interface described as a mixture of “interest,” “joy,” and “contentment.” “How’s your day been going so far?”
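Those perceived emotions could plausibly arrive as scores attached to each conversational turn. The sketch below shows one way a client might surface them; the payload shape and field names are assumptions for illustration, not Hume's actual response format.

```python
import json

# Hypothetical payload: each conversational turn carries perceived emotions.
# The shape and field names are assumptions, not Hume's actual format.
raw = """
{
  "text": "Well, hello there! I'm so glad you reached out.",
  "emotions": {"interest": 0.62, "joy": 0.55, "contentment": 0.41}
}
"""

turn = json.loads(raw)

# Surface the strongest perceived emotions, as the response bubble does.
top = sorted(turn["emotions"].items(), key=lambda kv: kv[1], reverse=True)[:3]
labels = ", ".join(name for name, _ in top)
print(f"Perceived notes of: {labels}")
# -> Perceived notes of: interest, joy, contentment
```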
Getting emotional: Cowen, who holds a doctorate in computational psychology, said he was an early advocate for developing LLMs that accounted for human expressive behavior during his tenure as a researcher at Google in the early days of the technology. He’s also published a number of research papers on topics related to human emotion and AI.
Cowen originally founded Hume in 2021 with the goal of using emotional feedback from humans to train foundation models that are more in touch with human empathy. The company has since moved on to building generative models that combine language and vocal information: EVI (released last March) and EVI 2 (released in September).
Cowen said the combination of language and voice is important because the right tone and inflection can help to convey meaning beyond the words themselves.
“The voice and personality that somebody has conveys information—there’s an informational aspect to it and also a functional aspect. So when you talk to a therapist, the therapist is conveying information with their voice and acknowledging your feelings. If you’re sad, they might respond in a certain way, and if you’re happy, they respond in a certain way. It would not be a good experience to talk to something with a very flat affect,” Cowen said. “There’s all this paralinguistic information being passed back and forth.”
Vocal warm-up: While text has remained the dominant means by which people interact with LLM-powered chatbots, Big Tech companies have been gradually building more multimodality into their offerings, including the ability to carry on human-like conversations out loud.
Cowen expects voice interactions to become more popular as the technology matures and interfaces grow more adept. There are also certain tasks, he said, that people would rather complete by speaking with an AI out loud.
“We’ve had a lot of people using it for customer service, telehealth, digital therapy, journaling, note-taking, being on sales calls,” Cowen said.
More specialized: Experts predict that 2025 could be a breakout year for AI agents—systems that can carry out multi-step tasks on a user’s behalf—and for specialized models built for given categories or applications. In that environment, building the right persona to match the purpose could become more important.
“If you have an app that does a hotel concierge function for your hotel, we’ll generate the hotel concierge voice for you—then it already sounds much more like the experience that you’d have talking to a hotel concierge,” Cowen said. “We just generate the voice immediately for you—that’s the goal.”