In a blind English-language test against the top real-time TTS models, Phantom X 3.2 landed in the top tier — tied for #1 on expressivity, at ~125 ms latency.
To see how Phantom X 3.2 compares with the industry's leading real-time engines, we ran a blind listening study in English against Inworld, Hume, Async, and ElevenLabs. Linguistic experts made thousands of blind pairwise comparisons. The results place Phantom X 3.2 in the top tier for expressivity and quality, at exceptionally low latency.
We benchmarked English specifically — the most saturated language in TTS, where every major model has been heavily optimized and quality gaps between leaders are vanishingly small. Expressive performance was our primary metric, judged on sound quality, prosody, intonation, and absence of artifacts.
In real-time conversational AI, latency is a major barrier to immersion. Our Time-To-First-Audio (TTFA) optimization brings Phantom X 3.2's first audio to ~125 ms, compared with ~250 ms for Inworld TTS 1.5-max, so responses begin with little perceptible delay.
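For readers who want to check TTFA on their own stack, here is a minimal sketch of one way to measure it: time the gap between starting a request and receiving the first non-empty audio chunk. The `fake_tts_stream` generator is a hypothetical stand-in for a streaming client, not Deepdub's API, and this is not necessarily how the figures above were measured.

```python
import time
from typing import Callable, Iterable, Iterator


def time_to_first_audio(stream_factory: Callable[[], Iterable[bytes]]) -> float:
    """Return seconds from request start to the first non-empty audio chunk."""
    start = time.perf_counter()
    for chunk in stream_factory():
        if chunk:  # skip empty keep-alive chunks, if any
            return time.perf_counter() - start
    raise RuntimeError("stream ended without producing audio")


def fake_tts_stream(delay_s: float = 0.05) -> Iterator[bytes]:
    """Hypothetical stand-in for a real streaming TTS client (assumption)."""
    time.sleep(delay_s)   # simulated network + synthesis delay
    yield b"\x00" * 320   # first audio chunk (silence, for illustration)
    yield b"\x00" * 320   # later chunks do not affect TTFA


ttfa = time_to_first_audio(lambda: fake_tts_stream(0.05))
print(f"TTFA: {ttfa * 1000:.0f} ms")
```

To measure a real service, replace `fake_tts_stream` with a function that opens the streaming request and yields audio bytes as they arrive. Reporting a median over many runs is more informative than a single measurement.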
Our model finished in a statistical tie for first place and decisively outperformed the remaining competitors tested. As a result, Deepdub Phantom X 3.2 sits at the top of the industry on the benchmark that matters most: what listeners actually prefer to hear.
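To make "statistical tie" concrete, here is a minimal sketch of one common way to read pairwise results: compute a confidence interval on one model's win rate and call it a tie when the interval includes 50%. The Wilson interval is our choice for illustration, and the counts are made up. The study's actual statistical method and data are not described here.

```python
import math
from typing import Tuple


def wilson_interval(wins: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a win rate (z=1.96 gives roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = wins / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


def is_statistical_tie(wins: int, total: int, z: float = 1.96) -> bool:
    """True when a 50% win rate lies inside the interval."""
    low, high = wilson_interval(wins, total, z)
    return low <= 0.5 <= high


# Illustrative counts only, not data from the study.
print(is_statistical_tie(510, 1000))  # close head-to-head
print(is_statistical_tie(600, 1000))  # clear preference
```

With these illustrative numbers, 510 wins out of 1,000 comparisons cannot be distinguished from an even split, while 600 out of 1,000 is a clear preference.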
Two systems work together to deliver this expressivity: a broad emotion library that you control inline in the script, and a paralinguistic layer that adds the small, unconscious signals listeners associate with real speech.
Deepdub is the gold standard in AI dubbing for premium media. For years, our eTTS has powered Hollywood-grade localization for the world's top streaming platforms, across thousands of drama series, feature films, and documentaries. Phantom X 3.2 brings that same quality to agentic applications, giving developers a foundation for deploying expressive, localized speech at scale.
Drop into the playground — type a line, pick a voice, hear it in any of 100+ languages.