Phantom X 3.2 · Top-ranked expressive TTS among real-time models

Phantom X 3.2 Benchmark — English

In a blind English-language test against the top real-time TTS models, Phantom X 3.2 landed in the top tier — tied for #1 on expressivity, at ~125 ms latency.

Updated · May 2026

Methodology

Blind-tested against every real-time leader

To see how we stack up against the industry's top real-time engines, we ran a rigorous blind listening study in English, comparing Phantom X 3.2 with Inworld, Hume, Async, and ElevenLabs. Linguistic experts conducted thousands of blind pairwise comparisons. The results show Phantom X 3.2 sets a new bar for expressivity and quality at exceptionally low latency.


Result

Phantom X 3.2 ranked at the top

We benchmarked English specifically — the most saturated language in TTS, where every major model has been heavily optimized and quality gaps between leaders are vanishingly small. Expressive performance was our primary metric, judged on sound quality, prosody, intonation, and absence of artifacts.

ELO Scores — English

Inworld TTS 1.5-max (#1): 1549
Deepdub Phantom X 3.2 (tied #1): 1545
Hume Octave: 1498
Async Flash v1.0: 1493
ElevenLabs Turbo v2.5: 1416
Average ELO rating · higher is better
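
To make the rating gaps easier to read, here is a minimal sketch that converts them into expected head-to-head preference rates. It assumes the standard logistic Elo formula with a 400-point scale; the benchmark does not state how its ratings were fitted, so treat the function and its `scale` parameter as illustrative assumptions rather than the study's actual method.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected share of pairwise wins for A over B under the standard Elo model.

    ASSUMPTION: standard logistic Elo with a 400-point scale; the benchmark
    does not say which variant it used.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# Ratings copied from the ELO Scores list above.
ratings = {
    "Inworld TTS 1.5-max": 1549,
    "Deepdub Phantom X 3.2": 1545,
    "Hume Octave": 1498,
    "Async Flash v1.0": 1493,
    "ElevenLabs Turbo v2.5": 1416,
}

phantom = ratings["Deepdub Phantom X 3.2"]
expected = {
    name: elo_expected_score(phantom, rating)
    for name, rating in ratings.items()
    if name != "Deepdub Phantom X 3.2"
}

for name, share in expected.items():
    print(f"Phantom X 3.2 vs {name}: expected preference {share:.1%}")
```

Under this assumption, a 4-point gap corresponds to a near-even split of roughly 49% to 51%, while the 129-point gap to the lowest-rated model corresponds to a preference of roughly two-thirds.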

Real-time latency

Real-time eTTS: 2× faster than the top-ranked model

In real-time conversational AI, latency is the main barrier to immersion. Phantom X 3.2 is optimized for Time-To-First-Audio (TTFA): it starts speaking in ~125 ms, half the ~250 ms measured for Inworld TTS 1.5-max, the model ranked #1 on ELO.

Time-To-First-Audio · ms · lower is better

1. Deepdub Phantom X 3.2: 125 ms
2. Async Flash v1.0: 166 ms
3. Hume Octave: 200 ms
4. Inworld TTS 1.5-Max: 250 ms
5. ElevenLabs Turbo v2.5: 300 ms
Time-to-first-audio (ms) · measured under identical conditions
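
For readers who want to reproduce this kind of measurement, here is a minimal sketch of timing TTFA against any streaming TTS client. The benchmark's test harness is not described here, so `measure_ttfa_ms` and `fake_tts_stream` are hypothetical names; the fake stream only simulates a delay and is not a real vendor API.

```python
import time
from typing import Callable, Iterable, Iterator


def measure_ttfa_ms(start_stream: Callable[[], Iterable[bytes]]) -> float:
    """Milliseconds from issuing the request to the first non-empty audio chunk."""
    t0 = time.perf_counter()
    for chunk in start_stream():
        if chunk:  # ignore empty keep-alive chunks
            return (time.perf_counter() - t0) * 1000.0
    raise RuntimeError("stream ended before any audio arrived")


def fake_tts_stream(first_chunk_delay_s: float = 0.05) -> Iterator[bytes]:
    """HYPOTHETICAL stand-in for a streaming TTS client; not a real API."""
    time.sleep(first_chunk_delay_s)  # simulated synthesis + network delay
    yield b"\x00" * 320              # first audio chunk (silence)
    yield b"\x00" * 320


ttfa = measure_ttfa_ms(fake_tts_stream)
print(f"TTFA: {ttfa:.0f} ms")
```

In a real comparison, each provider's streaming call would replace `fake_tts_stream`, and the measurement would be repeated many times from the same network location with the same input text, since a single sample is dominated by jitter.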

Head-to-head matchups

Listeners prefer Phantom X 3.2 in three of four matchups

Phantom X 3.2 finished in a statistical tie for first place with Inworld TTS 1.5-Max (49.4% vs. 50.6%) and was preferred over every other model tested, by margins ranging from 57.8% to 65%. The result: Deepdub Phantom X 3.2 sits at the very top of the industry on the benchmark that matters most, what real people actually prefer to hear.

Phantom X 3.2 vs. ElevenLabs Turbo v2.5: 65% to 35%
Phantom X 3.2 vs. Async Flash v1.0: 57.9% to 42.1%
Phantom X 3.2 vs. Hume Octave: 57.8% to 42.2%
Phantom X 3.2 vs. Inworld TTS 1.5-Max: 49.4% to 50.6%
Blind pairwise listener preference · English
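
One way to judge whether a preference split counts as a statistical tie is to check whether its confidence interval contains 50%. The sketch below uses a normal-approximation 95% interval. The per-matchup sample size is not published on this page, so `N_PER_MATCHUP = 1000` is a purely hypothetical placeholder, and the conclusions would shift with the real count.

```python
import math
from typing import Tuple


def win_rate_interval(win_rate: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Normal-approximation confidence interval for a pairwise win rate."""
    half_width = z * math.sqrt(win_rate * (1.0 - win_rate) / n)
    return win_rate - half_width, win_rate + half_width


def is_statistical_tie(win_rate: float, n: int) -> bool:
    """True when the interval contains 0.5, i.e. no clear preference."""
    low, high = win_rate_interval(win_rate, n)
    return low <= 0.5 <= high


# HYPOTHETICAL sample size: the page only says "thousands" of comparisons overall.
N_PER_MATCHUP = 1000

# Phantom X 3.2 win rates copied from the matchups above.
matchups = {
    "ElevenLabs Turbo v2.5": 0.650,
    "Async Flash v1.0": 0.579,
    "Hume Octave": 0.578,
    "Inworld TTS 1.5-Max": 0.494,
}

ties = {name: is_statistical_tie(rate, N_PER_MATCHUP) for name, rate in matchups.items()}
for name, tie in ties.items():
    print(f"vs {name}: {'statistical tie' if tie else 'clear preference'}")
```

With the assumed 1,000 comparisons per matchup, only the Inworld result falls inside the tie band; the other three intervals sit entirely above 50%.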

Expressivity

What makes Phantom X 3.2 actually emote

Two systems work together: a wide emotion library that you control inline in the script, and a paralinguistic layer that adds the small unconscious signals listeners associate with real speech.

Emotional Layering
80+ emotion styles, from supportive to malicious. Inline tags let a single line shift tone mid-sentence, without sounding stitched together.
Paralinguistic Cues
Natural breath, pauses, and micro-shifts in tone. The small, unconscious signals that make speech sound spoken, not generated.
User: Hey, my flight just got cancelled and I have a meeting in Berlin tomorrow morning. I'm freaking out.
Agent (Phantom X 3.2): [supportive] Hey, you've got this. We'll figure it out together. [focused] I'm pulling alternates now. There's a 6:40 to Frankfurt with a connection that lands you at 09:15. [reassuring] That gets you in with time to spare.
~125 ms TTFA · 3 emotion shifts in one turn
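
To illustrate how inline emotion tags can drive mid-sentence tone shifts, here is a minimal sketch that splits a tagged script into (emotion, text) segments. The actual Phantom X tag syntax is not shown on this page, so the square-bracket form `[supportive]`, the `split_by_emotion` helper, and the `neutral` default are all assumptions made for illustration.

```python
import re
from typing import List, Tuple

# ASSUMPTION: tags are written as [emotion]; the real syntax may differ.
TAG_PATTERN = re.compile(r"\[(\w+)\]\s*")


def split_by_emotion(script: str, default: str = "neutral") -> List[Tuple[str, str]]:
    """Split a script into (emotion, text) segments at each inline tag."""
    segments: List[Tuple[str, str]] = []
    emotion = default
    pos = 0
    for match in TAG_PATTERN.finditer(script):
        text = script[pos:match.start()].strip()
        if text:  # text before this tag belongs to the previous emotion
            segments.append((emotion, text))
        emotion = match.group(1)
        pos = match.end()
    tail = script[pos:].strip()
    if tail:
        segments.append((emotion, tail))
    return segments


script = (
    "[supportive] Hey, you've got this. "
    "[focused] I'm pulling alternates now. "
    "[reassuring] That gets you in with time to spare."
)
segments = split_by_emotion(script)
for emotion, text in segments:
    print(f"{emotion:>10}: {text}")
```

Each segment could then be passed to the synthesizer with its emotion as a style parameter; text that appears before the first tag falls back to the default emotion.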

Years of AI dubbing for the world's top streaming platforms set our standard for what 'human' sounds like. Phantom X 3.2 brings that gold standard to the agentic world: tied for #1 on expressivity among the world's top real-time TTS models, at ~125 ms latency, imperceptible to the listener.
Moshe Michelashvili
VP Research · Deepdub
Top-tier expressive voice
100+ languages and dialects
125 ms real-time latency

Proven at scale

From premium dubbing to live conversations

Deepdub is the gold standard in AI dubbing for premium media. For years, our eTTS has powered Hollywood-grade localization for the world's top streaming platforms, across thousands of drama series, feature films, and documentaries. Phantom X 3.2 brings that same gold standard to the agentic world, giving developers a foundation to deploy expressive, localized speech at scale.

300K+ minutes live on the world's top streaming platforms
Millions of calls powered by Deepdub AI Voice
Ready when you are

Ready to hear Phantom X 3.2?

Drop into the playground — type a line, pick a voice, hear it in any of 100+ languages.

Try it in Playground