Phantom X 3.2 Benchmark — English

Methodology

Blind-tested against every real-time leader

To see how we stack up against the industry's top real-time engines, we ran a rigorous blind listening study in English, comparing Phantom X 3.2 with Inworld, Hume, Async, and ElevenLabs. Linguistic experts conducted thousands of blind pairwise comparisons. The results show Phantom X 3.2 sets a new bar for expressivity and quality at exceptionally low latency.

Result

Phantom X 3.2 tied for first place

We benchmarked English specifically — the most saturated language in TTS, where every major model has been heavily optimized and quality gaps between leaders are vanishingly small. Expressive performance was our primary metric, judged on sound quality, prosody, intonation, and absence of artifacts.

ELO Scores — English

1. Inworld TTS 1.5-max: 1549 (#1)
2. Deepdub Phantom X 3.2: 1545 (tied #1)
3. Hume Octave: 1498
4. Async Flash v1.0: 1493
5. ElevenLabs Turbo v2.5: 1416

Average ELO rating · higher is better
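To make the rating gaps concrete, the sketch below converts a rating difference into an expected preference rate. It assumes the standard Elo logistic formula with a scale of 400; the page does not say which Elo variant or scale was used, so the function name, the scale, and the resulting percentages are illustrative assumptions rather than the study's own computation.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected share of pairwise wins for A over B under the standard Elo model.

    Assumption: logistic curve with a 400-point scale (not confirmed by the source).
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# Average ratings as listed in the chart above.
ratings = {
    "Inworld TTS 1.5-max": 1549,
    "Deepdub Phantom X 3.2": 1545,
    "Hume Octave": 1498,
    "Async Flash v1.0": 1493,
    "ElevenLabs Turbo v2.5": 1416,
}

reference = "Deepdub Phantom X 3.2"

# Expected preference for Phantom X 3.2 against each other model.
expected = {
    name: elo_expected_score(ratings[reference], rating)
    for name, rating in ratings.items()
    if name != reference
}

for name, share in expected.items():
    print(f"Phantom X 3.2 vs {name}: {share:.1%} expected preference")
```

Under this assumption, a 4-point gap is close to a coin flip, while the 129-point gap to the lowest-rated model implies a clear majority preference. The predicted shares land within roughly three percentage points of the head-to-head figures reported later on this page.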

Real-time latency

Real-time eTTS: 2× faster than the top-ranked model

In real-time conversational AI, latency is the ultimate barrier to immersion. Our Time-To-First-Audio (TTFA) optimization ensures Phantom X 3.2 responds before the human listener can perceive a delay — ~125 ms vs. Inworld TTS 1.5-max at ~250 ms.

Time-To-First-Audio · ms · lower is better

1. Deepdub Phantom X 3.2: 125 ms
2. Async Flash v1.0: 166 ms
3. Hume Octave: 200 ms
4. Inworld TTS 1.5-max: 250 ms
5. ElevenLabs Turbo v2.5: 300 ms

Time-to-first-audio (ms) · measured under identical conditions
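The "2× faster" figure follows directly from the ratio of the listed TTFA values. The short sketch below computes that ratio for each model; the dictionary and variable names are illustrative, and the numbers are the approximate values from the chart above.

```python
# Time-to-first-audio in milliseconds, as listed in the chart above.
ttfa_ms = {
    "Deepdub Phantom X 3.2": 125,
    "Async Flash v1.0": 166,
    "Hume Octave": 200,
    "Inworld TTS 1.5-max": 250,
    "ElevenLabs Turbo v2.5": 300,
}

baseline_ms = ttfa_ms["Deepdub Phantom X 3.2"]

# How many times longer each model takes to produce first audio than the baseline.
ttfa_ratio = {name: ms / baseline_ms for name, ms in ttfa_ms.items()}

for name, ratio in sorted(ttfa_ratio.items(), key=lambda item: item[1]):
    print(f"{name}: {ttfa_ms[name]} ms ({ratio:.2f}x the Phantom X 3.2 TTFA)")
```

A ratio of 2.00 for Inworld TTS 1.5-max corresponds to the "2× faster" headline; the other models fall between roughly 1.3 and 2.4.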

Head-to-head matchups

Listeners prefer Phantom X 3.2 over three rivals, with a near-even split against Inworld

Our model finished in a statistical tie for first place and decisively outperformed every other competitor tested. The result: Deepdub Phantom X 3.2 sits at the very top of the industry on the only benchmark that matters — what real people actually prefer to hear.

Phantom X 3.2: 65% vs. ElevenLabs Turbo v2.5: 35%

Phantom X 3.2: 57.9% vs. Async Flash v1.0: 42.1%

Phantom X 3.2: 57.8% vs. Hume Octave: 42.2%

Phantom X 3.2: 49.4% vs. Inworld TTS 1.5-max: 50.6%

Blind pairwise listener preference · English
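One way to read "statistical tie" is that the observed preference share is within its margin of error of 50%. The sketch below uses a normal-approximation 95% interval for a binomial share. The page reports "thousands" of comparisons but not the count per matchup, so `n_comparisons = 1000` is a purely hypothetical value, and the study's actual significance test is not stated.

```python
import math


def preference_margin(share: float, n_comparisons: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation confidence interval for a preference share."""
    return z * math.sqrt(share * (1.0 - share) / n_comparisons)


def is_statistical_tie(share: float, n_comparisons: int, z: float = 1.96) -> bool:
    """True when a 50/50 split lies inside the interval around the observed share."""
    return abs(share - 0.5) <= preference_margin(share, n_comparisons, z)


# Hypothetical sample size per matchup; the source does not report it.
n_comparisons = 1000

# Phantom X 3.2 preference shares from the head-to-head results above.
phantom_share = {
    "ElevenLabs Turbo v2.5": 0.65,
    "Async Flash v1.0": 0.579,
    "Hume Octave": 0.578,
    "Inworld TTS 1.5-max": 0.494,
}

for rival, share in phantom_share.items():
    margin = preference_margin(share, n_comparisons)
    verdict = "tie" if is_statistical_tie(share, n_comparisons) else "clear preference"
    print(f"vs {rival}: {share:.1%} +/- {margin:.1%} -> {verdict}")
```

The approximation treats each comparison as independent, which may not hold if the same listeners or sentences recur, so the intervals are best read as a rough guide. A larger real sample would narrow them.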

Expressivity

What makes Phantom X 3.2 actually emote

Two systems work together: a wide emotion library that you control inline in the script, and a paralinguistic layer that adds the small unconscious signals listeners associate with real speech.

Emotional Layering

80+ emotion styles, from supportive to malicious. Inline tags let a single line shift tone mid-sentence, without sounding stitched together.

Paralinguistic Cues

Natural breath, pauses, and micro-shifts in tone. The small, unconscious signals that make speech sound spoken, not generated.

User

Hey — my flight just got cancelled and I have a meeting in Berlin tomorrow morning. I'm freaking out.

Agent (Phantom X 3.2)

[supportive] Hey, you've got this; we'll figure it out together. [focused] I'm pulling alternates now. There's a 6:40 to Frankfurt with a connection that lands you at 09:15. [reassuring] That gets you in with time to spare.

~125 ms TTFA · 3 emotion styles in one turn
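To illustrate how inline emotion tags could divide a single turn into differently voiced segments, here is a minimal parsing sketch. The square-bracket tag notation, the `neutral` default, and the `split_by_emotion` helper are all hypothetical; the page does not document the actual tag syntax or any Phantom X 3.2 API, so this shows only the general idea.

```python
import re
from typing import List, Tuple

# Hypothetical tag notation: a lowercase emotion name in square brackets, e.g. "[focused]".
TAG_PATTERN = re.compile(r"\[([a-z_]+)\]")


def split_by_emotion(script: str, default: str = "neutral") -> List[Tuple[str, str]]:
    """Split a tagged script into (emotion, text) segments.

    Text before the first tag is assigned the default emotion; each tag applies
    to the text that follows it, up to the next tag.
    """
    segments: List[Tuple[str, str]] = []
    emotion = default
    position = 0
    for match in TAG_PATTERN.finditer(script):
        text = script[position:match.start()].strip()
        if text:
            segments.append((emotion, text))
        emotion = match.group(1)
        position = match.end()
    tail = script[position:].strip()
    if tail:
        segments.append((emotion, tail))
    return segments


line = (
    "[supportive] Hey, you've got this. "
    "[focused] I'm pulling alternates now. "
    "[reassuring] That gets you in with time to spare."
)

for emotion, text in split_by_emotion(line):
    print(f"{emotion:>11}: {text}")
```

Each segment could then be passed to a synthesis call with its emotion label. How tone is blended across segment boundaries is up to the model and is not covered by this sketch.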

Years of AI dubbing for the world's top streaming platforms set our standard for what 'human' sounds like. Phantom X 3.2 brings that gold standard to the agentic world: tied for #1 on expressivity among the world's top real-time TTS models, at ~125 ms latency, imperceptible to the listener.

Moshe Michelashvili

VP Research · Deepdub

Top-tier expressive voice

100+ languages and dialects

125 ms real-time latency

Proven at scale

From premium dubbing to live conversations

Deepdub is the gold standard in AI dubbing for premium media. For years, our eTTS has powered Hollywood-grade localization for the world's top streaming platforms, across thousands of drama series, feature films, and documentaries. Phantom X 3.2 brings that same gold standard to the agentic world, giving developers a foundation to deploy expressive, localized speech at scale.

300K+ minutes live on the world's top streaming platforms

Millions of calls powered by Deepdub AI Voice

Ready when you are

Ready to hear Phantom X 3.2?

Drop into the playground — type a line, pick a voice, hear it in any of 100+ languages.
