Submitted to Interspeech 2026

RobustSpeechFlow

Learning Robust Text-to-Speech Trajectories via Augmentation-based Contrastive Flow Matching

Anonymous Authors

Keywords: text-to-speech, zero-shot TTS, flow matching, alignment robustness, contrastive learning
1.38% WER on Seed-TTS-eval: 1.44 → 1.38 (best across all systems)
0.06B parameters: 5–25× smaller than competitors
27% CER reduction (EN): 0.48 → 0.35 at NFE=24
30% CER reduction (KO): 0.81 → 0.57 at NFE=24

While flow-matching text-to-speech (TTS) achieves strong zero-shot speaker similarity and naturalness, it remains susceptible to content fidelity issues, particularly skip and repeat errors caused by imperfect alignment. We propose RobustSpeechFlow, a training strategy that improves alignment robustness by extending contrastive flow matching with length-preserving repeat and skip latent augmentations. Requiring no external aligners or preference data, our method directly penalizes realistic failure modes and integrates readily into existing pipelines. On Seed-TTS-eval, it reduces the word error rate (WER) from 1.44% to 1.38% using only 0.06B parameters. On our ZERO500 benchmark, it delivers consistent intelligibility improvements across diverse speaker and prosody conditions; at NFE=24, it reduces English character error rate (CER) from 0.48% to 0.35% and Korean CER from 0.81% to 0.57%.
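
As a rough illustration of the kind of objective described above, the following NumPy sketch computes a contrastive flow-matching loss for one example: the usual regression toward the velocity of the clean latent, minus a weighted term that pushes the prediction away from the velocity implied by a negative (for instance, repeat- or skip-augmented) latent. The linear interpolation path, the shared noise sample, the weight `lam`, and the function name are all assumptions made for illustration; this page does not specify the exact loss.

```python
import numpy as np

def contrastive_fm_loss(v_pred, x1, x1_neg, x0, lam=0.05):
    """Illustrative contrastive flow-matching loss for a single example.

    v_pred : predicted velocity at x_t, shape (T, D)
    x1     : clean target latent, shape (T, D)
    x1_neg : negative latent (e.g. an augmented copy of x1), shape (T, D)
    x0     : noise sample used to build x_t, shape (T, D)
    lam    : weight of the repulsive term (hypothetical value)
    """
    # Assumed linear path x_t = (1 - t) * x0 + t * x1, whose velocity is x1 - x0.
    v_pos = x1 - x0
    # Velocity the model would need to predict to reach the negative latent.
    v_neg = x1_neg - x0
    attract = np.mean((v_pred - v_pos) ** 2)  # standard flow-matching term
    repel = np.mean((v_pred - v_neg) ** 2)    # distance to the failure mode
    return attract - lam * repel
```

When the negative equals the clean latent, the two terms coincide and the loss reduces to a rescaled flow-matching loss, so the repulsive term only matters where the augmentation actually changed the latent.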

Quantitative Evaluation

We evaluate on the public Seed-TTS-eval benchmark and our newly constructed ZERO500 multilingual benchmark.

Table 1: Performance on Seed-TTS-eval Benchmark

Model               Params  WER (%) ↓  SIM ↑
MegaTTS3            0.5B    2.79       0.77
Seed-TTSDiT         -       1.73       0.79
DiTAR               0.6B    1.69       0.74
MiniMax-Speech      -       1.65       0.69
F5-TTS              0.3B    2.00       0.67
CosyVoice3          1.5B    2.22       0.72
Spark-TTS           0.5B    3.14       0.57
OpenAudio S1-Mini   0.5B    1.94       0.55
IndexTTS2           1.5B    2.23       0.71
VibeVoice           1.5B    3.04       0.69
VoxCPM-Emilia       0.5B    2.34       0.68
VoxCPM              0.5B    1.85       0.73
Baseline            0.06B   1.44       0.60
ContrastiveFM       0.06B   1.41       0.60
RobustSpeechFlow    0.06B   1.38       0.60

(- = parameter count not listed.)

Table 2: Results on ZERO500 at 500k Steps (%)

Model             NFE  EN CER ↓  EN WER ↓  KO CER ↓  KO WER ↓
Baseline          12   0.55      1.25      0.93      8.46
Baseline          24   0.48      1.18      0.81      8.40
ContrastiveFM     12   0.41      1.10      0.77      7.92
ContrastiveFM     24   0.39      1.06      0.65      7.72
RobustSpeechFlow  12   0.43      1.14      0.57      7.59
RobustSpeechFlow  24   0.35      1.03      0.57      7.45
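
The headline CER reductions follow from the NFE=24 rows of the ZERO500 results above (Baseline vs. RobustSpeechFlow). A small sketch of the arithmetic; the helper name `relative_reduction` is illustrative and not from the paper:

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Relative error reduction, in percent of the baseline error."""
    return 100.0 * (baseline - improved) / baseline

# CER values at NFE=24: Baseline vs. RobustSpeechFlow.
en = relative_reduction(0.48, 0.35)
ko = relative_reduction(0.81, 0.57)

print(f"EN CER reduction: {en:.1f}%")  # EN CER reduction: 27.1%
print(f"KO CER reduction: {ko:.1f}%")  # KO CER reduction: 29.6%
```

Rounded to whole percentages, these give the 27% (EN) and 30% (KO) figures quoted at the top of the page.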

Figure 1: CER (%) over Training Steps on ZERO500

(a) ZERO500-en CER (%), NFE=12
(b) ZERO500-en CER (%), NFE=24
(c) ZERO500-ko CER (%), NFE=12
(d) ZERO500-ko CER (%), NFE=24

CER (%) over training steps on ZERO500. RobustSpeechFlow shows consistent improvement, especially on Korean and at NFE=24 where it reaches the lowest final CER. Legend: A = Baseline, B = ContrastiveFM, C = RobustSpeechFlow.

What Do Failure-Mode Negatives Sound Like?

RobustSpeechFlow constructs hard negatives by applying length-preserving repeat and skip augmentations to the ground-truth speech latent. Below we decode these augmented latents back to audio so you can hear what the model learns to avoid.

Repeat: A source span is copied to a different position in the utterance, overwriting the target region and simulating word or phrase repetition. The overall length stays the same.
Skip: A span is skipped by shifting the subsequent latent sequence forward to overwrite it, and the frames vacated at the tail are replaced with a silence latent. This simulates a local skip while preserving the global duration.
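
The two augmentations described above can be sketched on a latent of shape `(T, D)` as follows. This is a minimal illustration, not the authors' implementation: the function names, the way span positions are chosen, and the representation of the silence latent as a single `(D,)` frame are all assumptions.

```python
import numpy as np

def repeat_augment(z, src_start, dst_start, span):
    """Overwrite z[dst_start:dst_start+span] with a copy of z[src_start:src_start+span]."""
    out = z.copy()
    out[dst_start:dst_start + span] = z[src_start:src_start + span]
    return out  # same length as z

def skip_augment(z, skip_start, span, silence):
    """Drop `span` frames at skip_start, shift later frames forward, pad the tail."""
    out = z.copy()
    tail = z[skip_start + span:]                  # frames after the skipped region
    out[skip_start:skip_start + len(tail)] = tail  # shift forward over the gap
    out[skip_start + len(tail):] = silence        # fill vacated tail frames
    return out  # same length as z

# Toy latent: 8 frames, 1 dimension, frame i holds the value i.
z = np.arange(8, dtype=float).reshape(8, 1)
silence = np.array([-1.0])  # hypothetical silence latent

rep = repeat_augment(z, src_start=1, dst_start=4, span=2)
skp = skip_augment(z, skip_start=2, span=3, silence=silence)

print(rep[:, 0].tolist())  # [0.0, 1.0, 2.0, 3.0, 1.0, 2.0, 6.0, 7.0]
print(skp[:, 0].tolist())  # [0.0, 1.0, 5.0, 6.0, 7.0, -1.0, -1.0, -1.0]
```

In the toy output, frames 1 and 2 reappear at positions 4 and 5 for the repeat case, and frames 2 to 4 vanish for the skip case, while both results keep all 8 frames.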

Augmentation Example 1

Original Text (English)
"I would have won the Junior Olympics if not for those medaling kids."
Original: Clean ground-truth speech latent decoded to audio.
Repeat Augmented: A source span overwrites a different position, so it sounds like a word or phrase is repeated.
Skip Augmented: A span is removed by shifting later content forward; the tail is padded with silence.

Augmentation Example 2

Original Text (Korean)
"파이와 아이를 곱한 수로 이를 거듭제곱해서 이를 더하면 영이 된다. 끝없이 순환하는 수와 정체를 나타내지 않는 수가 간결한 궤적을 그리며 한 점에 착지하고 이곳에 일을 더하는 순간 세계는 평화로워진다."
Original: Clean ground-truth speech latent decoded to audio.
Repeat Augmented: A source span overwrites a different position, so it sounds like a word or phrase is repeated.
Skip Augmented: A span is removed by shifting later content forward; the tail is padded with silence.

Listening Samples

Compare synthesized speech across Baseline, ContrastiveFM, and RobustSpeechFlow. Each sample shows the ASR-transcribed text and WER/CER for objective comparison.
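
For readers who want to reproduce the per-sample scores, WER and CER can be computed as an edit distance between the input text and the ASR transcript, divided by the reference length. The sketch below is one plausible way to do this; the text normalization (lowercasing and stripping punctuation) and the helper names are assumptions, since this page does not state the exact scoring pipeline.

```python
import re

def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1]

def normalize(text):
    # Assumed normalization: lowercase, drop punctuation except apostrophes.
    return re.sub(r"[^\w\s']", "", text.lower())

def wer(ref, hyp):
    """Word error rate in percent."""
    r, h = normalize(ref).split(), normalize(hyp).split()
    return 100.0 * edit_distance(r, h) / len(r)

def cer(ref, hyp):
    """Character error rate in percent, ignoring whitespace."""
    r = normalize(ref).replace(" ", "")
    h = normalize(hyp).replace(" ", "")
    return 100.0 * edit_distance(r, h) / len(r)

# Seed-TTS-eval Sample 1, Baseline transcript ("produced" is dropped).
ref = "Compressed Natural Gas is a domestic energy produced in Western parts of India."
hyp = "Compressed natural gas is a domestic energy in western parts of India."
print(f"{wer(ref, hyp):.1f}")  # 7.7
```

One deleted word out of 13 reference words gives 7.7%, matching the Baseline score shown for that sample.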

Seed-TTS-eval

Public Zero-Shot Benchmark (English)

Seed-TTS-eval Sample 1

Input Text

"Compressed Natural Gas is a domestic energy produced in Western parts of India."

Reference
Baseline WER: 7.7%
Transcribed "Compressed natural gas is a domestic energy in western parts of India."
ContrastiveFM WER: 0.0%
Transcribed "Compressed natural gas is a domestic energy produced in western parts of India."
RobustSpeechFlow WER: 0.0%
Transcribed "Compressed natural gas is a domestic energy produced in western parts of India."

Seed-TTS-eval Sample 2

Input Text

"The rest of the money can be based on playing time."

Reference
Baseline WER: 0.0%
Transcribed "The rest of the money can be based on playing time."
ContrastiveFM WER: 9.1%
Transcribed "The rest of the money can be... on playing time."
RobustSpeechFlow WER: 0.0%
Transcribed "The rest of the money can be based on playing time."

Seed-TTS-eval Sample 3

Input Text

"Babele is one of the most popular tourist destinations in the country."

Reference
Baseline WER: 8.3%
Transcribed "Bab-al-Ayi is one of the most popular tourist destinations in the country."
ContrastiveFM WER: 0.0%
Transcribed "Babeli is one of the most popular tourist destinations in the country."
RobustSpeechFlow WER: 0.0%
Transcribed "Babeli is one of the most popular tourist destinations in the country."

ZERO500-en

Multilingual Benchmark (English)

All reference voices are in-the-wild recordings used exclusively for evaluation and are never included in model training data.

ZERO500-en Sample 1

Input Text

"Well, I kept rehearsing what to say, and it sounded fine in my head. Then I heard my own voice shake, and I panicked a little. Give me a moment, and I’ll try again with a clearer, calmer tone."

Reference
Baseline WER: 17.9%
Transcribed "And it sounded fine in my head. Then I heard my own voice shake, and I panicked a little. Give me a moment, and I'll try again with a clearer, calmer tone."
ContrastiveFM WER: 0.0%
Transcribed "Well, I kept rehearsing what to say, and it sounded fine in my head. Then I heard my own voice shake, and I panicked a little. Give me a moment, and I'll try again with a clearer, calmer tone."
RobustSpeechFlow WER: 0.0%
Transcribed "Well, I kept rehearsing what to say, and it sounded fine in my head. Then I heard my own voice shake, and I panicked a little. Give me a moment, and I'll try again with a clearer, calmer tone."

ZERO500-en Sample 2

Input Text

"The script sounds natural, but the rhythm drifts in the third sentence and feels rushed."

Reference
Baseline WER: 13.3%
Transcribed "The script sounds natural, but it drifts in the third sentence and feels rushed."
ContrastiveFM WER: 0.0%
Transcribed "The script sounds natural, but the rhythm drifts in the third sentence and feels rushed."
RobustSpeechFlow WER: 0.0%
Transcribed "The script sounds natural, but the rhythm drifts in the third sentence and feels rushed."

ZERO500-en Sample 3

Input Text

"Well, that was unexpectedly stressful today."

Reference
Baseline WER: 16.7%
Transcribed "well, was unexpectedly stressful today."
ContrastiveFM WER: 0.0%
Transcribed "Well, that was unexpectedly stressful today."
RobustSpeechFlow WER: 0.0%
Transcribed "Well, that was unexpectedly stressful today."

ZERO500-ko

Multilingual Benchmark (Korean)

ZERO500-ko Sample 1

Input Text

"음, 빠르게 몇 가지 확인 질문만 드릴게요. 업데이트 이후에 문제가 시작됐는지, 아니면 아무 변화 없이 갑자기 생겼는지 기억나세요? 정확하지 않아도, 추정이라도 말해 주시면 원인 범위를 줄일 수 있습니다."

Reference
Baseline CER: 7.4%
Transcribed "빠르게 몇 가지 확인 질문만 드릴게요. 업데이트 이후에 문제가 시작됐는지 아니면 아무 변화 없이 갑자기 생겼는지 기억나세요? 정확하지 않아도 말해주시면 원인 범위를 줄일 수 있습니다."
ContrastiveFM CER: 0.2%
Transcribed "음, 빠르게 몇 가지 확인 질문만 드릴게요. 업데이트 이후에 문제가 시작됐는지 아니면 아무 변화 없이 갑자기 생겼는지 기억나세요? 정확하지 않아도 추정이라도 말해주시면 원인 범위를 줄일 수 있습니다."
RobustSpeechFlow CER: 0.2%
Transcribed "음, 빠르게 몇 가지 확인 질문만 드릴게요. 업데이트 이후에 문제가 시작됐는지 아니면 아무 변화 없이 갑자기 생겼는지 기억나세요? 정확하지 않아도 추정이라도 말해주시면 원인 범위를 줄일 수 있습니다."

ZERO500-ko Sample 2

Input Text

"음, 네 메시지는 다 읽었어. 지금 바로 답하면 또 말이 꼬일 것 같아, 그래서 잠깐만 숨 고르고 올게. 조금만 기다려 주면, 차분히 이야기하자."

Reference
Baseline CER: 7.5%
Transcribed "응, 내 메시지는 다 읽었어. 바로 답하면 또 말이 꼬일 것 같아. 그래서 잠깐만 숨 고르고 올게. 조금만 기다려주면 차분히 이야기하자."
ContrastiveFM CER: 3.8%
Transcribed "응. 내 메시지는 다 읽었어. 지금 바로 답하면 또 말이 꼬일 것 같아. 그래서 잠깐만 숨 고르고 올게. 조금만 기다려주면 차분히 이야기하자."
RobustSpeechFlow CER: 1.9%
Transcribed "음 내 메시지는 다 읽었어 지금 바로 답하면 또 말이 꼬일 것 같아 그래서 잠깐만 숨 고르고 올게 조금만 기다려주면 차분히 이야기하자"

ZERO500-ko Sample 3

Input Text

"주소 변경은 출고 전까지만 가능하며, 이후에는 배송사로 문의해야 합니다."

Reference
Baseline CER: 0.0%
Transcribed "주소 변경은 출고 전까지만 가능하며, 이후에는 배송사로 문의해야 합니다."
ContrastiveFM CER: 6.7%
Transcribed "주소 변경은 출고 저지만 가능하며, 이후에는 배송사로 문의해야 합니다."
RobustSpeechFlow CER: 0.0%
Transcribed "주소 변경은 출고 전까지만 가능하며, 이후에는 배송사로 문의해야 합니다."

ZERO500 Evaluation Benchmark

We release ZERO500, a multilingual benchmark designed to stress alignment under diverse speaker, prosody, and text conditions. Each set contains 500 text prompts paired with 50 unique reference voices drawn from game, news, and conversational speech domains. Each reference voice is randomly paired with 10 sentences, and each pair is synthesized twice with different random seeds. All reference voices are in-the-wild recordings used exclusively for evaluation and are never included in model training data. Only text prompts are provided below; reference audio is not publicly released.
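
The pairing protocol above can be sketched as follows: 50 voices, 10 prompts per voice, and two seeds per pair, giving 500 voice-text pairs and 1,000 synthesized utterances per language. The sketch assumes each prompt is used exactly once (which matches 500 prompts for 50 × 10 pairs but is not stated explicitly); the function name, placeholder IDs, and seed values are hypothetical.

```python
import random

def build_jobs(prompts, voices, per_voice=10, seeds=(0, 1), rng_seed=1234):
    """Randomly assign `per_voice` prompts to each voice, one job per (pair, seed)."""
    if len(prompts) != len(voices) * per_voice:
        raise ValueError("expected len(prompts) == len(voices) * per_voice")
    rng = random.Random(rng_seed)   # fixed seed so the pairing is reproducible
    shuffled = list(prompts)
    rng.shuffle(shuffled)
    jobs = []
    for v, voice in enumerate(voices):
        # Each voice receives a disjoint slice of the shuffled prompts.
        for prompt in shuffled[v * per_voice:(v + 1) * per_voice]:
            for seed in seeds:
                jobs.append((voice, prompt, seed))
    return jobs

prompts = [f"prompt_{i:03d}" for i in range(500)]  # placeholder prompt IDs
voices = [f"voice_{i:02d}" for i in range(50)]     # placeholder voice IDs
jobs = build_jobs(prompts, voices)
print(len(jobs))  # 1000
```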

ZERO500-en (English)

500 English text prompts covering diverse domains, phonetic challenges, rare words, and complex syntactic structures for robust TTS benchmarking.

500 sentences, CSV format
Download ZERO500-en

ZERO500-ko (Korean)

500 Korean text prompts with varied prosodic patterns, foreign loan words, and domain-diverse content for evaluating multilingual TTS robustness.

500 sentences, CSV format
Download ZERO500-ko