RobustSpeechFlow

Abstract

While flow-matching text-to-speech (TTS) achieves strong zero-shot speaker similarity and naturalness, it remains susceptible to content fidelity issues, particularly skip and repeat errors from imperfect alignment. We propose RobustSpeechFlow, a training strategy that improves alignment robustness by extending contrastive flow matching with length-preserving repeat and skip latent augmentations. Requiring no external aligners or preference data, our method directly penalizes realistic failure modes and readily integrates into existing pipelines. On Seed-TTS-eval, it reduces the word error rate (WER) from 1.44 to 1.38 using only 0.06B parameters. On our ZERO500 benchmark, it delivers consistent intelligibility improvements across diverse speaker and prosody conditions; at NFE=24, it reduces English character error rate (CER) from 0.48% to 0.35% and Korean CER from 0.81% to 0.57%.

Experimental Results

Quantitative Evaluation

We evaluate on the public Seed-TTS-eval benchmark and our newly constructed ZERO500 multilingual benchmark.

Performance on Seed-TTS-eval Benchmark

Model	Params	WER ↓	SIM ↑
MegaTTS3	0.5B	2.79	0.77
Seed-TTS_DiT	—	1.73	0.79
DiTAR	0.6B	1.69	0.74
MiniMax-Speech	—	1.65	0.69
F5-TTS	0.3B	2.00	0.67
CosyVoice3	1.5B	2.22	0.72
Spark-TTS	0.5B	3.14	0.57
OpenAudio S1-Mini	0.5B	1.94	0.55
IndexTTS2	1.5B	2.23	0.71
VibeVoice	1.5B	3.04	0.69
VoxCPM-Emilia	0.5B	2.34	0.68
VoxCPM	0.5B	1.85	0.73

Baseline	0.06B	1.44	0.60
ContrastiveFM	0.06B	1.41	0.60
RobustSpeechFlow	0.06B	1.38	0.60

Results on ZERO500 at 500k Steps (%)

Model	NFE	EN CER ↓	EN WER ↓	KO CER ↓	KO WER ↓
Baseline	12	0.55	1.25	0.93	8.46
Baseline	24	0.48	1.18	0.81	8.40
ContrastiveFM	12	0.41	1.10	0.77	7.92
ContrastiveFM	24	0.39	1.06	0.65	7.72

RobustSpeechFlow	12	0.43	1.14	0.57	7.59
RobustSpeechFlow	24	0.35	1.03	0.57	7.45
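
As a reading aid, the absolute error rates above can be turned into relative reductions. The sketch below only uses the NFE=24 rows of the ZERO500 table; the helper name `relative_reduction` is a hypothetical name chosen for illustration and does not come from the paper.

```python
# Values copied from the ZERO500 table (NFE=24, 500k steps), in percent.
baseline = {"EN CER": 0.48, "EN WER": 1.18, "KO CER": 0.81, "KO WER": 8.40}
robust = {"EN CER": 0.35, "EN WER": 1.03, "KO CER": 0.57, "KO WER": 7.45}


def relative_reduction(before: float, after: float) -> float:
    """Relative error reduction in percent: 100 * (before - after) / before."""
    return 100.0 * (before - after) / before


# Relative reduction of RobustSpeechFlow over the Baseline for each metric.
reductions = {m: relative_reduction(baseline[m], robust[m]) for m in baseline}

for metric, value in reductions.items():
    print(f"{metric}: {value:.1f}% relative reduction")
```

Under this reading, the English and Korean CER improvements at NFE=24 correspond to relative reductions of roughly 27% and 30%.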

CER (%) over Training Steps on ZERO500

[Figure: CER (%) over training steps on ZERO500. Panels: (a) ZERO500-en, NFE=12; (b) ZERO500-en, NFE=24; (d) ZERO500-ko, NFE=24. Legend: A = Baseline, B = ContrastiveFM, C = RobustSpeechFlow.]

CER (%) over training steps on ZERO500. RobustSpeechFlow shows consistent improvement, especially on Korean and at NFE=24, where it reaches the lowest final CER.

Hard Negative Augmentation

What Do Failure-Mode Negatives Sound Like?

RobustSpeechFlow constructs hard negatives by applying length-preserving repeat and skip augmentations to the ground-truth speech latent. Below we decode these augmented latents back to audio so you can hear what the model learns to avoid.

Repeat: A source span is copied to a different position in the utterance, overwriting the target region and simulating word or phrase repetition. The overall length stays the same.

Skip: A span is dropped by shifting the latent frames that follow it forward over the skipped region; the vacated tail frames are then filled with a silence latent. This simulates a local skip while preserving the global duration.
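
The two augmentations can be sketched on a toy latent sequence. This is a minimal illustration under stated assumptions, not the training code: the latent is modeled as a plain Python list of frames, the function names `repeat_augment` and `skip_augment` are hypothetical, the span positions and lengths are passed in by the caller (how they are sampled is not described here), and the silence latent is assumed to be a single frame supplied by the caller.

```python
from typing import List, Sequence

Frame = Sequence[float]  # one latent frame (a D-dimensional vector)


def repeat_augment(latent: Sequence[Frame], src_start: int,
                   dst_start: int, span: int) -> List[Frame]:
    """Copy `span` frames starting at src_start over the frames at dst_start."""
    n = len(latent)
    if span <= 0 or min(src_start, dst_start) < 0:
        raise ValueError("span must be positive and starts non-negative")
    if max(src_start, dst_start) + span > n:
        raise ValueError("both spans must fit inside the utterance")
    if src_start == dst_start:
        raise ValueError("source and target positions must differ")
    out = list(latent)
    # Equal-length slice assignment overwrites in place, so length is unchanged.
    out[dst_start:dst_start + span] = latent[src_start:src_start + span]
    return out


def skip_augment(latent: Sequence[Frame], skip_start: int,
                 span: int, silence: Frame) -> List[Frame]:
    """Drop `span` frames at skip_start, shift the rest forward, pad the tail."""
    n = len(latent)
    if span <= 0 or skip_start < 0 or skip_start + span > n:
        raise ValueError("skipped span must fit inside the utterance")
    kept = list(latent[:skip_start]) + list(latent[skip_start + span:])
    # Pad with `span` silence frames so the total length stays n.
    return kept + [silence] * span


# Toy latent: 8 frames of dimension 1, frame i holds the value i.
latent = [[float(i)] for i in range(8)]
silence = [-1.0]  # stand-in for a silence latent (assumption)

repeated = repeat_augment(latent, src_start=1, dst_start=4, span=2)
skipped = skip_augment(latent, skip_start=2, span=3, silence=silence)

print([f[0] for f in repeated])  # frames 1-2 now also appear at positions 4-5
print([f[0] for f in skipped])   # frames 2-4 are gone, tail holds silence
```

Both functions return a sequence of the same length as the input, which is the length-preserving property described above. In a real pipeline these would typically be tensor slice operations rather than list operations.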

Augmentation Example 1

Original Text (English)

"I would have won the Junior Olympics if not for those medaling kids."

Original

Clean ground-truth speech latent decoded to audio.

Repeat Augmented

A source span overwrites a different position, so it sounds as if a word or phrase is repeated.

Skip Augmented

A span is removed by shifting later content forward; the tail is padded with silence.

Augmentation Example 2

Original Text (Korean)

"파이와 아이를 곱한 수로 이를 거듭제곱해서 이를 더하면 영이 된다. 끝없이 순환하는 수와 정체를 나타내지 않는 수가 간결한 궤적을 그리며 한 점에 착지하고 이곳에 일을 더하는 순간 세계는 평화로워진다."

Original

Clean ground-truth speech latent decoded to audio.

Repeat Augmented

A source span overwrites a different position, so it sounds as if a word or phrase is repeated.

Skip Augmented

A span is removed by shifting later content forward; the tail is padded with silence.

Audio Demos

Listening Samples

Compare synthesized speech across Baseline, ContrastiveFM, and RobustSpeechFlow. Each sample shows the ASR-transcribed text and WER/CER for objective comparison.
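
For readers who want to check the per-sample scores, the sketch below computes WER and CER as edit distance divided by reference length. The ASR model and the text normalization behind the reported numbers are not specified on this page, so the normalization used here (NFC, lowercasing, punctuation removal, and ignoring spaces for CER) is an assumption, and the function names are hypothetical. The two checks at the end use Seed-TTS-eval Sample 1 and ZERO500-ko Sample 3 from this page.

```python
import re
import unicodedata
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance; substitutions, insertions, deletions cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # match or substitution
        prev = cur
    return prev[-1]


def normalize(text: str) -> List[str]:
    """Assumed normalization: NFC, lowercase, strip punctuation, split on spaces."""
    text = unicodedata.normalize("NFC", text).lower()
    return re.sub(r"[^\w\s]", "", text).split()


def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = normalize(reference), normalize(hypothesis)
    return edit_distance(ref, hyp) / len(ref)


def cer(reference: str, hypothesis: str) -> float:
    # Characters are compared with spaces removed (assumption).
    ref = list("".join(normalize(reference)))
    hyp = list("".join(normalize(hypothesis)))
    return edit_distance(ref, hyp) / len(ref)


en_ref = ("Compressed Natural Gas is a domestic energy produced in "
          "Western parts of India.")
en_hyp = "Compressed natural gas is a domestic energy in western parts of India."
ko_ref = "주소 변경은 출고 전까지만 가능하며, 이후에는 배송사로 문의해야 합니다."
ko_hyp = "주소 변경은 출고 저지만 가능하며, 이후에는 배송사로 문의해야 합니다."

en_wer = wer(en_ref, en_hyp)  # one deleted word out of 13
ko_cer = cer(ko_ref, ko_hyp)  # two character edits out of 30
print(f"WER: {100 * en_wer:.1f}%")  # 7.7%
print(f"CER: {100 * ko_cer:.1f}%")  # 6.7%
```

With this normalization the two examples reproduce the reported 7.7% WER and 6.7% CER; other samples may differ slightly if the original scoring normalizes text differently.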

Seed-TTS-eval

Public Zero-Shot Benchmark (English)

Seed-TTS-eval Sample 1

Input Text

"Compressed Natural Gas is a domestic energy produced in Western parts of India."

Reference

Baseline WER: 7.7%

Transcribed "Compressed natural gas is a domestic energy in western parts of India."

ContrastiveFM WER: 0.0%

Transcribed "Compressed natural gas is a domestic energy produced in western parts of India."

RobustSpeechFlow WER: 0.0%

Transcribed "Compressed natural gas is a domestic energy produced in western parts of India."

Seed-TTS-eval Sample 2

Input Text

"The rest of the money can be based on playing time."

Reference

Baseline WER: 0.0%

Transcribed "The rest of the money can be based on playing time."

ContrastiveFM WER: 9.1%

Transcribed "The rest of the money can be... on playing time."

RobustSpeechFlow WER: 0.0%

Transcribed "The rest of the money can be based on playing time."

Seed-TTS-eval Sample 3

Input Text

"Babele is one of the most popular tourist destinations in the country."

Reference

Baseline WER: 8.3%

Transcribed "Bab-al-Ayi is one of the most popular tourist destinations in the country."

ContrastiveFM WER: 0.0%

Transcribed "Babeli is one of the most popular tourist destinations in the country."

RobustSpeechFlow WER: 0.0%

Transcribed "Babeli is one of the most popular tourist destinations in the country."

ZERO500-en

Multilingual Benchmark (English)

All reference voices are in-the-wild recordings used exclusively for evaluation and are never included in model training data.

ZERO500-en Sample 1

Input Text

"Well, I kept rehearsing what to say, and it sounded fine in my head. Then I heard my own voice shake, and I panicked a little. Give me a moment, and I’ll try again with a clearer, calmer tone."

Reference

Baseline WER: 17.9%

Transcribed "And it sounded fine in my head. Then I heard my own voice shake, and I panicked a little. Give me a moment, and I'll try again with a clearer, calmer tone."

ContrastiveFM WER: 0.0%

Transcribed "Well, I kept rehearsing what to say, and it sounded fine in my head. Then I heard my own voice shake, and I panicked a little. Give me a moment, and I'll try again with a clearer, calmer tone."

RobustSpeechFlow WER: 0.0%

ZERO500-en Sample 2

Input Text

"The script sounds natural, but the rhythm drifts in the third sentence and feels rushed."

Reference

Baseline WER: 13.3%

Transcribed "The script sounds natural, but it drifts in the third sentence and feels rushed."

ContrastiveFM WER: 0.0%

Transcribed "The script sounds natural, but the rhythm drifts in the third sentence and feels rushed."

RobustSpeechFlow WER: 0.0%

Transcribed "The script sounds natural, but the rhythm drifts in the third sentence and feels rushed."

ZERO500-en Sample 3

Input Text

"Well, that was unexpectedly stressful today."

Reference

Baseline WER: 16.7%

Transcribed "well, was unexpectedly stressful today."

ContrastiveFM WER: 0.0%

Transcribed "Well, that was unexpectedly stressful today."

RobustSpeechFlow WER: 0.0%

Transcribed "Well, that was unexpectedly stressful today."

ZERO500-ko

Multilingual Benchmark (Korean)

All reference voices are in-the-wild recordings used exclusively for evaluation and are never included in model training data.

ZERO500-ko Sample 1

Input Text

"음, 빠르게 몇 가지 확인 질문만 드릴게요. 업데이트 이후에 문제가 시작됐는지, 아니면 아무 변화 없이 갑자기 생겼는지 기억나세요? 정확하지 않아도, 추정이라도 말해 주시면 원인 범위를 줄일 수 있습니다."

Reference

Baseline CER: 7.4%

Transcribed "빠르게 몇 가지 확인 질문만 드릴게요. 업데이트 이후에 문제가 시작됐는지 아니면 아무 변화 없이 갑자기 생겼는지 기억나세요? 정확하지 않아도 말해주시면 원인 범위를 줄일 수 있습니다."

ContrastiveFM CER: 0.2%

Transcribed "음, 빠르게 몇 가지 확인 질문만 드릴게요. 업데이트 이후에 문제가 시작됐는지 아니면 아무 변화 없이 갑자기 생겼는지 기억나세요? 정확하지 않아도 추정이라도 말해주시면 원인 범위를 줄일 수 있습니다."

RobustSpeechFlow CER: 0.2%

ZERO500-ko Sample 2

Input Text

"음, 네 메시지는 다 읽었어. 지금 바로 답하면 또 말이 꼬일 것 같아, 그래서 잠깐만 숨 고르고 올게. 조금만 기다려 주면, 차분히 이야기하자."

Reference

Baseline CER: 7.5%

Transcribed "응, 내 메시지는 다 읽었어. 바로 답하면 또 말이 꼬일 것 같아. 그래서 잠깐만 숨 고르고 올게. 조금만 기다려주면 차분히 이야기하자."

ContrastiveFM CER: 3.8%

Transcribed "응. 내 메시지는 다 읽었어. 지금 바로 답하면 또 말이 꼬일 것 같아. 그래서 잠깐만 숨 고르고 올게. 조금만 기다려주면 차분히 이야기하자."

RobustSpeechFlow CER: 1.9%

Transcribed "음 내 메시지는 다 읽었어 지금 바로 답하면 또 말이 꼬일 것 같아 그래서 잠깐만 숨 고르고 올게 조금만 기다려주면 차분히 이야기하자"

ZERO500-ko Sample 3

Input Text

"주소 변경은 출고 전까지만 가능하며, 이후에는 배송사로 문의해야 합니다."

Reference

Baseline CER: 0.0%

Transcribed "주소 변경은 출고 전까지만 가능하며, 이후에는 배송사로 문의해야 합니다."

ContrastiveFM CER: 6.7%

Transcribed "주소 변경은 출고 저지만 가능하며, 이후에는 배송사로 문의해야 합니다."

RobustSpeechFlow CER: 0.0%

Transcribed "주소 변경은 출고 전까지만 가능하며, 이후에는 배송사로 문의해야 합니다."
