Full-stream TTS with dynamic speaking rate control
We present VoXtream2, a zero-shot, full-stream TTS model whose speaking rate can be controlled dynamically and updated mid-utterance.
See also our previous work on low-latency streaming TTS, VoXtream (cited below).
The first row shows a gradual speed-up or slowdown of the speaking rate, and the second row shows how the model responds to sharp changes in the control signal.
| Speech Prompt | Text | VoiceStar | MaskGCT | VoXtream2 |
|---|---|---|---|---|
| Slow ➔ Fast | Look, I took the wrong bus once, and it dropped me two stops late, so I walked the rest of the way, and I found a small cafe, that had decent tea. | | | |
| Fast ➔ Slow | So, my cat stared at the door, like it heard something outside, so I opened the window, and it just sniffed the air, then walked away. | | | |
| Normal ➔ Fast | Okay, so my brother called me at 7, he sounded tired, but he still joked around, and it made me laugh, even though my day was messy, and I was glad. | | | |
| Normal ➔ Slow | Right, my cat stared at the door, like it heard something outside, so I opened the window, and it just sniffed the air, then walked away, and it felt worth it. | | | |
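
The two kinds of control signal described above (gradual and sharp) can be pictured as per-step target-rate schedules. The sketch below is only an illustration: the function names, the number of steps, and the rate values (taken here to be syllables per second) are assumptions made for this example, and this page does not describe how VoXtream2 actually consumes its control signal.

```python
from typing import List


def gradual_schedule(start: float, end: float, n_steps: int) -> List[float]:
    """Linearly interpolate the target rate from `start` to `end` (both included)."""
    if n_steps < 2:
        raise ValueError("n_steps must be at least 2")
    step = (end - start) / (n_steps - 1)
    return [start + i * step for i in range(n_steps)]


def sharp_schedule(before: float, after: float, n_steps: int, switch_at: int) -> List[float]:
    """Hold `before` for `switch_at` steps, then jump straight to `after`."""
    if not 0 <= switch_at <= n_steps:
        raise ValueError("switch_at must lie between 0 and n_steps")
    return [before] * switch_at + [after] * (n_steps - switch_at)


# Hypothetical rate values (syllables per second), chosen only for illustration.
slow_to_fast = gradual_schedule(3.0, 6.0, 7)      # gradual speed-up
fast_to_slow = gradual_schedule(6.0, 3.0, 7)      # gradual slowdown
normal_to_fast = sharp_schedule(4.0, 6.0, 7, 3)   # abrupt jump after 3 steps
normal_to_slow = sharp_schedule(4.0, 3.0, 7, 3)   # abrupt drop after 3 steps
```

A gradual schedule changes the target a little at every step, while a sharp schedule keeps it constant and then switches at once; these correspond to the two behaviours the demo samples are meant to illustrate.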
Prompt-text masking enables translingual (any language to English) voice cloning.
| Language | Speech Prompt | Text | VoXtream2 |
|---|---|---|---|
| Chinese | | VoXtream2 combines a distribution matching mechanism over duration states with classifier-free guidance across conditioning signals to improve controllability and synthesis quality. | |
| Hindi | | Prompt-text masking enables textless audio prompting, removing the need for prompt transcription. | |
| Spanish | | Across standard zero-shot benchmarks and a dedicated speaking-rate test set, VoXtream2 achieves competitive objective and subjective results against public baselines. | |
| Arabic | | In full-stream mode, it runs 4 times faster than real time with 74 ms first-packet latency on a consumer GPU. | |
| French | | It has long been argued that conversational agents must be able to generate speech incrementally. | |
| Russian | | Recent progress in neural text-to-speech (TTS) synthesis has led to highly natural and intelligible speech generation. | |
| Swedish | | However, most contemporary systems implicitly assume that speaking rate is static across an utterance, typically allowing only coarse, global control over speed. | |
@inproceedings{torgashov2026voxtream,
title = {Vo{X}tream: Full-Stream Text-to-Speech with Extremely Low Latency},
author = {Torgashov, Nikita and Henter, Gustav Eje and Skantze, Gabriel},
booktitle = {Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
year = {2026},
note = {to appear},
url = {https://arxiv.org/abs/2509.15969}
}
@article{torgashov2026voxtream2,
author = {Torgashov, Nikita and Henter, Gustav Eje and Skantze, Gabriel},
title = {Vo{X}tream2: Full-stream TTS with dynamic speaking rate control},
journal = {arXiv:2603.13518},
year = {2026}
}