VoXtream2: Full-stream TTS with dynamic speaking rate control

Nikita Torgashov Gustav Eje Henter Gabriel Skantze

Department of Speech, Music and Hearing, KTH Royal Institute of Technology, Stockholm, Sweden

Overview

We present VoXtream2, a zero-shot, full-stream TTS model whose speaking rate can be adjusted on the fly, mid-utterance.

Dynamic speed control: Distribution matching and classifier-free guidance enable fine-grained speaking-rate control that can be adjusted while the model generates speech.
Streaming performance: Runs 4 times faster than real time and achieves 74 ms first-packet latency in full-stream mode on a consumer GPU.
Translingual capability: Prompt-text masking supports acoustic prompts in any language.
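
For readers unfamiliar with classifier-free guidance, the generic formulation can be sketched in a few lines. This is the textbook rule applied to a vector of logits, not the VoXtream2 implementation; the function names, the toy logits, and the guidance scale are illustrative assumptions.

```python
import math


def cfg_logits(cond, uncond, scale):
    """Generic classifier-free guidance: uncond + scale * (cond - uncond).

    cond   -- logits predicted with the conditioning signal
    uncond -- logits predicted with the conditioning signal dropped
    scale  -- guidance scale (hypothetical value; 1.0 returns cond unchanged)
    """
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy logits over three candidate tokens (made-up numbers).
cond = [2.0, 1.0, 0.0]
uncond = [1.0, 1.0, 1.0]

guided = cfg_logits(cond, uncond, scale=2.0)
probs = softmax(guided)
```

A scale above 1.0 pushes the distribution further in the direction the conditioning signal favors, a scale of 1.0 leaves the conditional prediction unchanged, and a scale of 0.0 falls back to the unconditional prediction.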

See our previous work on low-latency streaming TTS, VoXtream (https://arxiv.org/abs/2509.15969).

Dynamic speaking-rate control

The first row shows a gradual speed-up or slowdown of the speaking rate; the second row shows how the model responds to abrupt changes in the control signal.
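
The two kinds of control signal described above, gradual and abrupt, could be sketched as simple per-step schedules. This is purely illustrative: the function names, the per-step granularity, and the syllables-per-second unit are assumptions, not part of the VoXtream2 interface.

```python
def ramp_schedule(start, end, n_steps):
    """Gradual change: linearly interpolate the target rate over n_steps."""
    if n_steps < 1:
        raise ValueError("n_steps must be at least 1")
    if n_steps == 1:
        return [start]
    return [start + (end - start) * i / (n_steps - 1) for i in range(n_steps)]


def step_schedule(before, after, n_steps, switch_at):
    """Abrupt change: hold one target rate, then jump to another at switch_at."""
    if n_steps < 1:
        raise ValueError("n_steps must be at least 1")
    return [before if i < switch_at else after for i in range(n_steps)]


# Hypothetical target rates in syllables per second.
gradual = ramp_schedule(2.0, 6.0, n_steps=5)
abrupt = step_schedule(3.0, 6.0, n_steps=4, switch_at=2)
```

Under these assumptions, each list element would be the target rate handed to the model at one generation step, so the first schedule mirrors the gradual examples and the second mirrors the sharp changes.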

Acoustic prompt

Well. My commute started out normal, then it got weird. A driver saw me running, but the door stayed closed. By then I was not angry, just tired of guessing the timing. I waited next to a man giving directions to everyone. The board changed twice before I reached the stop.

Acoustic prompt

Here is the thing. Tonight I made a simple meal, at least that was the plan. I tasted the soup too early and added salt too soon. I kept checking the stove like it might change by itself. Then I had to balance it out with more water. The pot looked messy at first, but it slowly came together.

Acoustic prompt

Okay, so. The meeting started on time and ended nowhere near on time. One person shared a screen with too many tabs open. Then we spent twenty minutes arguing about names for files. Everyone had a different version of what was agreed.

Acoustic prompt

All right, so. We both liked the movie, but we did not agree on why. I cared about the pacing and my friend cared about the music. It was more about how it felt than what actually happened. The funny part is that we agreed on the ending the whole time.

Static speaking-rate control

Speech Prompt	Text	VoiceStar	MaskGCT	VoXtream2
Slow ➔ Fast	Look, I took the wrong bus once, and it dropped me two stops late, so I walked the rest of the way, and I found a small cafe, that had decent tea.
Fast ➔ Slow	So, my cat stared at the door, like it heard something outside, so I opened the window, and it just sniffed the air, then walked away.
Normal ➔ Fast	Okay, so my brother called me at 7, he sounded tired, but he still joked around, and it made me laugh, even though my day was messy, and I was glad.
Normal ➔ Slow	Right, my cat stared at the door, like it heard something outside, so I opened the window, and it just sniffed the air, then walked away, and it felt worth it.

Translingual voice cloning

Prompt-text masking enables translingual voice cloning (from a prompt in any language to English speech).

Language	Speech Prompt	Text	VoXtream2
Chinese		VoXtream2 combines a distribution matching mechanism over duration states with classifier-free guidance across conditioning signals to improve controllability and synthesis quality.
Hindi		Prompt-text masking enables textless audio prompting, removing the need for prompt transcription.
Spanish		Across standard zero-shot benchmarks and a dedicated speaking-rate test set, VoXtream2 achieves competitive objective and subjective results against public baselines.
Arabic		In full-stream mode, it runs 4 times faster than real time with 74 ms first-packet latency on a consumer GPU.
French		It has long been argued that conversational agents must be able to generate speech incrementally.
Russian		Recent progress in neural text-to-speech (TTS) synthesis has led to highly natural and intelligible speech generation.
Swedish		However, most contemporary systems implicitly assume that speaking rate is static across an utterance, typically allowing only coarse, global control over speed.

Citation

@inproceedings{torgashov2026voxtream,
  title={Vo{X}tream: Full-Stream Text-to-Speech with Extremely Low Latency},
  author={Torgashov, Nikita and Henter, Gustav Eje and Skantze, Gabriel},
  booktitle={Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2026},
  note={to appear},
  url={https://arxiv.org/abs/2509.15969}
}

@article{torgashov2026voxtream2,
  author    = {Torgashov, Nikita and Henter, Gustav Eje and Skantze, Gabriel},
  title     = {Vo{X}tream2: Full-stream TTS with dynamic speaking rate control},
  journal   = {arXiv:2603.13518},
  year      = {2026}
}