VoXtream2

Full-stream TTS with dynamic speaking rate control

View the Project on GitHub herimor/voxtream2

Department of Speech, Music and Hearing, KTH Royal Institute of Technology, Stockholm, Sweden

Overview

We present VoXtream2, a zero-shot full-stream TTS model with dynamic speaking-rate control that can be updated mid-utterance.

  • Dynamic speed control: Distribution matching and classifier-free guidance allow fine-grained control of the speaking rate, which can be adjusted while the model generates speech.
  • Streaming performance: Runs 4 times faster than real time and achieves 74 ms first-packet latency in full-stream mode on a consumer GPU.
  • Translingual capability: Prompt text masking enables support of acoustic prompts in any language.
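
As a rough illustration of the classifier-free guidance mentioned above, the sketch below applies the standard guidance formula to a toy set of logits. It is not the VoXtream2 implementation: the function names, the three duration states, and the logit values are all hypothetical, and how the model combines guidance across its conditioning signals is not described on this page.

```python
import math


def cfg_logits(cond, uncond, scale):
    """Standard classifier-free guidance on a list of logits.

    scale = 0 returns the unconditional logits, scale = 1 returns the
    conditional logits, and scale > 1 pushes further toward the condition.
    """
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits over three duration states (short, medium, long).
cond = [2.0, 1.0, 0.0]    # speaking-rate condition present (assumed values)
uncond = [1.0, 1.0, 1.0]  # speaking-rate condition dropped (assumed values)

for scale in (0.0, 1.0, 2.0):
    probs = softmax(cfg_logits(cond, uncond, scale))
    print(scale, [round(p, 3) for p in probs])
```

Because the guidance scale is an ordinary number applied at each generation step, it can in principle be changed between steps, which is one way a control signal could be adjusted during generation.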

See our previous work on low-latency streaming TTS, VoXtream, listed under Citation.

Architecture

Dynamic speaking-rate control

The first row shows a gradual speed-up or slowdown in speaking rate, and the second row shows how the model responds to sharp changes in the control signal.
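
The two kinds of control signal described above can be pictured as per-step target-rate schedules. The sketch below is only an illustration: the helper names, the number of steps, and the rate values (and their unit) are assumptions, not the interface or settings used by VoXtream2.

```python
def ramp_schedule(start, end, num_steps):
    """Gradual change: linearly interpolate the target rate over num_steps."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    if num_steps == 1:
        return [start]
    return [start + (end - start) * i / (num_steps - 1) for i in range(num_steps)]


def step_schedule(levels, steps_per_level):
    """Sharp changes: hold each target rate, then jump to the next one."""
    return [level for level in levels for _ in range(steps_per_level)]


# Hypothetical target rates; the unit (e.g. syllables per second) is assumed.
gradual_speed_up = ramp_schedule(2.0, 6.0, 5)
sharp_changes = step_schedule([3.0, 6.0, 3.0], 2)

print(gradual_speed_up)
print(sharp_changes)
```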

Sequential speaking-rate control
Acoustic prompt
Well. My commute started out normal, then it got weird. A driver saw me running, but the door stayed closed. By then I was not angry, just tired of guessing the timing. I waited next to a man giving directions to everyone. The board changed twice before I reached the stop.
Acoustic prompt
Here is the thing. Tonight I made a simple meal, at least that was the plan. I tasted the soup too early and added salt too soon. I kept checking the stove like it might change by itself. Then I had to balance it out with more water. The pot looked messy at first, but it slowly came together.
Interleaved speaking-rate control
Acoustic prompt
Okay, so. The meeting started on time and ended nowhere near on time. One person shared a screen with too many tabs open. Then we spent twenty minutes arguing about names for files. Everyone had a different version of what was agreed.
Acoustic prompt
All right, so. We both liked the movie, but we did not agree on why. I cared about the pacing and my friend cared about the music. It was more about how it felt than what actually happened. The funny part is that we agreed on the ending the whole time.

Static speaking-rate control

Speech Prompt | Text | VoiceStar | MaskGCT | VoXtream2
Slow ➔ Fast

Look, I took the wrong bus once, and it dropped me two stops late, so I walked the rest of the way, and I found a small cafe, that had decent tea.


Fast ➔ Slow

So, my cat stared at the door, like it heard something outside, so I opened the window, and it just sniffed the air, then walked away.


Normal ➔ Fast

Okay, so my brother called me at 7, he sounded tired, but he still joked around, and it made me laugh, even though my day was messy, and I was glad.


Normal ➔ Slow

Right, my cat stared at the door, like it heard something outside, so I opened the window, and it just sniffed the air, then walked away, and it felt worth it.


Translingual voice cloning

Prompt text masking allows for translingual (any language to English) voice cloning.

Language | Speech Prompt | Text | VoXtream2
Chinese
VoXtream2 combines a distribution matching mechanism over duration states with classifier-free guidance across conditioning signals to improve controllability and synthesis quality.
Hindi
Prompt-text masking enables textless audio prompting, removing the need for prompt transcription.
Spanish
Across standard zero-shot benchmarks and a dedicated speaking-rate test set, VoXtream2 achieves competitive objective and subjective results against public baselines.
Arabic
In full-stream mode, it runs 4 times faster than real time with 74 ms first-packet latency on a consumer GPU.
French
It has long been argued that conversational agents must be able to generate speech incrementally.
Russian
Recent progress in neural text-to-speech (TTS) synthesis has led to highly natural and intelligible speech generation.
Swedish
However, most contemporary systems implicitly assume that speaking rate is static across an utterance, typically allowing only coarse, global control over speed.

Citation

@inproceedings{torgashov2026voxtream,
  title={Vo{X}tream: Full-Stream Text-to-Speech with Extremely Low Latency},
  author={Torgashov, Nikita and Henter, Gustav Eje and Skantze, Gabriel},
  booktitle={Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2026},
  note={to appear},
  url={https://arxiv.org/abs/2509.15969}
}

@article{torgashov2026voxtream2,
  author    = {Torgashov, Nikita and Henter, Gustav Eje and Skantze, Gabriel},
  title     = {Vo{X}tream2: Full-stream TTS with dynamic speaking rate control},
  journal   = {arXiv:2603.13518},
  year      = {2026}
}