VoXtream: Full-Stream Text-to-Speech with Extremely Low Latency

Nikita Torgashov, Gustav Eje Henter, Gabriel Skantze

Department of Speech, Music and Hearing, KTH Royal Institute of Technology, Stockholm, Sweden

Overview

We present VoXtream, a fully autoregressive, zero-shot streaming text-to-speech system for real-time use that begins speaking from the first word.

Streaming: Supports a full-stream scenario, where the full sentence is not known in advance. The model takes a text stream arriving word by word as input and outputs an audio stream in 80 ms chunks.
Speed: Runs 5x faster than real time and achieves 102 ms first-packet latency on GPU.
Quality and efficiency: With only 9k hours of training data, it matches or surpasses the quality and intelligibility of larger models or models trained on large datasets.
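
To make the speed figures concrete, here is a minimal arithmetic sketch in Python. It uses only the numbers stated above (80 ms chunks, 5x faster than real time, 102 ms first-packet latency). The helper names are hypothetical, and the per-chunk compute time is a derived average that assumes the speedup is uniform across chunks; it is not a measured value.

```python
import math

CHUNK_MS = 80          # audio chunk duration stated above
SPEEDUP = 5.0          # "5x faster than real time"
FIRST_PACKET_MS = 102  # first-packet latency on GPU stated above


def real_time_factor(speedup: float) -> float:
    """Compute time divided by audio duration (lower is faster)."""
    return 1.0 / speedup


def compute_time_per_chunk_ms(chunk_ms: float, speedup: float) -> float:
    """Average compute time per chunk, assuming a uniform speedup (an assumption)."""
    return chunk_ms / speedup


def chunks_for_duration(duration_ms: float, chunk_ms: float) -> int:
    """Number of chunks needed to cover a given audio duration."""
    return math.ceil(duration_ms / chunk_ms)


if __name__ == "__main__":
    print("real-time factor:", real_time_factor(SPEEDUP))
    print("avg compute per chunk (ms):", compute_time_per_chunk_ms(CHUNK_MS, SPEEDUP))
    print("chunks for 1 s of audio:", chunks_for_duration(1000, CHUNK_MS))
    print("first-packet latency (ms):", FIRST_PACKET_MS)
```

Under these assumptions, each 80 ms chunk takes far less time to generate than to play, which is what leaves headroom for uninterrupted streaming playback.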

Try VoXtream ⚡ in your browser on Hugging Face 🤗 Spaces.

Short-form Zero-shot TTS

Speech Prompt	Text	XTTS-v2	VoiceCraft	VoXtream
	In general, however, some method is then needed to evaluate each approximation.
	His half-brother, Richard Ainley, was also an actor.
	A number of choirs, bands and sporting clubs are also present.
	Staff do not always do enough to prevent violence.

Full-stream Zero-shot TTS

In the full-stream scenario, the input text arrives incrementally, word by word, emulating the output of an LLM.
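
As a rough illustration of this input condition, the sketch below emulates a word-by-word text stream in Python. It is not VoXtream's interface: `word_stream` and `visible_prefixes` are hypothetical helpers, and the optional delay is only a stand-in for the pace at which an LLM might emit words.

```python
import time
from typing import Iterable, Iterator


def word_stream(text: str, delay_s: float = 0.0) -> Iterator[str]:
    """Yield the words of `text` one at a time, optionally pausing between them."""
    for word in text.split():
        if delay_s > 0:
            time.sleep(delay_s)  # stand-in for upstream generation time
        yield word


def visible_prefixes(words: Iterable[str]) -> Iterator[str]:
    """Yield the text visible to a downstream consumer after each new word."""
    seen = []
    for word in words:
        seen.append(word)
        yield " ".join(seen)


if __name__ == "__main__":
    sample = "the other and only in consequence of that identity"
    for prefix in visible_prefixes(word_stream(sample)):
        # A full-stream TTS system must start speaking from prefixes like these,
        # without access to the rest of the sentence.
        print(prefix)
```

The point of the sketch is that, at any moment, the consumer sees only a prefix of the sentence and never the words that have not yet arrived.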

Speech Prompt	Text	CosyVoice2	VoXtream
	were too much for his pursuer and he was able to flap his way onward in a cloud of foam while doom hung low above his head yet hesitated to strike
	the other and only in consequence of that identity had hester contrived so perfectly to represent the scarlet letter in her appearance
	under the strenuous competition that was already springing up was enormously improved by the introduction of the three wire system and it gave an immediate impetus to incandescent lighting
	for my servant girl for when she was brought to life she would not be proud nor haughty as the glass cat is for such a dreadful mixture of colors would discourage her from trying to be as dignified as the blue munchkins are

Citation

@article{torgashov2025voxtream,
  author    = {Torgashov, Nikita and Henter, Gustav Eje and Skantze, Gabriel},
  title     = {Vo{X}tream: Full-Stream Text-to-Speech with Extremely Low Latency},
  journal   = {arXiv:2509.15969},
  year      = {2025}
}