Department of Speech, Music and Hearing, KTH Royal Institute of Technology, Stockholm, Sweden
Overview
We present VoXtream, a fully autoregressive, zero-shot streaming text-to-speech system for real-time use that begins speaking from the first word.
Streaming: Supports a full-stream scenario in which the full sentence is not known in advance. The model takes a text stream arriving word by word as input and outputs an audio stream in 80 ms chunks.
Speed: Runs 5x faster than real time and achieves a first-packet latency of 102 ms on GPU.
Quality and efficiency: With only 9k hours of training data, it matches or surpasses the quality and intelligibility of larger models or models trained on large datasets.
Try VoXtream ⚡ in your browser on HuggingFace 🤗 spaces.
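To make the figures above concrete, the following is a minimal arithmetic sketch, not part of the VoXtream codebase. It assumes that "5x faster than real time" corresponds to a real-time factor of 1/5 and that compute is spread evenly across chunks; the function names are hypothetical and chosen only for illustration.

```python
import math

# Figures quoted in the overview above.
CHUNK_MS = 80          # duration of each emitted audio chunk
SPEEDUP = 5.0          # "5x faster than real time"
FIRST_PACKET_MS = 102  # reported first-packet latency on GPU


def real_time_factor(speedup: float = SPEEDUP) -> float:
    """Seconds of compute per second of audio (assumed to be 1 / speedup)."""
    return 1.0 / speedup


def chunks_needed(audio_ms: float, chunk_ms: int = CHUNK_MS) -> int:
    """Number of fixed-size chunks required to cover `audio_ms` of speech."""
    return math.ceil(audio_ms / chunk_ms)


def compute_ms_per_chunk(speedup: float = SPEEDUP, chunk_ms: int = CHUNK_MS) -> float:
    """Average compute time per chunk, assuming uniform cost across chunks."""
    return chunk_ms / speedup


if __name__ == "__main__":
    print("real-time factor:", real_time_factor())
    print("chunks for 1 s of audio:", chunks_needed(1000))
    print("average compute per chunk (ms):", compute_ms_per_chunk())
```

Under these assumptions, each 80 ms chunk takes about 16 ms to generate on average, so once the first packet has arrived, audio is produced faster than it is played back.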
Short-form Zero-shot TTS
Speech Prompt | Text | XTTS-v2 | VoiceCraft | VoXtream
In general, however, some method is then needed to evaluate each approximation.
His half-brother, Richard Ainley, was also an actor.
A number of choirs, bands and sporting clubs are also present.
Staff do not always do enough to prevent violence.
Full-stream Zero-shot TTS
In the full-stream scenario, the input text arrives incrementally, word by word, emulating the output of an LLM.
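The word-by-word input described above can be emulated with a simple generator. This is a hedged sketch of the input side only: `word_stream` and `feed_incrementally` are hypothetical helpers, and the `on_word` callback merely stands in for whatever a streaming synthesizer would do with each incoming word; it is not the VoXtream API.

```python
from typing import Callable, Iterable, Iterator, List


def word_stream(text: str) -> Iterator[str]:
    """Yield the words of `text` one at a time, like incremental LLM output."""
    for word in text.split():
        yield word


def feed_incrementally(
    words: Iterable[str],
    on_word: Callable[[str, List[str]], None],
) -> List[str]:
    """Pass each word to `on_word` together with the prefix seen so far.

    The consumer never sees future words, which is the defining constraint
    of the full-stream scenario.
    """
    seen: List[str] = []
    for word in words:
        seen.append(word)
        on_word(word, list(seen))  # pass a copy so the consumer cannot mutate state
    return seen


if __name__ == "__main__":
    def show(word: str, prefix: List[str]) -> None:
        print(f"received {word!r}; text so far: {' '.join(prefix)!r}")

    feed_incrementally(word_stream("staff do not always do enough"), show)
```

A real consumer would start emitting audio as soon as enough of the prefix is available, rather than waiting for the generator to finish.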
Speech Prompt | Text | CosyVoice2 | VoXtream
were too much for his pursuer and he was able to flap his way onward in a cloud of foam while doom hung low above his head yet hesitated to strike
the other and only in consequence of that identity had hester contrived so perfectly to represent the scarlet letter in her appearance
under the strenuous competition that was already springing up was enormously improved by the introduction of the three wire system and it gave an immediate impetus to incandescent lighting
for my servant girl for when she was brought to life she would not be proud nor haughty as the glass cat is for such a dreadful mixture of colors would discourage her from trying to be as dignified as the blue munchkins are
Citation
@inproceedings{torgashov2026voxtream,
title={Vo{X}tream: Full-Stream Text-to-Speech with Extremely Low Latency},
author={Torgashov, Nikita and Henter, Gustav Eje and Skantze, Gabriel},
booktitle={Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
year={2026},
note={to appear},
url={https://arxiv.org/abs/2509.15969}
}