Full-Stream Text-to-Speech with Extremely Low Latency
We present VoXtream, a fully autoregressive, zero-shot streaming text-to-speech system for real-time use that begins speaking from the first word.
Try VoXtream ⚡ in your browser on Hugging Face 🤗 Spaces.
| Speech Prompt | Text | XTTS-v2 | VoiceCraft | VoXtream |
|---|---|---|---|---|
| | In general, however, some method is then needed to evaluate each approximation. | | | |
| | His half-brother, Richard Ainley, was also an actor. | | | |
| | A number of choirs, bands and sporting clubs are also present. | | | |
| | Staff do not always do enough to prevent violence. | | | |
In the full-stream scenario, the input text arrives incrementally, word by word, emulating the output of an LLM.
| Speech Prompt | Text | CosyVoice2 | VoXtream |
|---|---|---|---|
| | were too much for his pursuer and he was able to flap his way onward in a cloud of foam while doom hung low above his head yet hesitated to strike | | |
| | the other and only in consequence of that identity had hester contrived so perfectly to represent the scarlet letter in her appearance | | |
| | under the strenuous competition that was already springing up was enormously improved by the introduction of the three wire system and it gave an immediate impetus to incandescent lighting | | |
| | for my servant girl for when she was brought to life she would not be proud nor haughty as the glass cat is for such a dreadful mixture of colors would discourage her from trying to be as dignified as the blue munchkins are | | |
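
The word-by-word input of the full-stream scenario can be illustrated with a small sketch. It is not the VoXtream API: `word_stream` and `consume_full_stream` are hypothetical names, and the consumer only records which text prefix would be available at each step, standing in for a streaming synthesizer.

```python
import time
from typing import Iterator, List


def word_stream(text: str, delay_s: float = 0.0) -> Iterator[str]:
    """Yield the words of `text` one at a time, emulating incremental LLM output.

    `delay_s` is an assumed per-word delay, used only to mimic generation time.
    """
    for word in text.split():
        if delay_s > 0:
            time.sleep(delay_s)
        yield word


def consume_full_stream(words: Iterator[str]) -> List[str]:
    """Hypothetical stand-in for a streaming TTS front end.

    Records the text prefix visible after each incoming word; a real
    system would begin generating audio as soon as the first word arrives.
    """
    prefix: List[str] = []
    prefixes: List[str] = []
    for word in words:
        prefix.append(word)
        prefixes.append(" ".join(prefix))
    return prefixes


if __name__ == "__main__":
    for visible_text in consume_full_stream(word_stream("the other and only")):
        print(visible_text)
```

Each printed line is the text available to the consumer at that step, which grows by one word at a time instead of being supplied as a complete sentence up front.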
@article{torgashov2025voxtream,
author = {Torgashov, Nikita and Henter, Gustav Eje and Skantze, Gabriel},
title = {Vo{X}tream: Full-Stream Text-to-Speech with Extremely Low Latency},
journal = {arXiv:2509.15969},
year = {2025}
}