Full-Stream Text-to-Speech with Extremely Low Latency
We present VoXtream, a fully autoregressive, zero-shot streaming text-to-speech system for real-time use that begins speaking from the first word.
Speech Prompt | Text | XTTS-v2 | VoiceCraft | VoXtream |
---|---|---|---|---|
(audio) | In general, however, some method is then needed to evaluate each approximation. | (audio) | (audio) | (audio) |
(audio) | His half-brother, Richard Ainley, was also an actor. | (audio) | (audio) | (audio) |
(audio) | A number of choirs, bands and sporting clubs are also present. | (audio) | (audio) | (audio) |
(audio) | Staff do not always do enough to prevent violence. | (audio) | (audio) | (audio) |
In the full-stream scenario, the input text arrives incrementally, word by word, emulating the output of an LLM.
Speech Prompt | Text | CosyVoice2 | VoXtream |
---|---|---|---|
(audio) | were too much for his pursuer and he was able to flap his way onward in a cloud of foam while doom hung low above his head yet hesitated to strike | (audio) | (audio) |
(audio) | the other and only in consequence of that identity had hester contrived so perfectly to represent the scarlet letter in her appearance | (audio) | (audio) |
(audio) | under the strenuous competition that was already springing up was enormously improved by the introduction of the three wire system and it gave an immediate impetus to incandescent lighting | (audio) | (audio) |
(audio) | for my servant girl for when she was brought to life she would not be proud nor haughty as the glass cat is for such a dreadful mixture of colors would discourage her from trying to be as dignified as the blue munchkins are | (audio) | (audio) |
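The word-by-word input used in the full-stream scenario can be emulated with a simple generator. The sketch below is illustrative only: `word_stream` and `incremental_prefixes` are hypothetical helper names, not part of the VoXtream code, and splitting on whitespace is an assumption about how words are delimited.

```python
from typing import Iterable, Iterator


def word_stream(text: str) -> Iterator[str]:
    """Yield the text one whitespace-delimited word at a time,
    mimicking text that arrives incrementally (e.g. from an LLM)."""
    for word in text.split():
        yield word


def incremental_prefixes(words: Iterable[str]) -> Iterator[str]:
    """Yield the growing text prefix available after each new word."""
    seen = []
    for word in words:
        seen.append(word)
        yield " ".join(seen)


if __name__ == "__main__":
    sentence = "staff do not always do enough to prevent violence"
    for prefix in incremental_prefixes(word_stream(sentence)):
        # A streaming TTS system would start synthesising from the first prefix.
        print(prefix)
```

At each step, a streaming system only has access to the prefix printed so far, which is what distinguishes this setting from feeding the complete sentence at once.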
VoXtream can generate very long sequences in a full-stream scenario with high intelligibility. The model is punctuation-agnostic and works with a very limited look-ahead of at most 10 phonemes, which is why the prosody is not perfect. Even so, VoXtream preserves the speaker's voice and produces understandable speech for up to a minute. An example of such a generation:
Text: We present VoXtream, a zero-shot streaming text-to-speech system for real-time use that begins speaking from the first word. VoXtream directly maps incoming phonemes to audio tokens using a monotonic alignment and a dynamic look-ahead. Built around an incremental phoneme transformer, a temporal transformer predicting semantic and duration tokens, and a depth transformer producing acoustic tokens, VoXtream achieves the lowest initial delay among public streaming text-to-speech: 102 milliseconds. Despite being trained on a 9000-hour corpus, it matches or surpasses larger baselines on several metrics. The audio you are listening to is generated by our model in full-streaming mode.
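To make the limited look-ahead concrete, the following sketch shows which phonemes would be visible at a given position under a cap of 10 future phonemes. It is a simplified illustration, not the model's actual mechanism: `visible_window` is a hypothetical function, the phoneme symbols are placeholders, and the assumption is that the window covers the current phoneme plus whatever future phonemes have already been received, up to the cap.

```python
from typing import List

MAX_LOOKAHEAD = 10  # cap stated in the text: at most 10 phonemes


def visible_window(received: List[str], position: int,
                   max_lookahead: int = MAX_LOOKAHEAD) -> List[str]:
    """Return the current phoneme plus up to `max_lookahead` future
    phonemes, limited to what has already arrived in `received`."""
    if not 0 <= position < len(received):
        raise IndexError("position is outside the received phonemes")
    return received[position:position + 1 + max_lookahead]


if __name__ == "__main__":
    # Placeholder symbols standing in for a real phoneme sequence.
    phonemes = [f"p{i}" for i in range(25)]
    print(len(visible_window(phonemes, 0)))   # current phoneme + 10 future ones
    print(visible_window(phonemes, 20))       # fewer remain near the end
```

Because the window never extends beyond what has arrived, the look-ahead shrinks naturally when the incoming text is only a few phonemes ahead of the audio being generated.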
@article{torgashov2025voxtream,
  author  = {Torgashov, Nikita and Henter, Gustav Eje and Skantze, Gabriel},
  title   = {Vo{X}tream: Full-Stream Text-to-Speech with Extremely Low Latency},
  journal = {arXiv:2509.15969},
  year    = {2025}
}