VoXtream

Full-Stream Text-to-Speech with Extremely Low Latency

View the Project on GitHub herimor/voxtream

Department of Speech, Music and Hearing, KTH Royal Institute of Technology, Stockholm, Sweden

Overview

We present VoXtream, a fully autoregressive, zero-shot streaming text-to-speech system for real-time use that begins speaking from the first word.

  • Streaming: Supports a full-stream scenario, where the full sentence is not known in advance. The model takes a text stream arriving word by word as input and outputs an audio stream in 80 ms chunks.
  • Speed: Runs 5 times faster than real-time and achieves a first-packet latency of 102 ms on GPU.
  • Quality and efficiency: With only 9k hours of training data, it matches or surpasses the quality and intelligibility of larger models and of models trained on larger datasets.
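
The figures in the list above imply a simple compute budget. The Python sketch below works through that arithmetic. It assumes the 5x speed-up applies uniformly to every 80 ms chunk, which the text does not state, and all function names are illustrative rather than part of the VoXtream code.

```python
import math

CHUNK_MS = 80.0          # audio chunk duration, from the text
SPEEDUP = 5.0            # "5 times faster than real-time", from the text
FIRST_PACKET_MS = 102.0  # first-packet latency on GPU, from the text


def compute_ms_per_chunk(chunk_ms: float = CHUNK_MS, speedup: float = SPEEDUP) -> float:
    """Average compute time per chunk, ASSUMING a uniform speed-up."""
    return chunk_ms / speedup


def real_time_factor(speedup: float = SPEEDUP) -> float:
    """Compute time divided by audio duration (lower is faster)."""
    return 1.0 / speedup


def chunks_for(duration_s: float, chunk_ms: float = CHUNK_MS) -> int:
    """Number of chunks needed to cover duration_s seconds of audio."""
    return math.ceil(duration_s * 1000.0 / chunk_ms)


if __name__ == "__main__":
    print("compute per chunk (ms):", compute_ms_per_chunk())
    print("real-time factor:", real_time_factor())
    print("chunks in 10 s of audio:", chunks_for(10))
```

Under that assumption each 80 ms chunk costs about 16 ms of compute, so once the first packet has arrived after 102 ms, generation stays ahead of playback.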

Architecture

Short-form Zero-shot TTS

Speech Prompt | Text | XTTS-v2 | VoiceCraft | VoXtream

In general, however, some method is then needed to evaluate each approximation.



His half-brother, Richard Ainley, was also an actor.



A number of choirs, bands and sporting clubs are also present.



Staff do not always do enough to prevent violence.


Full-stream Zero-shot TTS

In the full-stream scenario, the input text arrives incrementally, word by word, emulating the output of an LLM.
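
As a rough illustration of this setup, the Python sketch below feeds words one at a time into a streaming consumer that emits output for each word before the next one arrives. Every name in it is hypothetical: `llm_word_stream` stands in for an LLM, and `dummy_synthesize_word` is a placeholder that returns a tagged string instead of audio. It does not reflect the VoXtream API.

```python
from typing import Callable, Iterable, Iterator


def llm_word_stream(text: str) -> Iterator[str]:
    """Hypothetical stand-in for an LLM: yields one word at a time."""
    for word in text.split():
        yield word


def dummy_synthesize_word(word: str) -> Iterator[str]:
    """Placeholder synthesizer: yields a tag where a real system would yield audio chunks."""
    yield f"<audio:{word}>"


def full_stream_tts(
    words: Iterable[str],
    synthesize_word: Callable[[str], Iterator[str]],
) -> Iterator[str]:
    """Emit output for each word as soon as it arrives, without waiting for the full sentence."""
    for word in words:
        yield from synthesize_word(word)


if __name__ == "__main__":
    for chunk in full_stream_tts(llm_word_stream("hello streaming world"), dummy_synthesize_word):
        print(chunk)
```

Because both sides are generators, the first output is produced after only the first word has been consumed, which is the property the full-stream scenario relies on.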

Speech Prompt | Text | CosyVoice2 | VoXtream

were too much for his pursuer and he was able to flap his way onward in a cloud of foam while doom hung low above his head yet hesitated to strike


the other and only in consequence of that identity had hester contrived so perfectly to represent the scarlet letter in her appearance


under the strenuous competition that was already springing up was enormously improved by the introduction of the three wire system and it gave an immediate impetus to incandescent lighting


for my servant girl for when she was brought to life she would not be proud nor haughty as the glass cat is for such a dreadful mixture of colors would discourage her from trying to be as dignified as the blue munchkins are

Long-form TTS

VoXtream can generate very long sequences in a full-stream scenario with high intelligibility. The model is punctuation-agnostic and works with a very limited look-ahead of at most 10 phonemes, which is why the prosody is not perfect. Even so, VoXtream preserves the speaker's voice and produces understandable speech for up to a minute. An example of such generation:

Text: We present VoXtream, a zero-shot streaming text-to-speech system for real-time use that begins speaking from the first word. VoXtream directly maps incoming phonemes to audio tokens using a monotonic alignment and a dynamic look-ahead. Built around an incremental phoneme transformer, a temporal transformer predicting semantic and duration tokens, and a depth transformer producing acoustic tokens, VoXtream achieves the lowest initial delay among public streaming text-to-speech: 102 milliseconds. Despite being trained on a 9000-hour corpus, it matches or surpasses larger baselines on several metrics. The audio you are listening to is generated by our model in full-streaming mode.
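
The limited look-ahead mentioned above can be pictured as a sliding window over the phonemes received so far. The Python sketch below is a simplified illustration under stated assumptions: the helper `visible_phonemes` is hypothetical, and it assumes the window is simply the current phoneme plus up to 10 future phonemes that have already arrived. The model's actual dynamic look-ahead mechanism is not described on this page.

```python
from typing import List, Sequence

MAX_LOOKAHEAD = 10  # "at most 10 phonemes", from the text


def visible_phonemes(
    arrived: Sequence[str],
    position: int,
    max_lookahead: int = MAX_LOOKAHEAD,
) -> List[str]:
    """Return the current phoneme plus up to max_lookahead future phonemes.

    Only phonemes that have already arrived can be included, so the
    window shrinks when the text stream has not yet caught up.
    """
    return list(arrived[position : position + 1 + max_lookahead])


if __name__ == "__main__":
    phonemes = list("abcdefghijklmnop")  # 16 placeholder symbols, not real phonemes
    print(len(visible_phonemes(phonemes, 0)))      # full window available
    print(len(visible_phonemes(phonemes, 10)))     # window cut short by end of stream
    print(len(visible_phonemes(phonemes[:3], 0)))  # only 3 phonemes have arrived
```

Generation never has to wait for a full window: if fewer than 10 future phonemes are available, the sketch simply returns what exists.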

Citation

@article{torgashov2025voxtream,
  author    = {Torgashov, Nikita and Henter, Gustav Eje and Skantze, Gabriel},
  title     = {Vo{X}tream: Full-Stream Text-to-Speech with Extremely Low Latency},
  journal   = {arXiv:2509.15969},
  year      = {2025}
}