LuxTTS: Clone Any Voice in 3 Seconds, Run It on a 4GB GPU
The barrier to entry for voice cloning just dropped significantly.
LuxTTS clones a voice from 3 seconds of reference audio, runs at 150x realtime speed on a single GPU, and fits entirely within 1GB of VRAM. On CPU — no GPU at all — it still runs faster than realtime. It outputs at 48kHz, double the 24kHz that most TTS models produce.
The model is built on ZipVoice, the weights are small enough to fit in 1GB of VRAM, and the code is MIT licensed.
The Numbers That Matter
| Metric | LuxTTS | Typical TTS |
|---|---|---|
| VRAM required | 1GB | 4–8GB+ |
| Speed (GPU) | 150x realtime | 10–30x realtime |
| Speed (CPU) | >1x realtime | Slower than realtime |
| Output quality | 48kHz | 24kHz |
| Reference audio needed | 3 seconds | 5–30 seconds |
Running faster than realtime on CPU is the surprising one. It means LuxTTS is genuinely deployable on any machine — no GPU required, no cloud API, no latency from network calls.
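A quick sanity check on what those realtime factors mean in wall-clock terms (the factors are the article's; the 60-second clip and the 2x CPU figure are illustrative assumptions):

```python
# Rough arithmetic for realtime factors: "Nx realtime" means N seconds
# of audio are synthesized per second of compute.
def generation_time(audio_seconds: float, realtime_factor: float) -> float:
    """Seconds of compute needed to synthesize `audio_seconds` of speech."""
    return audio_seconds / realtime_factor

# A 60-second clip on GPU at 150x realtime:
gpu = generation_time(60, 150)  # 0.4 s of compute
# The same clip on CPU at an assumed 2x realtime (anything >1x beats playback):
cpu = generation_time(60, 2)    # 30 s of compute
print(f"GPU: {gpu:.2f}s, CPU: {cpu:.1f}s")
```

In other words, at 150x realtime a full minute of speech costs well under half a second of GPU time.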
How It Works
LuxTTS is built on ZipVoice — a lightweight flow-matching TTS architecture designed for efficiency without sacrificing quality. The approach prioritizes:
- Compact model size — 1GB VRAM is the entire footprint, not a minimum requirement
- Few-step generation — 3–4 sampling steps hits the sweet spot of quality vs. speed
- Direct waveform output at 48kHz — no upsampling from a lower-quality intermediate
Compared to approaches like VoxCPM’s continuous latent space modeling, LuxTTS trades some of the architectural sophistication for raw accessibility. The goal is voice cloning that anyone can run locally, not pushing the ceiling of naturalness on high-end hardware.
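To build intuition for why few-step generation is cheap, here is a toy Euler sampler — not the actual LuxTTS/ZipVoice model, just a sketch of the general flow-matching idea: generation is an ODE integrated from noise to data, so compute scales linearly with the number of steps, and a well-trained velocity field needs very few of them.

```python
import numpy as np

# Toy illustration: the sampler's loop runs num_steps times, so cost
# scales linearly with the step count.
def euler_sample(velocity_fn, x0, num_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with Euler steps."""
    x, dt = x0, 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)
    return x

# With a constant velocity field pointing from start to target, even
# num_steps=4 lands exactly on the target:
x0 = np.zeros(3)
target = np.ones(3)
out = euler_sample(lambda x, t: target - x0, x0, num_steps=4)
print(out)  # [1. 1. 1.]
```

Real velocity fields are curved, which is why a handful of steps (rather than one) is the practical sweet spot.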
Running It Locally
Install:
```shell
git clone https://github.com/ysharma3501/LuxTTS.git
cd LuxTTS
pip install -r requirements.txt
```
Basic voice clone:
```python
from zipvoice.luxvoice import LuxTTS
import soundfile as sf

# GPU
lux_tts = LuxTTS('YatharthS/LuxTTS', device='cuda')
# CPU
# lux_tts = LuxTTS('YatharthS/LuxTTS', device='cpu', threads=2)
# Mac (MPS)
# lux_tts = LuxTTS('YatharthS/LuxTTS', device='mps')

text = "Whatever you want the cloned voice to say."
prompt_audio = 'your_reference.wav'  # 3+ seconds

encoded_prompt = lux_tts.encode_prompt(prompt_audio, rms=0.01)
final_wav = lux_tts.generate_speech(text, encoded_prompt, num_steps=4)

# Output is a torch tensor; squeeze and save as a 48kHz wav
final_wav = final_wav.numpy().squeeze()
sf.write('output.wav', final_wav, 48000)
```
Key parameters:
- `num_steps` — 3–4 is the efficiency sweet spot; higher improves quality but slows generation
- `t_shift` — higher values can improve naturalness but may hurt word accuracy
- `rms` — output volume (0.01 recommended)
- `return_smooth=True` — helps with metallic artifacts if you hear them
- `ref_duration` — how much of the reference audio to use; set lower to speed up inference
No HuggingFace account required for basic use. Model downloads automatically on first run.
Try It Without Installing Anything
- HuggingFace Spaces: huggingface.co/spaces/YatharthS/LuxTTS
- Google Colab: colab notebook
Both are free and require no local setup.
Where It Fits
LuxTTS isn’t trying to be the highest-quality TTS model. It’s trying to be the most accessible one that’s still genuinely good.
The 1GB VRAM ceiling means it runs on hardware that couldn’t touch most voice cloning models. The CPU performance means it’s viable in environments where GPU access isn’t guaranteed — edge devices, cheap VMs, developer laptops. The 3-second reference requirement means you don’t need clean studio audio to get a usable clone.
For voice agents, content creation pipelines, accessibility tools, or any application where you need voice cloning without cloud dependency — LuxTTS is the new default to reach for before deciding you need something heavier.
- Repo: github.com/ysharma3501/LuxTTS
- Model: huggingface.co/YatharthS/LuxTTS
- License: MIT
- Try it: HuggingFace Spaces