LuxTTS: Clone Any Voice in 3 Seconds, Run It on a 4GB GPU

By Prahlad Menon · 3 min read

The voice cloning bar just dropped significantly.

LuxTTS clones a voice from 3 seconds of reference audio, runs at 150x realtime speed on a single GPU, and fits entirely within 1GB of VRAM. On CPU — no GPU at all — it still runs faster than realtime. It outputs at 48kHz, double the 24kHz that most TTS models produce.

The model is built on ZipVoice, is small enough to run in 1GB of VRAM, and the code is MIT licensed.

The Numbers That Matter

Metric                   LuxTTS           Typical TTS
VRAM required            1GB              4–8GB+
Speed (GPU)              150x realtime    10–30x realtime
Speed (CPU)              >1x realtime     Slower than realtime
Output quality           48kHz            24kHz
Reference audio needed   3 seconds        5–30 seconds

Running faster than realtime on CPU is the surprising one. It means LuxTTS is genuinely deployable on any machine — no GPU required, no cloud API, no latency from network calls.
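The realtime claims reduce to simple arithmetic: the realtime factor is the seconds of audio produced per second of compute. A minimal sketch (the function name is illustrative, not part of LuxTTS):

```python
def realtime_factor(audio_seconds, generation_seconds):
    """Seconds of audio produced per second of wall-clock compute."""
    return audio_seconds / generation_seconds

# At 150x realtime, 10 seconds of speech takes roughly 67 ms to generate:
gpu_rtf = realtime_factor(10.0, 10.0 / 150)
# A CPU run that takes 8 s to produce 10 s of audio is still >1x realtime:
cpu_rtf = realtime_factor(10.0, 8.0)
```

Anything above 1.0 means synthesis finishes before playback would, which is what makes streaming on CPU-only machines viable.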

How It Works

LuxTTS is built on ZipVoice, a lightweight diffusion-based TTS architecture designed for efficiency without sacrificing quality. The approach prioritizes:

  • Compact model size — 1GB VRAM is the entire footprint, not a minimum requirement
  • Few-step generation — 3–4 diffusion steps hits the sweet spot of quality vs. speed
  • Direct waveform output at 48kHz — no upsampling from a lower-quality intermediate
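The few-step idea can be sketched generically: a sampler integrates a learned field from noise toward data in a handful of coarse steps, trading a little accuracy for a large speedup. This is a toy Euler integrator, not LuxTTS internals, and the velocity field here is a stand-in for the trained network:

```python
def few_step_sample(x0, velocity, num_steps=4):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with plain Euler steps."""
    x, dt = x0, 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * velocity(x, i * dt)
    return x

# Toy field dx/dt = x: the exact endpoint is e ~ 2.718; 4 coarse steps land nearby.
approx = few_step_sample(1.0, lambda x, t: x, num_steps=4)
```

More steps converge toward the exact trajectory, which is why raising num_steps improves quality at the cost of speed.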

Compared to approaches like VoxCPM’s continuous latent space modeling, LuxTTS trades some of the architectural sophistication for raw accessibility. The goal is voice cloning that anyone can run locally, not pushing the ceiling of naturalness on high-end hardware.

Running It Locally

Install:

git clone https://github.com/ysharma3501/LuxTTS.git
cd LuxTTS
pip install -r requirements.txt

Basic voice clone:

from zipvoice.luxvoice import LuxTTS
import soundfile as sf

# GPU
lux_tts = LuxTTS('YatharthS/LuxTTS', device='cuda')
# CPU
# lux_tts = LuxTTS('YatharthS/LuxTTS', device='cpu', threads=2)
# Mac (MPS)
# lux_tts = LuxTTS('YatharthS/LuxTTS', device='mps')

text = "Whatever you want the cloned voice to say."
prompt_audio = 'your_reference.wav'  # 3+ seconds

encoded_prompt = lux_tts.encode_prompt(prompt_audio, rms=0.01)          # encode the reference voice
final_wav = lux_tts.generate_speech(text, encoded_prompt, num_steps=4)  # 3-4 steps is the sweet spot

final_wav = final_wav.numpy().squeeze()   # torch tensor -> 1-D numpy array
sf.write('output.wav', final_wav, 48000)  # 48kHz output

Key parameters:

  • num_steps — 3–4 is the efficiency sweet spot; higher improves quality but slows generation
  • t_shift — higher values can improve naturalness but may hurt word accuracy
  • rms — output volume (0.01 recommended)
  • return_smooth=True — helps with metallic artifacts if you hear them
  • ref_duration — how much of the reference audio to use; set lower to speed up inference
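The rms parameter presumably targets a fixed loudness for the audio. Under that assumption, RMS normalization looks like the following plain-Python sketch (this is not LuxTTS's actual code, just the standard technique the parameter name suggests):

```python
import math

def normalize_rms(samples, target_rms=0.01):
    """Scale a waveform so its root-mean-square amplitude equals target_rms."""
    current = math.sqrt(sum(s * s for s in samples) / len(samples))
    if current == 0.0:
        return list(samples)  # silence: nothing to scale
    gain = target_rms / current
    return [s * gain for s in samples]

wav = [0.5, -0.5, 0.25, -0.25]
quiet = normalize_rms(wav, target_rms=0.01)
```

Pinning loudness this way keeps generation consistent across reference clips recorded at different volumes.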

No HuggingFace account required for basic use. Model downloads automatically on first run.

Try It Without Installing Anything

Hosted demos are free and require no local setup.

Where It Fits

LuxTTS isn’t trying to be the highest-quality TTS model. It’s trying to be the most accessible one that’s still genuinely good.

The 1GB VRAM ceiling means it runs on hardware that couldn’t touch most voice cloning models. The CPU performance means it’s viable in environments where GPU access isn’t guaranteed — edge devices, cheap VMs, developer laptops. The 3-second reference requirement means you don’t need clean studio audio to get a usable clone.

For voice agents, content creation pipelines, accessibility tools, or any application where you need voice cloning without cloud dependency — LuxTTS is the new default to reach for before deciding you need something heavier.