LuxTTS: Clone Any Voice in 3 Seconds, Run It on a 4GB GPU

By Prahlad Menon · 3 min read

The voice cloning bar just dropped significantly.

LuxTTS clones a voice from 3 seconds of reference audio, runs at 150x realtime speed on a single GPU, and fits entirely within 1GB of VRAM. On CPU — no GPU at all — it still runs faster than realtime. It outputs at 48kHz, double the 24kHz that most TTS models produce.

The model is built on ZipVoice, is small enough to run in 1GB of VRAM, and the code is MIT licensed.

The Numbers That Matter

Metric                   LuxTTS           Typical TTS
VRAM required            1GB              4–8GB+
Speed (GPU)              150x realtime    10–30x realtime
Speed (CPU)              >1x realtime     Slower than realtime
Output quality           48kHz            24kHz
Reference audio needed   3 seconds        5–30 seconds

Running faster than realtime on CPU is the surprising one. It means LuxTTS is genuinely deployable on any machine — no GPU required, no cloud API, no latency from network calls.
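The realtime claims reduce to simple arithmetic: the realtime factor is the seconds of audio produced per second of compute. A minimal sketch (the function name is illustrative, not part of LuxTTS):

```python
def realtime_factor(audio_seconds, generation_seconds):
    """Seconds of audio produced per second of wall-clock compute."""
    return audio_seconds / generation_seconds

# At 150x realtime, 10 seconds of speech takes roughly 67 ms to generate:
gpu_rtf = realtime_factor(10.0, 10.0 / 150)
# A CPU run that takes 8 s to produce 10 s of audio is still >1x realtime:
cpu_rtf = realtime_factor(10.0, 8.0)
```

Anything above 1.0 means synthesis finishes before playback would, which is what makes streaming on CPU-only machines viable.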

How It Works

LuxTTS is built on ZipVoice, a lightweight diffusion-based TTS architecture designed for efficiency without sacrificing quality. The approach prioritizes:

  • Compact model size — 1GB VRAM is the entire footprint, not a minimum requirement
  • Few-step generation — 3–4 diffusion steps hits the sweet spot of quality vs. speed
  • Direct waveform output at 48kHz — no upsampling from a lower-quality intermediate
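The few-step idea can be sketched generically: a sampler integrates a learned field from noise toward data in a handful of coarse steps, trading a little accuracy for a large speedup. This is a toy Euler integrator, not LuxTTS internals, and the velocity field here is a stand-in for the trained network:

```python
def few_step_sample(x0, velocity, num_steps=4):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with plain Euler steps."""
    x, dt = x0, 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * velocity(x, i * dt)
    return x

# Toy field dx/dt = x: the exact endpoint is e ~ 2.718; 4 coarse steps land nearby.
approx = few_step_sample(1.0, lambda x, t: x, num_steps=4)
```

More steps converge toward the exact trajectory, which is why raising num_steps improves quality at the cost of speed.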

Compared to approaches like VoxCPM’s continuous latent space modeling, LuxTTS trades some of the architectural sophistication for raw accessibility. The goal is voice cloning that anyone can run locally, not pushing the ceiling of naturalness on high-end hardware.

Running It Locally

Install:

git clone https://github.com/ysharma3501/LuxTTS.git
cd LuxTTS
pip install -r requirements.txt

Basic voice clone:

from zipvoice.luxvoice import LuxTTS
import soundfile as sf

# GPU
lux_tts = LuxTTS('YatharthS/LuxTTS', device='cuda')
# CPU
# lux_tts = LuxTTS('YatharthS/LuxTTS', device='cpu', threads=2)
# Mac (MPS)
# lux_tts = LuxTTS('YatharthS/LuxTTS', device='mps')

text = "Whatever you want the cloned voice to say."
prompt_audio = 'your_reference.wav'  # 3+ seconds

encoded_prompt = lux_tts.encode_prompt(prompt_audio, rms=0.01)          # encode the reference voice
final_wav = lux_tts.generate_speech(text, encoded_prompt, num_steps=4)  # 3-4 steps is the sweet spot

final_wav = final_wav.numpy().squeeze()   # torch tensor -> 1-D numpy array
sf.write('output.wav', final_wav, 48000)  # 48kHz output

Key parameters:

  • num_steps — 3–4 is the efficiency sweet spot; higher improves quality but slows generation
  • t_shift — higher values can improve naturalness but may hurt word accuracy
  • rms — output volume (0.01 recommended)
  • return_smooth=True — helps with metallic artifacts if you hear them
  • ref_duration — how much of the reference audio to use; set lower to speed up inference
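The rms parameter presumably targets a fixed loudness for the audio. Under that assumption, RMS normalization looks like the following plain-Python sketch (this is not LuxTTS's actual code, just the standard technique the parameter name suggests):

```python
import math

def normalize_rms(samples, target_rms=0.01):
    """Scale a waveform so its root-mean-square amplitude equals target_rms."""
    current = math.sqrt(sum(s * s for s in samples) / len(samples))
    if current == 0.0:
        return list(samples)  # silence: nothing to scale
    gain = target_rms / current
    return [s * gain for s in samples]

wav = [0.5, -0.5, 0.25, -0.25]
quiet = normalize_rms(wav, target_rms=0.01)
```

Pinning loudness this way keeps generation consistent across reference clips recorded at different volumes.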

No HuggingFace account required for basic use. Model downloads automatically on first run.

Try It Without Installing Anything

Hosted demos are free and require no local setup.

Where It Fits

LuxTTS isn’t trying to be the highest-quality TTS model. It’s trying to be the most accessible one that’s still genuinely good.

The 1GB VRAM ceiling means it runs on hardware that couldn’t touch most voice cloning models. The CPU performance means it’s viable in environments where GPU access isn’t guaranteed — edge devices, cheap VMs, developer laptops. The 3-second reference requirement means you don’t need clean studio audio to get a usable clone.

For voice agents, content creation pipelines, accessibility tools, or any application where you need voice cloning without cloud dependency — LuxTTS is the new default to reach for before deciding you need something heavier.