How to Run NVIDIA PersonaPlex Locally: Full-Duplex Voice AI with Character Control
NVIDIA just open-sourced PersonaPlex, a 7B speech-to-speech model that does something no commercial voice API can match: it holds a consistent character while having a real-time, full-duplex conversation. You talk over it, it adapts. You give it a persona, it stays in character. You give it a voice sample, it sounds like that person.
MIT licensed. Runs on a single GPU. Here's how to set it up.
What You're Getting
PersonaPlex isn't a TTS engine or a voice assistant wrapper. It's a single model that simultaneously:
- Listens to your speech in real-time
- Speaks back while you're still talking (full-duplex)
- Maintains a persona defined by a text prompt
- Clones a voice from an audio sample
- Handles interruptions, barge-ins, and overlapping speech naturally
It's built on the Moshi architecture from Kyutai and fine-tuned by NVIDIA on synthetic + real conversation data. The key insight: rather than chaining ASR → LLM → TTS (the way most voice assistants work), PersonaPlex does everything in one pass through a single 7B model. Lower latency, more natural flow.
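The latency argument can be made concrete with rough, purely illustrative numbers (none of these are measured figures for PersonaPlex or any specific pipeline): a cascaded pipeline pays each stage's time-to-first-output in series, while a single speech-to-speech pass pays one.

```python
# Illustrative per-turn latency budgets (made-up round numbers, not benchmarks).
# A cascaded ASR -> LLM -> TTS pipeline serializes three stages before the
# first audio frame; a single speech-to-speech model pays one forward pass.
cascaded_ms = {"asr": 300, "llm_first_token": 400, "tts_first_audio": 300}
single_pass_ms = {"first_audio_frame": 200}

print(sum(cascaded_ms.values()))     # total ms before the caller hears anything
print(sum(single_pass_ms.values()))  # single-model equivalent
```

Even with generous per-stage numbers, the serialized budget dominates, which is the flow difference you hear in conversation.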
Prerequisites
| Requirement | Details |
|---|---|
| GPU | NVIDIA GPU with 16GB+ VRAM (RTX 4090, A100, etc.) |
| CPU fallback | --cpu-offload flag for lower VRAM; pure CPU for offline only |
| OS | Linux (Ubuntu/Debian or Fedora/RHEL) |
| Python | 3.10+ |
| HuggingFace account | Free; needed to accept the model license |
| Disk | ~15GB for model weights |
No NVIDIA GPU? You can rent one on RunPod, Lambda, or Vast.ai for $0.30–$1.50/hr. An A100 40GB instance is ideal.
Step 1: Install System Dependencies
PersonaPlex uses the Opus audio codec for real-time streaming. Install the development library:
# Ubuntu/Debian
sudo apt update && sudo apt install -y libopus-dev git
# Fedora/RHEL
sudo dnf install -y opus-devel git
Step 2: Clone the Repository
git clone https://github.com/NVIDIA/personaplex.git
cd personaplex
Step 3: Set Up Python Environment
Create an isolated environment to avoid dependency conflicts:
python -m venv venv
source venv/bin/activate
pip install --upgrade pip
Install PersonaPlex (it's packaged as moshi):
pip install moshi/.
For Blackwell GPUs (RTX 5090, B100, etc.): You need the CUDA 13.0 PyTorch build:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130
For CPU offloading (if your GPU has less than 16GB VRAM):
pip install accelerate
Step 4: Get the Model Weights
- Go to nvidia/personaplex-7b-v1 on HuggingFace
- Accept the model license
- Create an access token at huggingface.co/settings/tokens
- Set your token:
export HF_TOKEN=hf_your_token_here
The model downloads automatically on first run (~15GB).
Step 5: Launch the Live Server
This is where it gets fun. One command launches a web UI with real-time voice conversation:
SSL_DIR=$(mktemp -d)
python -m moshi.server --ssl "$SSL_DIR"
The server generates temporary SSL certificates (needed for browser microphone access) and starts listening. You'll see output like:
Access the Web UI directly at https://localhost:8998
Open that URL in your browser, allow microphone access, and start talking.
Low VRAM? Add the offload flag:
SSL_DIR=$(mktemp -d)
python -m moshi.server --ssl "$SSL_DIR" --cpu-offload
Step 6: Try Offline Processing
Don't have a GPU handy for real-time? You can process pre-recorded audio files:
Basic Assistant Mode
HF_TOKEN=hf_your_token \
python -m moshi.offline \
--voice-prompt "NATF2.pt" \
--input-wav "assets/test/input_assistant.wav" \
--seed 42424242 \
--output-wav "output.wav" \
--output-text "output.json"
Customer Service Role
HF_TOKEN=hf_your_token \
python -m moshi.offline \
--voice-prompt "NATM1.pt" \
--text-prompt "$(cat assets/test/prompt_service.txt)" \
--input-wav "assets/test/input_service.wav" \
--seed 42424242 \
--output-wav "output.wav" \
--output-text "output.json"
For CPU-only offline processing, install the CPU PyTorch build and add --cpu-offload.
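If you're batch-processing many files, the same invocation can be assembled programmatically. A minimal sketch that mirrors the flags shown above (run with HF_TOKEN set in the environment; the helper name is my own):

```python
import subprocess

def offline_cmd(input_wav, output_wav, voice="NATF2.pt",
                text_prompt=None, seed=42424242):
    """Build the argv for a moshi.offline run, using the flags shown above."""
    cmd = ["python", "-m", "moshi.offline",
           "--voice-prompt", voice,
           "--input-wav", input_wav,
           "--seed", str(seed),
           "--output-wav", output_wav,
           "--output-text", output_wav.rsplit(".", 1)[0] + ".json"]
    if text_prompt:
        cmd += ["--text-prompt", text_prompt]
    return cmd

cmd = offline_cmd("assets/test/input_assistant.wav", "output.wav")
# subprocess.run(cmd, check=True)  # uncomment to actually run the job
```

Looping this over a directory of WAV files gives you a simple batch pipeline without retyping the command.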
Understanding Voice Prompts
PersonaPlex ships with 18 pre-built voice embeddings:
| Category | Voices | Style |
|---|---|---|
| Natural Female | NATF0, NATF1, NATF2, NATF3 | Conversational, warm |
| Natural Male | NATM0, NATM1, NATM2, NATM3 | Conversational, natural |
| Variety Female | VARF0–VARF4 | Diverse range of tones |
| Variety Male | VARM0–VARM4 | Diverse range of tones |
Use the NAT voices for natural-sounding conversations. The VAR voices offer more character variety. Pass them via --voice-prompt:
--voice-prompt "NATF2.pt" # Natural female voice 2
--voice-prompt "VARM3.pt" # Variety male voice 3
Writing Effective Persona Prompts
The --text-prompt flag is where PersonaPlex really differentiates itself. You define the characterβs role, knowledge, and personality in plain text.
Simple Assistant
You are a wise and friendly teacher. Answer questions or provide advice in a clear and engaging way.
Customer Service Agent
You work for CitySan Services which is a waste management company and your name is Ayelen Lucero. Information: Verify customer name Omar Torres. Current schedule: every other week. Upcoming pickup: April 12th. Compost bin service available for $8/month add-on.
Creative Character
You enjoy having a good conversation. Have a technical discussion about fixing a reactor core on a spaceship to Mars. You are an astronaut on a Mars mission. Your name is Alex. You are already dealing with a reactor core meltdown. Several ship systems are failing, and continued instability will lead to catastrophic failure. You explain what is happening and urgently ask for help thinking through how to stabilize the reactor.
Tips for Better Prompts
- Include specific facts: names, prices, schedules. The model uses these in conversation.
- Set the emotional tone: "urgent," "casual," "empathetic" changes how it speaks.
- Give it constraints: stating what it knows and doesn't know prevents hallucination.
- Start with "You enjoy having a good conversation" for casual/open-ended chats; this phrasing was in the training data and produces the most natural results.
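One way to keep fact-heavy prompts consistent across many agents is to template them from structured data. A sketch in the style of the customer-service example above (the helper and field names are invented for illustration, not part of PersonaPlex):

```python
def service_prompt(company, agent, facts):
    """Assemble a persona prompt in the customer-service style shown above."""
    info = " ".join(f"{k}: {v}." for k, v in facts.items())
    return (f"You work for {company} and your name is {agent}. "
            f"Information: {info}")

prompt = service_prompt(
    "CitySan Services", "Ayelen Lucero",
    {"Verify customer name": "Omar Torres",
     "Current schedule": "every other week",
     "Upcoming pickup": "April 12th"})
print(prompt)
```

Pass the result via --text-prompt; keeping the facts in a dict makes it easy to generate one prompt per customer record.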
Architecture: How It Works
PersonaPlex uses a dual-stream architecture based on Moshi:
 Text Prompt (role) ─────────┐
 Voice Prompt (audio) ───────┤
                             ▼
 Input Audio ──► Mimi ──► Helium LLM backbone ──► Mimi ──► Output Audio
 (your voice)   encoder   (dual-stream decoder)   decoder   (speaker out)
Key design choices:
- Single model: no ASR → LLM → TTS pipeline. Speech in, speech out.
- Neural codec (Mimi): encodes audio into tokens the LLM can process.
- Full-duplex: separate streams for listening and speaking, processed concurrently.
- Helium backbone: the underlying LLM from Kyutai, giving it strong language understanding.
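The full-duplex behavior can be pictured as a per-frame loop: at every step (Mimi frames are roughly 80 ms) the model both ingests a frame of user audio and emits a frame of its own, so listening and speaking overlap. A toy sketch of that idea (an illustration only, not the real moshi API):

```python
# Toy model of a full-duplex step loop: one input frame in, one output
# frame out, every step. The output frame may simply be silence.
def duplex_steps(user_frames, respond):
    """Interleave listening and speaking one frame at a time."""
    out = []
    for t, user_frame in enumerate(user_frames):
        model_frame = respond(t, user_frame)  # model hears AND speaks this step
        out.append(model_frame)
    return out

# Example policy: stay silent while the user is talking (frame == 1),
# speak once they pause (frame == 0).
frames = [1, 1, 0, 0]
print(duplex_steps(frames, lambda t, f: "speak" if f == 0 else "silence"))
```

In the real model the "policy" is the learned dual-stream decoder, which is what lets it handle barge-ins: the input stream keeps flowing even while the output stream is mid-sentence.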
Comparison: PersonaPlex vs Alternatives
| Feature | PersonaPlex | OpenAI Voice | ElevenLabs | Moshi (base) |
|---|---|---|---|---|
| Full-duplex | ✅ | ❌ | ❌ | ✅ |
| Self-hosted | ✅ | ❌ | ❌ | ✅ |
| Persona control | ✅ Text prompt | Limited (system prompt) | ❌ | ❌ |
| Voice cloning | ✅ Audio conditioning | ❌ | ✅ API only | ❌ |
| License | MIT | Proprietary | Proprietary | CC-BY |
| Parameters | 7B | Unknown | N/A | 7B |
| Cost | Free (your GPU) | Per-minute | Per-character | Free |
| Latency | Real-time | Real-time | ~1s | Real-time |
Running on Cloud GPUs
No local GPU? Hereβs the fastest path:
RunPod (recommended)
- Create a pod with the PyTorch 2.x template and an A100 40GB GPU
- SSH in and run the install steps above
- Forward port 8998: ssh -L 8998:localhost:8998 your-pod
- Open https://localhost:8998 in your browser
Google Colab (limited)
Colab's free T4 (16GB) may work with --cpu-offload, but don't expect smooth real-time performance. Better for offline processing.
Troubleshooting
"CUDA out of memory": add --cpu-offload to your command. This moves some layers to system RAM.
Browser says "Not Secure": expected, since the SSL certs are self-signed. Click "Advanced", then "Proceed."
No audio output: check that your browser has microphone permissions. Chrome works best.
Model download fails: verify you accepted the license at huggingface.co/nvidia/personaplex-7b-v1 and that your HF_TOKEN is set correctly.
Blackwell GPU errors: install the CUDA 13.0 PyTorch build with pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130
What to Build With This
PersonaPlex is MIT licensed and commercially ready. Some ideas:
- AI receptionist: give it your business info, let it answer calls
- Language tutor: set the persona as a patient teacher, practice conversation
- Game NPCs: each character gets a unique voice + personality prompt
- Customer service training: simulate difficult customer scenarios
- Podcast co-host: set a personality and have it riff on topics in real time
- Accessibility: voice interfaces for applications that currently require text
The key advantage over API-based solutions: zero marginal cost per conversation. Once you have the GPU, every additional minute of conversation is free.
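As a back-of-the-envelope check on that claim (illustrative prices, not quotes from any provider): a rented GPU at a flat hourly rate beats a hypothetical per-minute voice API once usage passes a break-even point.

```python
gpu_per_hour = 1.00  # rented A100, within the hourly range quoted above
api_per_min = 0.06   # hypothetical per-minute voice API price

# Minutes of conversation per hour at which the rented GPU becomes cheaper:
break_even_min = gpu_per_hour / api_per_min
print(round(break_even_min, 1))  # minutes of talk per GPU-hour
```

At these example prices the GPU wins after roughly 17 minutes of conversation per hour, and an owned GPU wins even sooner since only power is marginal.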
PersonaPlex is MIT licensed on GitHub. Paper: arXiv:2602.06053. Model: nvidia/personaplex-7b-v1 on HuggingFace.