# RCLI: On-Device Voice AI for Mac — Sub-200ms, No Cloud, Full RAG
Every time you use Siri or Apple Intelligence, your voice goes to a server. Your question travels to a data center, gets processed, comes back. Even the “on-device” parts of Apple Intelligence offload to Private Cloud Compute for heavier tasks.
RCLI does all of it on your Mac. Speech recognition, LLM inference, voice synthesis — one pipeline, your hardware, under 200 milliseconds end-to-end.
## What it does
Open your terminal, run `rcli listen`, and start talking. RCLI:
- Listens — Silero VAD (voice activity detection) catches when you start and stop speaking
- Transcribes — Zipformer streaming STT converts speech to text in real time
- Thinks — a local LLM (your choice) generates a response
- Speaks back — TTS converts the response to audio on-device
The full round-trip — voice in, voice out — happens in under 200ms on Apple Silicon. That’s faster than most cloud voice assistants with a good internet connection, and it requires no internet at all.
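As a mental model of those four stages chained into one round trip, a sketch like the following can help. Every function here is a hypothetical stand-in: RCLI is not a Python library and does not expose this API.

```python
import time

# Hypothetical stage functions -- illustrative stubs, not RCLI's actual API.
def detect_speech(audio_frames):      # Silero VAD: find start/end of utterance
    return audio_frames               # pretend the whole buffer is speech

def transcribe(speech):               # Zipformer streaming STT
    return "turn the volume down"

def generate_response(text):          # local LLM (e.g. Qwen3) via MetalRT
    return "Volume lowered."

def synthesize(text):                 # on-device TTS
    return b"\x00" * 16000            # fake PCM audio

def voice_round_trip(audio_frames):
    """Voice in -> voice out, entirely on-device."""
    t0 = time.perf_counter()
    speech = detect_speech(audio_frames)
    text = transcribe(speech)
    reply = generate_response(text)
    audio_out = synthesize(reply)
    latency_ms = (time.perf_counter() - t0) * 1000
    return audio_out, latency_ms

audio_out, latency_ms = voice_round_trip(b"")
print(f"round trip: {latency_ms:.1f} ms")  # RCLI's target on Apple Silicon: < 200 ms
```

The point of the structure: every stage hands its output directly to the next in-process, so the only latency is compute, not network hops.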
## 38 macOS actions by voice
Beyond conversation, RCLI exposes 38 system actions you can trigger by voice:
```shell
rcli ask "play some jazz on Spotify"
rcli ask "turn the volume down"
rcli ask "open Safari"
rcli ask "what's in my project docs about the Q3 budget?"
```
Spotify control, system volume, app launching, document queries — all spoken, all local.
## Local RAG over your documents
This is the part that makes RCLI genuinely useful for knowledge work. You can ingest your own documents — PDFs, text files, markdown — and ask questions about them by voice:
```shell
rcli ask "what did the contract say about termination clauses?"
rcli ask "summarize my notes from last Tuesday's meeting"
```
RCLI uses hybrid retrieval (semantic + keyword) with ~4ms latency to find relevant chunks, then feeds them to the local LLM for an answer. The entire pipeline — from spoken question to spoken answer, with document retrieval in the middle — happens on your machine. Nothing leaves it.
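The repo doesn't publish its retrieval code, but hybrid retrieval generally means blending a keyword-overlap score with a semantic (embedding similarity) score. A minimal sketch, with toy embeddings and a naive word-overlap function standing in for whatever RCLI actually uses:

```python
import math

def keyword_score(query, doc):
    # Fraction of query terms that appear in the document (toy keyword signal).
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_rank(query, query_vec, chunks, alpha=0.5):
    """chunks: list of (text, embedding). Blend semantic and keyword scores."""
    scored = [
        (alpha * cosine(query_vec, emb) + (1 - alpha) * keyword_score(query, text),
         text)
        for text, emb in chunks
    ]
    return [text for _, text in sorted(scored, reverse=True)]

chunks = [
    ("The contract may be terminated with 30 days notice.", [0.9, 0.1]),
    ("Q3 budget allocates funds to marketing.", [0.1, 0.9]),
]
top = hybrid_rank("termination clauses in the contract", [0.8, 0.2], chunks)[0]
print(top)  # the contract chunk wins on both signals
```

The retrieved chunks then become context for the local LLM, exactly as in a cloud RAG pipeline, just with a local vector and keyword index.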
This is the same hybrid RAG architecture we covered in 7 RAG Patterns for 2026 and the RAG Storage Decision Guide — applied to an on-device voice interface. If you’ve been building RAG pipelines and wondered what the local-first version looks like at low latency, this is it.
## MetalRT: why the numbers are real
The performance claims are backed by a proprietary GPU inference engine: MetalRT, which RunAnywhere built specifically for Apple Silicon's Metal GPU rather than layering on general-purpose frameworks like llama.cpp or Apple MLX.
The benchmarks from the repo:
- STT: 714x faster than real-time — a 10-second audio clip is transcribed in ~14ms
- LLM decode throughput: faster than llama.cpp and Apple MLX on M3 Max
- TTS: real-time factor well below 1.0 — synthesis faster than playback speed
MetalRT requires an M3 or later. On an M1 or M2, RCLI automatically falls back to llama.cpp — you still get the full feature set, just without the MetalRT performance advantage.
For context on what these speeds mean practically: LuxTTS runs at 150x realtime on consumer GPUs. MetalRT's 714x realtime STT is in the same territory: fast enough that transcription is never the bottleneck.
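The two metrics above are simple ratios, worth pinning down because they run in opposite directions. "714x realtime" is audio duration divided by processing time; "real-time factor" (RTF) is the inverse, so lower is better. The TTS timing in the second example is an invented illustration, not a repo benchmark:

```python
def realtime_speedup(audio_seconds, processing_seconds):
    """How many times faster than playback the engine processes audio."""
    return audio_seconds / processing_seconds

def realtime_factor(processing_seconds, audio_seconds):
    """RTF < 1.0 means synthesis finishes before playback would."""
    return processing_seconds / audio_seconds

# The repo's STT claim: 10 s of audio transcribed in ~14 ms
print(round(realtime_speedup(10.0, 0.014)))   # ~714x

# Hypothetical TTS example: rendering 5 s of speech in 0.5 s gives RTF 0.1
print(realtime_factor(0.5, 5.0))
```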
## How the LLM layer actually works — it’s not Ollama
This is the question the README buries. RCLI does not use Ollama or any other general-purpose local LLM server. It has its own inference stack.
The full pipeline under the hood:
| Component | What it uses |
|---|---|
| Voice Activity Detection | Silero VAD |
| Speech-to-Text | Zipformer streaming (real-time) + Whisper / Parakeet (offline) |
| LLM Inference | MetalRT engine running Qwen3, LFM2, or Qwen3.5 |
| Text-to-Speech | MetalRT double-buffered sentence-level synthesis |
| Tool Calling | LLM-native tool call formats (Qwen3, LFM2) |
| Memory | Sliding window conversation history with token-budget trimming |
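The Memory row — a sliding conversation window trimmed to a token budget — is a standard pattern worth making concrete. A minimal sketch, using whitespace word counts as a crude stand-in for a real tokenizer (RCLI's actual tokenizer and budget are not published):

```python
def count_tokens(text):
    # Crude stand-in for a real tokenizer: one token per whitespace word.
    return len(text.split())

def trim_to_budget(history, budget):
    """Keep the most recent turns whose combined token count fits the budget."""
    kept, used = [], 0
    for turn in reversed(history):       # walk newest-first
        cost = count_tokens(turn["text"])
        if used + cost > budget:
            break                        # oldest turns fall off the window
        kept.append(turn)
        used += cost
    return list(reversed(kept))          # restore chronological order

history = [
    {"role": "user", "text": "open Safari"},
    {"role": "assistant", "text": "Opening Safari now"},
    {"role": "user", "text": "play some jazz on Spotify"},
]
print(trim_to_budget(history, budget=8))  # the oldest turn is dropped
```

This is also why the repo's note about clearing context (below) matters: the window keeps as much recent history as fits, and with small models that accumulated history can crowd out the tool-calling instructions.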
The LLMs it ships with:
- Qwen3 — Alibaba’s 1.7B–7B parameter models, strong reasoning for their size, natively support tool calling
- LFM2 — Liquid Foundation Model 2, designed for efficiency on edge hardware
- Qwen3.5 — newer generation, improved instruction following
These are all quantized small models (1–7B range) loaded and run through MetalRT, not through Ollama, llama.cpp (except as fallback on M1/M2), or any external server process. Everything runs in-process.
Why not Ollama? Ollama is a general-purpose local LLM server that works across hardware. MetalRT is hand-optimized for Apple Silicon’s Metal GPU — it exploits specific hardware capabilities (unified memory bandwidth, Metal shader pipelines) that a generic framework can’t fully utilize. That’s where the 714x realtime STT and faster-than-llama.cpp LLM decode come from.
Can you swap in your own model? The TUI has a model browser where you can swap between the supported models (Qwen3, LFM2, Qwen3.5 variants). Arbitrary GGUF models via Ollama are not currently supported — RCLI’s model support is tied to what MetalRT has been optimized for.
The tool calling architecture: The LLM itself decides when to call a macOS action. When you say “play jazz on Spotify,” the LLM generates a structured tool call (play_on_spotify), RCLI intercepts it, and executes it via AppleScript or shell command locally. No separate intent classifier — the LLM handles routing natively using its built-in tool call format.
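That interception step can be sketched as follows. The tool names, the JSON shape, and the action table are all illustrative assumptions; RCLI's actual format follows each model's native tool-call schema:

```python
import json
import subprocess

# Hypothetical action table mapping tool names to AppleScript snippets.
ACTIONS = {
    "play_on_spotify": 'tell application "Spotify" to play',
    "open_app": 'tell application "{app}" to activate',
}

def dispatch(tool_call_json, dry_run=True):
    """Parse a model-emitted tool call and run the matching AppleScript."""
    call = json.loads(tool_call_json)
    script = ACTIONS[call["name"]].format(**call.get("arguments", {}))
    if dry_run:                      # keep this sketch from touching the system
        return script
    subprocess.run(["osascript", "-e", script], check=True)
    return script

# The LLM decides a tool is needed and emits structured output like:
llm_output = '{"name": "open_app", "arguments": {"app": "Safari"}}'
print(dispatch(llm_output))
```

The design choice to note: there is no separate intent-classification model in front of the LLM, so adding an action means adding an entry to the tool table and letting the model's native tool-call training do the routing.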
One practical note from the repo: with small LLMs, accumulated conversation context can degrade tool calling reliability. If voice commands stop working as expected, press X in the TUI to clear context and reset.
## Getting started
Requirements: macOS 13+, Apple Silicon (M1 or later), ~1GB disk for models.
```shell
# Option 1: Homebrew (recommended)
brew tap RunanywhereAI/rcli https://github.com/RunanywhereAI/RCLI.git
brew install rcli
rcli setup   # downloads models (~1GB), one-time

# Option 2: curl
curl -fsSL https://raw.githubusercontent.com/RunanywhereAI/RCLI/main/install.sh | bash

# Interactive TUI (push-to-talk + text input)
rcli

# Continuous listening mode
rcli listen

# One-shot command
rcli ask "open Terminal and create a new window"

# Browse and hot-swap LLMs
rcli metalrt     # manage MetalRT engine
rcli llamacpp    # manage llama.cpp fallback
```
The TUI lets you browse available models and swap between them without restarting. If you want to try a different LLM for a specific task — a smaller/faster model for quick commands, a larger one for document analysis — it’s a few keypresses.
## Privacy: the actual case
The privacy argument for local AI isn’t just ideological — it’s practical for specific use cases:
- Legal documents — asking questions about contracts or case files without sending them to a cloud provider
- Medical notes — querying clinical documents or personal health records locally
- Corporate IP — interrogating internal documents that can’t leave your machine per policy
- Offline work — planes, remote locations, air-gapped environments
RCLI combined with DeepDoc (deep research on local documents) or soul.py (persistent local memory) creates a genuinely useful local AI stack for sensitive knowledge work — no cloud dependency anywhere in the chain.
## Honest limitations
- Model quality ceiling: local 1-7B models can’t match GPT-4 or Claude for complex reasoning. RCLI is fast and private; it’s not the most capable AI you’ve used.
- M3+ for full performance: MetalRT’s speed advantage requires M3 or later. M1/M2 fall back to llama.cpp — still fast, but benchmarks don’t apply.
- macOS only: no Linux or Windows support. Apple Silicon is the whole premise.
- Early stage: the repo is active and moving fast but is pre-1.0. Expect rough edges.
## The bigger picture
The interesting thing about RCLI isn’t any individual feature — it’s what the combination represents: a complete, usable, fast voice AI stack that runs entirely on commodity consumer hardware (a MacBook), costs nothing to run (no API fees), and respects the privacy boundary between your data and the cloud.
A year ago this would have been a research demo. Today it’s `brew install rcli` and a 1GB model download.
Related: LuxTTS — 150x Realtime Voice Cloning, 1GB VRAM · DeepDoc — Deep Research on Local Documents · 7 RAG Patterns in 2026 · RAG Storage Decision Guide · soul.py — Persistent Local Memory · CrawlAI RAG — Turn Any Source Into a Knowledge Base