# RCLI: On-Device Voice AI for Mac — Sub-200ms, No Cloud, Full RAG
Every time you use Siri or Apple Intelligence, your voice goes to a server. Your question travels to a data center, gets processed, comes back. Even the “on-device” parts of Apple Intelligence offload to Private Cloud Compute for heavier tasks.
RCLI does all of it on your Mac. Speech recognition, LLM inference, voice synthesis — one pipeline, your hardware, under 200 milliseconds end-to-end.
## What it does
Open your terminal, run `rcli listen`, and start talking. RCLI:
- Listens — Silero VAD (voice activity detection) catches when you start and stop speaking
- Transcribes — Zipformer streaming STT converts speech to text in real time
- Thinks — a local LLM (your choice) generates a response
- Speaks back — TTS converts the response to audio on-device
The full round-trip — voice in, voice out — happens in under 200ms on Apple Silicon. That’s faster than most cloud voice assistants with a good internet connection, and it requires no internet at all.
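As a mental model of those four stages chained into one round trip, a sketch like the following can help. Every function here is a hypothetical stand-in: RCLI is not a Python library and does not expose this API.

```python
import time

# Hypothetical stage functions -- illustrative stubs, not RCLI's actual API.
def detect_speech(audio_frames):      # Silero VAD: find start/end of utterance
    return audio_frames               # pretend the whole buffer is speech

def transcribe(speech):               # Zipformer streaming STT
    return "turn the volume down"

def generate_response(text):          # local LLM (e.g. Qwen3) via MetalRT
    return "Volume lowered."

def synthesize(text):                 # on-device TTS
    return b"\x00" * 16000            # fake PCM audio

def voice_round_trip(audio_frames):
    """Voice in -> voice out, entirely on-device."""
    t0 = time.perf_counter()
    speech = detect_speech(audio_frames)
    text = transcribe(speech)
    reply = generate_response(text)
    audio_out = synthesize(reply)
    latency_ms = (time.perf_counter() - t0) * 1000
    return audio_out, latency_ms

audio_out, latency_ms = voice_round_trip(b"")
print(f"round trip: {latency_ms:.1f} ms")  # RCLI's target on Apple Silicon: < 200 ms
```

The point of the structure: every stage hands its output directly to the next in-process, so the only latency is compute, not network hops.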
## 38 macOS actions by voice
Beyond conversation, RCLI exposes 38 system actions you can trigger by voice:
```shell
rcli ask "play some jazz on Spotify"
rcli ask "turn the volume down"
rcli ask "open Safari"
rcli ask "what's in my project docs about the Q3 budget?"
```
Spotify control, system volume, app launching, document queries — all spoken, all local.
## Local RAG over your documents
This is the part that makes RCLI genuinely useful for knowledge work. You can ingest your own documents — PDFs, text files, markdown — and ask questions about them by voice:
```shell
rcli ask "what did the contract say about termination clauses?"
rcli ask "summarize my notes from last Tuesday's meeting"
```
RCLI uses hybrid retrieval (semantic + keyword) with ~4ms latency to find relevant chunks, then feeds them to the local LLM for an answer. The entire pipeline — from spoken question to spoken answer, with document retrieval in the middle — happens on your machine. Nothing leaves it.
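The repo doesn't publish its retrieval code, but hybrid retrieval generally means blending a keyword-overlap score with a semantic (embedding similarity) score. A minimal sketch, with toy embeddings and a naive word-overlap function standing in for whatever RCLI actually uses:

```python
import math

def keyword_score(query, doc):
    # Fraction of query terms that appear in the document (toy keyword signal).
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_rank(query, query_vec, chunks, alpha=0.5):
    """chunks: list of (text, embedding). Blend semantic and keyword scores."""
    scored = [
        (alpha * cosine(query_vec, emb) + (1 - alpha) * keyword_score(query, text),
         text)
        for text, emb in chunks
    ]
    return [text for _, text in sorted(scored, reverse=True)]

chunks = [
    ("The contract may be terminated with 30 days notice.", [0.9, 0.1]),
    ("Q3 budget allocates funds to marketing.", [0.1, 0.9]),
]
top = hybrid_rank("termination clauses in the contract", [0.8, 0.2], chunks)[0]
print(top)  # the contract chunk wins on both signals
```

The retrieved chunks then become context for the local LLM, exactly as in a cloud RAG pipeline, just with a local vector and keyword index.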
This is the same hybrid RAG architecture we covered in 7 RAG Patterns for 2026 and the RAG Storage Decision Guide — applied to an on-device voice interface. If you’ve been building RAG pipelines and wondered what the local-first version looks like at low latency, this is it.
## MetalRT: why the numbers are real
The performance claims are backed by a proprietary GPU inference engine: MetalRT, which RunAnywhere built specifically for Apple Silicon's Metal GPU rather than layering on general-purpose frameworks like llama.cpp or Apple MLX.
The benchmarks from the repo:
- STT: 714x faster than real-time — a 10-second audio clip is transcribed in ~14ms
- LLM decode throughput: faster than llama.cpp and Apple MLX on M3 Max
- TTS: real-time factor well below 1.0 — synthesis faster than playback speed
MetalRT requires an M3 or later. On an M1 or M2, RCLI automatically falls back to llama.cpp — you still get the full feature set, just without the MetalRT performance advantage.
For context on what these speeds mean practically: LuxTTS runs at 150x realtime on consumer GPUs. MetalRT's 714x realtime STT is in the same territory: fast enough that transcription is never the bottleneck.
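The two metrics above are simple ratios, worth pinning down because they run in opposite directions. "714x realtime" is audio duration divided by processing time; "real-time factor" (RTF) is the inverse, so lower is better. The TTS timing in the second example is an invented illustration, not a repo benchmark:

```python
def realtime_speedup(audio_seconds, processing_seconds):
    """How many times faster than playback the engine processes audio."""
    return audio_seconds / processing_seconds

def realtime_factor(processing_seconds, audio_seconds):
    """RTF < 1.0 means synthesis finishes before playback would."""
    return processing_seconds / audio_seconds

# The repo's STT claim: 10 s of audio transcribed in ~14 ms
print(round(realtime_speedup(10.0, 0.014)))   # ~714x

# Hypothetical TTS example: rendering 5 s of speech in 0.5 s gives RTF 0.1
print(realtime_factor(0.5, 5.0))
```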
## How the LLM layer actually works — it’s not Ollama
This is the question the README buries. RCLI does not use Ollama or any other general-purpose local LLM server. It has its own inference stack.
The full pipeline under the hood:
| Component | What it uses |
|---|---|
| Voice Activity Detection | Silero VAD |
| Speech-to-Text | Zipformer streaming (real-time) + Whisper / Parakeet (offline) |
| LLM Inference | MetalRT engine running Qwen3, LFM2, or Qwen3.5 |
| Text-to-Speech | MetalRT double-buffered sentence-level synthesis |
| Tool Calling | LLM-native tool call formats (Qwen3, LFM2) |
| Memory | Sliding window conversation history with token-budget trimming |
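The Memory row — a sliding conversation window trimmed to a token budget — is a standard pattern worth making concrete. A minimal sketch, using whitespace word counts as a crude stand-in for a real tokenizer (RCLI's actual tokenizer and budget are not published):

```python
def count_tokens(text):
    # Crude stand-in for a real tokenizer: one token per whitespace word.
    return len(text.split())

def trim_to_budget(history, budget):
    """Keep the most recent turns whose combined token count fits the budget."""
    kept, used = [], 0
    for turn in reversed(history):       # walk newest-first
        cost = count_tokens(turn["text"])
        if used + cost > budget:
            break                        # oldest turns fall off the window
        kept.append(turn)
        used += cost
    return list(reversed(kept))          # restore chronological order

history = [
    {"role": "user", "text": "open Safari"},
    {"role": "assistant", "text": "Opening Safari now"},
    {"role": "user", "text": "play some jazz on Spotify"},
]
print(trim_to_budget(history, budget=8))  # the oldest turn is dropped
```

This is also why the repo's note about clearing context (below) matters: the window keeps as much recent history as fits, and with small models that accumulated history can crowd out the tool-calling instructions.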
The LLMs it ships with:
- Qwen3 — Alibaba’s 1.7B–7B parameter models, strong reasoning for their size, natively support tool calling
- LFM2 — Liquid Foundation Model 2, designed for efficiency on edge hardware
- Qwen3.5 — newer generation, improved instruction following
These are all quantized small models (1–7B range) loaded and run through MetalRT, not through Ollama, llama.cpp (except as fallback on M1/M2), or any external server process. Everything runs in-process.
Why not Ollama? Ollama is a general-purpose local LLM server that works across hardware. MetalRT is hand-optimized for Apple Silicon’s Metal GPU — it exploits specific hardware capabilities (unified memory bandwidth, Metal shader pipelines) that a generic framework can’t fully utilize. That’s where the 714x realtime STT and faster-than-llama.cpp LLM decode come from.
Can you swap in your own model? The TUI has a model browser where you can swap between the supported models (Qwen3, LFM2, Qwen3.5 variants). Arbitrary GGUF models via Ollama are not currently supported — RCLI’s model support is tied to what MetalRT has been optimized for.
The tool calling architecture: The LLM itself decides when to call a macOS action. When you say “play jazz on Spotify,” the LLM generates a structured tool call (play_on_spotify), RCLI intercepts it, and executes it via AppleScript or shell command locally. No separate intent classifier — the LLM handles routing natively using its built-in tool call format.
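That interception step can be sketched as follows. The tool names, the JSON shape, and the action table are all illustrative assumptions; RCLI's actual format follows each model's native tool-call schema:

```python
import json
import subprocess

# Hypothetical action table mapping tool names to AppleScript snippets.
ACTIONS = {
    "play_on_spotify": 'tell application "Spotify" to play',
    "open_app": 'tell application "{app}" to activate',
}

def dispatch(tool_call_json, dry_run=True):
    """Parse a model-emitted tool call and run the matching AppleScript."""
    call = json.loads(tool_call_json)
    script = ACTIONS[call["name"]].format(**call.get("arguments", {}))
    if dry_run:                      # keep this sketch from touching the system
        return script
    subprocess.run(["osascript", "-e", script], check=True)
    return script

# The LLM decides a tool is needed and emits structured output like:
llm_output = '{"name": "open_app", "arguments": {"app": "Safari"}}'
print(dispatch(llm_output))
```

The design choice to note: there is no separate intent-classification model in front of the LLM, so adding an action means adding an entry to the tool table and letting the model's native tool-call training do the routing.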
One practical note from the repo: with small LLMs, accumulated conversation context can degrade tool calling reliability. If voice commands stop working as expected, press X in the TUI to clear context and reset.
## Getting started
Requirements: macOS 13+, Apple Silicon (M1 or later), ~1GB disk for models.
```shell
# Option 1: Homebrew (recommended)
brew tap RunanywhereAI/rcli https://github.com/RunanywhereAI/RCLI.git
brew install rcli
rcli setup   # downloads models (~1GB), one-time

# Option 2: curl
curl -fsSL https://raw.githubusercontent.com/RunanywhereAI/RCLI/main/install.sh | bash

# Interactive TUI (push-to-talk + text input)
rcli

# Continuous listening mode
rcli listen

# One-shot command
rcli ask "open Terminal and create a new window"

# Browse and hot-swap LLMs
rcli metalrt     # manage MetalRT engine
rcli llamacpp    # manage llama.cpp fallback
```
The TUI lets you browse available models and swap between them without restarting. If you want to try a different LLM for a specific task — a smaller/faster model for quick commands, a larger one for document analysis — it’s a few keypresses.
## Privacy: the actual case
The privacy argument for local AI isn’t just ideological — it’s practical for specific use cases:
- Legal documents — asking questions about contracts or case files without sending them to a cloud provider
- Medical notes — querying clinical documents or personal health records locally
- Corporate IP — interrogating internal documents that can’t leave your machine per policy
- Offline work — planes, remote locations, air-gapped environments
RCLI combined with DeepDoc (deep research on local documents) or soul.py (persistent local memory) creates a genuinely useful local AI stack for sensitive knowledge work — no cloud dependency anywhere in the chain.
## Honest limitations
- Model quality ceiling: local 1-7B models can’t match GPT-4 or Claude for complex reasoning. RCLI is fast and private; it’s not the most capable AI you’ve used.
- M3+ for full performance: MetalRT’s speed advantage requires M3 or later. M1/M2 fall back to llama.cpp — still fast, but benchmarks don’t apply.
- macOS only: no Linux or Windows support. Apple Silicon is the whole premise.
- Early stage: the repo is active and moving fast but is pre-1.0. Expect rough edges.
## The bigger picture
The interesting thing about RCLI isn’t any individual feature — it’s what the combination represents: a complete, usable, fast voice AI stack that runs entirely on commodity consumer hardware (a MacBook), costs nothing to run (no API fees), and respects the privacy boundary between your data and the cloud.
A year ago this would have been a research demo. Today it’s `brew install rcli` and a 1GB model download.
Related: LuxTTS — 150x Realtime Voice Cloning, 1GB VRAM · DeepDoc — Deep Research on Local Documents · 7 RAG Patterns in 2026 · RAG Storage Decision Guide · soul.py — Persistent Local Memory · CrawlAI RAG — Turn Any Source Into a Knowledge Base