HolmesGPT is an open-source AI agent for investigating production incidents and finding root causes. It's a CNCF Sandbox project built by Robusta.dev that automatically queries logs, metrics, and traces from your observability stack, analyzes them with an LLM, and produces actionable root cause analysis.

How do I install HolmesGPT?

Install via Homebrew with 'brew tap robusta-dev/homebrew-holmesgpt && brew install holmesgpt' or via pipx with 'pipx install holmesgpt'. Set your LLM API key (e.g., ANTHROPIC_API_KEY) and run 'holmes ask' commands.

What LLM providers does HolmesGPT support?

HolmesGPT supports virtually any LLM provider including OpenAI, Anthropic, Azure, AWS Bedrock, Google Gemini, and local Ollama models. The team recommends Claude Sonnet 4.0 or 4.5 for best results.

What data sources can HolmesGPT query?

HolmesGPT has 30+ integrations including Prometheus, Grafana, Datadog, Loki, Kubernetes API, AlertManager, PagerDuty, and OpsGenie. It uses server-side filtering and JSON tree traversal to handle petabyte-scale observability data.

How does HolmesGPT operator mode work?

In operator mode, HolmesGPT runs as a Kubernetes deployment that continuously monitors alerts. It automatically investigates alerts from AlertManager, PagerDuty, or OpsGenie, then writes findings back to the source or pushes them to Slack.

Can HolmesGPT send alerts to Slack?

Yes. When configured, HolmesGPT fetches alerts from your alerting system, runs a full AI-powered investigation, and posts the root cause analysis with remediation steps directly to your Slack channel.

Is HolmesGPT free to use?

Yes, HolmesGPT is Apache 2.0 licensed and open source. You only pay for the LLM API calls. It fills a gap that commercial AIOps platforms charge thousands per month for.

HolmesGPT: The Open-Source AI Agent That Finds Root Causes Before You Even Notice Something Broke

By Prahlad Menon. Published 2026-03-04.

Picture this: It’s 3 AM, your pager goes off, and you spend the next two hours digging through Prometheus metrics, Grafana dashboards, and Kubernetes pod logs trying to figure out why your payment service is throwing 500s. By the time you find the root cause—a misconfigured node selector—the damage is done.

What if an AI agent could do all that investigation for you, automatically, and send you a Slack message with the full context before you even notice something broke?

That’s exactly what HolmesGPT does.

What is HolmesGPT?

HolmesGPT is an open-source AI agent for investigating production incidents and finding root causes. It’s a CNCF Sandbox project built by Robusta.dev, so it is developed in the open under CNCF governance rather than as a closed, single-vendor product.

The core idea is simple but powerful: instead of you manually querying logs, metrics, and traces across a dozen different tools, HolmesGPT uses an agentic loop to automatically gather context from your observability stack, analyze it with an LLM, and produce actionable root cause analysis.
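To make the "agentic loop" concrete, here is a minimal Python sketch of the pattern. It is not HolmesGPT's code: `fake_llm`, the two tool names, and the canned tool outputs are hypothetical stand-ins, and a real agent would call an actual model and real backends at those points.

```python
from typing import Callable

# Hypothetical tools: each returns a canned string instead of calling a real backend.
TOOLS: dict[str, Callable[[], str]] = {
    "pod_status": lambda: "payment-service-7d9f: CrashLoopBackOff, last state OOMKilled",
    "pod_limits": lambda: "payment-service memory limit: 512Mi",
}


def fake_llm(question: str, evidence: dict[str, str]) -> dict:
    """Stand-in for a model call: request each tool once, then answer."""
    for name in TOOLS:
        if name not in evidence:
            return {"action": "call_tool", "tool": name}
    return {"action": "answer", "text": "Likely root cause: " + "; ".join(evidence.values())}


def investigate(question: str, max_steps: int = 5) -> str:
    """Alternate between model decisions and tool calls until the model answers."""
    evidence: dict[str, str] = {}
    for _ in range(max_steps):
        decision = fake_llm(question, evidence)
        if decision["action"] == "answer":
            return decision["text"]
        tool = decision["tool"]
        evidence[tool] = TOOLS[tool]()  # run the requested tool and keep its output
    return "No conclusion within the step budget"


print(investigate("why is payment-service failing?"))
```

The step budget matters in practice: without it, a model that keeps asking for more data would loop forever.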

How It Actually Works

HolmesGPT uses what they call “toolsets”—deep integrations with your existing monitoring infrastructure. When an alert fires or you ask it a question, the agent:

1. Queries your data sources - Prometheus, Grafana, Datadog, Loki, Kubernetes API, and 30+ other integrations
2. Filters intelligently - Server-side filtering and JSON tree traversal keep large payloads out of the LLM context window
3. Analyzes patterns - The LLM identifies anomalies, correlates events, and traces the failure chain
4. Delivers findings - Results go back to AlertManager, PagerDuty, OpsGenie, Jira, or Slack

The key differentiator? Petabyte-scale data handling. Raw production telemetry is far too large to fit in an LLM context window, so HolmesGPT filters at the data source and sends the model only the slices that matter.
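As a rough illustration of the JSON tree traversal idea, the Python sketch below pulls a few fields out of a bulky API response before anything reaches the model. The `extract` helper, the dotted-path syntax, and the sample response are all made up for this example and are not HolmesGPT's implementation. Server-side filtering, which would shrink the response before it is even returned, is not shown.

```python
import json


def extract(node, path):
    """Walk a list of keys through nested dicts/lists; '*' fans out over list items."""
    if not path:
        return node
    key, rest = path[0], path[1:]
    if key == "*":
        return [extract(item, rest) for item in node]
    return extract(node[key], rest)


# Made-up API response: two pods carrying bulky fields the model does not need.
response = {
    "items": [
        {"metadata": {"name": "payment-7d9f", "managedFields": ["x"] * 50},
         "status": {"phase": "Running", "reason": "OOMKilled"}},
        {"metadata": {"name": "checkout-1a2b", "managedFields": ["x"] * 50},
         "status": {"phase": "Running", "reason": None}},
    ]
}

names = extract(response, "items.*.metadata.name".split("."))
reasons = extract(response, "items.*.status.reason".split("."))

# Keep only pods that report a failure reason.
summary = [{"pod": n, "reason": r} for n, r in zip(names, reasons) if r]

print(json.dumps(summary))
print(len(json.dumps(response)), "->", len(json.dumps(summary)), "characters")
```

The model sees one short record instead of the full response, which is the whole point of filtering before the LLM call.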

Getting Started in 5 Minutes

Installation

The easiest way to install is via Homebrew (Mac/Linux):

brew tap robusta-dev/homebrew-holmesgpt
brew install holmesgpt

Or via pipx:

pipx install holmesgpt

Configure Your LLM Provider

HolmesGPT supports virtually any LLM provider—OpenAI, Anthropic, Azure, AWS Bedrock, Google Gemini, and even local Ollama models. The team recommends Claude Sonnet 4.0 or 4.5 for best results.

export ANTHROPIC_API_KEY="your-api-key"

Run Your First Investigation

holmes ask "what pods are unhealthy and why?" --model="anthropic/claude-sonnet-4-5-20250929"

That’s it. HolmesGPT will automatically query your Kubernetes cluster, gather pod status, events, and logs, then return a clear root cause analysis with specific remediation steps.
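If you would rather trigger the same investigation from a script, a thin Python wrapper around the CLI is enough. This is a sketch that assumes `holmes` is on your PATH and uses only the `ask` subcommand and `--model` flag shown above. The function names `build_ask_command` and `ask` are my own.

```python
import shutil
import subprocess


def build_ask_command(question: str, model: str) -> list[str]:
    """Assemble the argv for `holmes ask`, using only the flag shown in the article."""
    return ["holmes", "ask", question, f"--model={model}"]


def ask(question: str, model: str) -> str:
    """Run the CLI and return its stdout; raises if holmes is missing or exits non-zero."""
    if shutil.which("holmes") is None:
        raise FileNotFoundError("holmes CLI not found on PATH")
    result = subprocess.run(
        build_ask_command(question, model),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


cmd = build_ask_command(
    "what pods are unhealthy and why?",
    "anthropic/claude-sonnet-4-5-20250929",
)
print(cmd)
```

Passing the arguments as a list rather than a single shell string means the question can contain quotes or other special characters without any escaping.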

Connecting to Your Observability Stack

The real power comes when you connect HolmesGPT to your full monitoring stack. Here’s a sample configuration for Prometheus and Grafana Loki:

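# ~/.holmes/config.yaml
toolsets:
  # Metrics: lets the agent query Prometheus
  prometheus:
    enabled: true
    url: "http://prometheus:9090"

  # Logs: lets the agent search Loki
  grafana_loki:
    enabled: true
    url: "http://loki:3100"

  # Cluster state: pods, events, and logs
  kubernetes:
    enabled: true  # Uses your kubeconfig automatically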

Now when you ask “why did the checkout service start failing at 2pm?”, HolmesGPT can correlate metrics spikes, log errors, and Kubernetes events to pinpoint the exact cause.
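The correlation step is easier to picture with a toy example. The Python sketch below merges events from three sources into one timeline around the time the incident started. The events, timestamps, and the `around` helper are invented for illustration. HolmesGPT leaves this reasoning to the LLM and does not use a fixed time window like this.

```python
from datetime import datetime, timedelta

# Made-up events, already normalized to (timestamp, source, message).
events = [
    (datetime(2026, 3, 4, 13, 58), "kubernetes", "Deployment checkout rolled out revision 42"),
    (datetime(2026, 3, 4, 14, 0), "prometheus", "checkout 5xx rate rose above 5%"),
    (datetime(2026, 3, 4, 14, 1), "loki", "checkout: connection refused to payments:8443"),
    (datetime(2026, 3, 4, 9, 30), "loki", "checkout: cache warmed"),
]


def around(events, anchor, window=timedelta(minutes=5)):
    """Return the events within `window` of `anchor`, oldest first."""
    nearby = [e for e in events if abs(e[0] - anchor) <= window]
    return sorted(nearby, key=lambda e: e[0])


incident_start = datetime(2026, 3, 4, 14, 0)
timeline = around(events, incident_start)
for ts, source, message in timeline:
    print(ts.strftime("%H:%M"), source, message)
```

Read in order, the timeline suggests a story (a rollout, then errors, then refused connections) that none of the three sources shows on its own.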

Operator Mode: Continuous Monitoring

For true “find issues before you notice” capability, deploy HolmesGPT as a Kubernetes operator:

helm repo add holmesgpt https://holmesgpt.github.io/holmesgpt
helm install holmesgpt holmesgpt/holmesgpt \
  --set anthropic.apiKey="$ANTHROPIC_API_KEY"

In operator mode, HolmesGPT runs investigations on a schedule, automatically analyzing alerts from AlertManager, PagerDuty, or OpsGenie, then writing findings back to the source—or pushing them to Slack.

Slack Integration

The Slack integration is where this gets really interesting for on-call engineers. When configured, HolmesGPT will:

1. Fetch alerts from your alerting system
2. Run a full AI-powered investigation
3. Post the root cause analysis directly to your Slack channel

You wake up to a Slack message that says “The payment-service pod is failing due to OOM kills. Memory limit is set to 512Mi but the service is consuming 800Mi during peak load. Recommendation: Increase memory limit to 1Gi or investigate the memory leak in the batch processing loop.”
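HolmesGPT posts to Slack for you once configured, and the article does not say which Slack mechanism it uses. Purely to show what such a message looks like on the wire, here is a Python sketch that assumes a Slack incoming webhook, which accepts a JSON body with a `text` field. The webhook URL is a placeholder and `build_request` is a name I made up.

```python
import json
import urllib.request


def build_request(webhook_url: str, finding: str) -> urllib.request.Request:
    """Build (but do not send) a POST that carries the finding as a Slack message."""
    body = json.dumps({"text": f"*Root cause analysis*\n{finding}"}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder URL; a real one comes from your Slack app's incoming-webhook settings.
req = build_request(
    "https://hooks.slack.com/services/EXAMPLE",
    "payment-service is failing due to OOM kills.",
)
print(json.loads(req.data)["text"])
# To actually send it: urllib.request.urlopen(req)
```

Building the request separately from sending it keeps the example safe to run and makes the payload easy to inspect.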

That’s the kind of context that turns a 2-hour incident into a 10-minute fix.

Why This Matters

A claim like “95% less production downtime” sounds like marketing speak, and the exact figure deserves skepticism, but the logic behind it is sound:

Faster MTTR - Automated investigation does the first round of log-diving for you, so you start from a hypothesis instead of a blank terminal
Proactive investigation - Operator mode investigates alerts as they fire, so the analysis is often waiting by the time you pick up the page
Context preservation - The AI synthesizes information across tools that would take you 20+ minutes to correlate manually

For teams running Kubernetes in production, HolmesGPT fills a gap that commercial AIOps platforms charge thousands per month for—and it does it with the transparency and flexibility of open source.

HolmesGPT is Apache 2.0 licensed and actively maintained. If you’re running Kubernetes in production and tired of 3 AM debugging sessions, this is worth the 5 minutes to set up.