site-md: Serve Markdown to AI Agents, HTML to Humans — Same URLs

By Prahlad Menon

Here’s the problem: AI agents handle raw HTML poorly. They burn tokens on nav bars, cookie banners, and script tags just to get at your actual content. Meanwhile, your human visitors need all that chrome.

site-md solves this with content negotiation. The same page comes back as HTML or Markdown depending on who’s asking and how:

GET /docs        → <html>…</html>     (humans)
GET /docs.md     → # Docs …           (agents)
GET /docs        → # Docs …           (Accept: text/markdown)

One-command install. No content duplication. No separate API.

Install

npx site-md

That’s the entire setup. The CLI:

  • Writes middleware.ts (or merges into your existing one)
  • Adds an API route at app/api/site-md/[...path]/route.ts
  • Wraps your next.config with withNextMd

Test it:

curl http://localhost:3000/           # HTML
curl http://localhost:3000/index.md   # Markdown
curl http://localhost:3000/llms.txt   # Site index for LLMs

How Detection Works

A request gets Markdown when any of these match:

| Trigger | Example |
| --- | --- |
| Path ends with .md | /docs.md, /blog/post.md |
| ?format=md query param | /docs?format=md |
| Accept: text/markdown header | Agents negotiating content type |
| Known bot User-Agent | GPTBot, ClaudeBot, Googlebot |
| Path is /llms.txt or /llms-full.txt | Standard LLM index files |

Everything else passes through untouched. Your human visitors never see a difference.
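To make the rules concrete, here is a minimal sketch of the check as a pure function. This is not site-md's actual source: the `IncomingRequest` shape, the `wantsMarkdown` name, and the short bot list are assumptions for illustration, and the real middleware works on Next.js request objects.

```typescript
// Hypothetical request shape for illustration only.
interface IncomingRequest {
  path: string;                  // e.g. "/docs.md"
  query: Record<string, string>; // parsed query parameters
  accept: string;                // raw Accept header ("" if absent)
  userAgent: string;             // raw User-Agent header ("" if absent)
}

// Bot names taken from the table above; a production list would be longer.
const KNOWN_BOTS = ["GPTBot", "ClaudeBot", "Googlebot"];
const LLM_INDEX_PATHS = ["/llms.txt", "/llms-full.txt"];

// Returns true when any one of the five triggers matches.
function wantsMarkdown(req: IncomingRequest): boolean {
  const ua = req.userAgent.toLowerCase();
  return (
    req.path.endsWith(".md") ||
    req.query["format"] === "md" ||
    req.accept.toLowerCase().includes("text/markdown") ||
    KNOWN_BOTS.some((bot) => ua.includes(bot.toLowerCase())) ||
    LLM_INDEX_PATHS.some((p) => p === req.path)
  );
}
```

Because the triggers are OR-ed together, a browser request with none of them set falls through to the normal HTML response.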

Why This Matters Practically

For agent builders

If you’re building agents that consume web content, you already know the pain: Jina Reader (r.jina.ai), Firecrawl, and similar extraction services are all workarounds for the fact that websites serve the same bloated HTML to everyone.

site-md puts the solution on the publisher side. If sites adopt this, your agent can just append .md to any URL and get clean content. No third-party extraction service needed.
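On the agent side, "append .md" needs a little care around query strings and trailing slashes. The helper below is a sketch under assumptions: `toMarkdownUrl` is a hypothetical name, it expects an absolute URL or a path starting with "/", and mapping every trailing-slash URL to `index.md` generalizes from the `/` → `/index.md` example earlier, which site-md may or may not do for nested directories.

```typescript
// Rewrites a page URL to its assumed Markdown twin.
function toMarkdownUrl(pageUrl: string): string {
  // Split off the query string and fragment so ".md" lands on the path.
  const cut = pageUrl.search(/[?#]/);
  const path = cut === -1 ? pageUrl : pageUrl.slice(0, cut);
  const rest = cut === -1 ? "" : pageUrl.slice(cut);

  if (path.endsWith(".md")) return pageUrl; // already a Markdown URL

  // A bare origin such as "https://example.com" has no path at all.
  const schemeEnd = path.indexOf("://");
  const hostStart = schemeEnd === -1 ? 0 : schemeEnd + 3;
  if (path.indexOf("/", hostStart) === -1) return path + "/index.md" + rest;

  // Root or directory-style URL (assumption: served as index.md).
  if (path.endsWith("/")) return path + "index.md" + rest;

  return path + ".md" + rest;
}
```

An agent that would rather not rewrite URLs can send an `Accept: text/markdown` header with the original URL instead.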

For site owners

Your content is already being consumed by AI. The question is whether it’s being consumed well. Garbled HTML parsing means:

  • Your docs get misquoted
  • Your product gets misunderstood
  • Your content gets attributed incorrectly

Serving clean Markdown means AI agents represent your content accurately.

For SEO/discovery

The /llms.txt file is an emerging convention (think robots.txt, but for AI agents). It tells agents what content is available and how to access it. site-md generates this automatically from your sitemap.
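For a sense of what such a file contains, here is a rough sketch that renders one from a list of pages, following the general shape of the llms.txt proposal (a title heading, a blockquote summary, then link lists). The `PageEntry` type, the `buildLlmsTxt` function, and the "Pages" section name are assumptions; site-md's real output and its sitemap parsing are not shown here.

```typescript
// Hypothetical page record; in practice these would come from the sitemap.
interface PageEntry {
  title: string;
  url: string;      // URL of the Markdown version of the page
  summary?: string; // optional one-line description
}

// Renders a minimal llms.txt document as a string.
function buildLlmsTxt(siteTitle: string, description: string, pages: PageEntry[]): string {
  const links = pages.map((p) =>
    p.summary ? `- [${p.title}](${p.url}): ${p.summary}` : `- [${p.title}](${p.url})`
  );
  // Blank lines separate the title, the summary, and the link section.
  return [`# ${siteTitle}`, "", `> ${description}`, "", "## Pages", "", ...links, ""].join("\n");
}
```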

Configuration

The defaults work, but you can tune:

import { withNextMd } from "site-md/config";

export default withNextMd(
  { reactStrictMode: true },
  {
    cacheTTL: 600,                        // cache Markdown 10 min
    passthrough: ["/admin/*", "/app/*"],   // never convert these
    stripSelectors: [".cookie-banner"],    // remove from output
    bots: {
      trainingScrapers: "block",          // block GPTBot, Bytespider
      searchCrawlers: "markdown",
      userAgents: "markdown",
    },
    llmsTxt: {
      title: "My Site",
      description: "Public docs for AI consumers",
      sitemapUrl: "/sitemap.xml",
    },
  }
);

The bots config sets a policy per crawler category, so you can serve Markdown to legitimate agents while blocking training scrapers entirely. Granular control over who gets what.
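One way to picture the bots policy is as a lookup from User-Agent to category to action. The sketch below is an assumption-heavy illustration, not site-md's implementation: only GPTBot and Bytespider under trainingScrapers come from the config comment above, the other groupings are guesses, and the `"passthrough"` result for unrecognized clients is this sketch's own label.

```typescript
type BotAction = "markdown" | "block";
type BotCategory = "trainingScrapers" | "searchCrawlers" | "userAgents";

// Illustrative groupings; the real lists inside site-md may differ.
const BOT_CATEGORIES: Record<BotCategory, string[]> = {
  trainingScrapers: ["GPTBot", "Bytespider"],  // named in the config comment above
  searchCrawlers: ["Googlebot", "Bingbot"],    // assumption
  userAgents: ["ChatGPT-User", "Claude-User"], // assumption: agents fetching on a user's behalf
};

// Returns the configured action for the first category whose bot name
// appears in the User-Agent, or "passthrough" for everyone else.
function resolveBotAction(
  userAgent: string,
  policy: Record<BotCategory, BotAction>
): BotAction | "passthrough" {
  const ua = userAgent.toLowerCase();
  for (const category of Object.keys(BOT_CATEGORIES) as BotCategory[]) {
    if (BOT_CATEGORIES[category].some((name) => ua.includes(name.toLowerCase()))) {
      return policy[category];
    }
  }
  return "passthrough";
}
```

Keeping the category lists separate from the policy means changing who gets blocked is a one-line config change rather than an edit to the matching logic.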

The Bigger Picture: Content Negotiation for AI

This is essentially HTTP content negotiation, a pattern from the ’90s, applied to the AI era. The server looks at what the client asks for (and, for known bots, who is asking) and responds appropriately:

  • Accept: text/html → Full rendered page
  • Accept: text/markdown → Clean content
  • Known bot UA → Clean content
  • .md extension → Clean content

It’s elegant because it uses existing web standards rather than inventing new infrastructure.
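Strict content negotiation also weighs the q-values in the Accept header rather than only checking whether text/markdown appears. Here is a small sketch of that comparison; `prefersMarkdown` is a hypothetical helper, it is stricter than the "header present" trigger described earlier, and it deliberately ignores wildcards such as `*/*` so that ordinary browsers keep getting HTML.

```typescript
// Returns true when the Accept header lists text/markdown with a weight
// at least as high as text/html. Wildcard entries are ignored on purpose.
function prefersMarkdown(accept: string): boolean {
  const quality = (wanted: string): number => {
    let best = 0;
    for (const part of accept.split(",")) {
      const [type, ...params] = part.trim().toLowerCase().split(";").map((s) => s.trim());
      if (type !== wanted) continue;
      const q = params.find((p) => p.startsWith("q="));
      const value = q ? Number(q.slice(2)) : 1; // a missing q means weight 1
      if (!Number.isNaN(value)) best = Math.max(best, value);
    }
    return best;
  };
  const md = quality("text/markdown");
  return md > 0 && md >= quality("text/html");
}
```

Ties go to Markdown here, on the reasoning that a client that bothers to list text/markdown at full weight probably wants it.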

How It Compares

| Approach | Publisher effort | Agent effort | Quality |
| --- | --- | --- | --- |
| Raw HTML scraping | None | High (parsing) | Poor |
| Jina Reader / Firecrawl | None | Medium (API call) | Good |
| Separate /api/content | High (build + maintain) | Low | Good |
| site-md | Low (one command) | Low (append .md) | Excellent |

The upshot: minimal publisher effort, minimal agent effort, maximum content fidelity.

Limitations

  • Next.js only: it is middleware-based and tightly coupled to the Next.js App Router. If you’re on Astro, Remix, or plain HTML, you’ll need a different approach.
  • Dynamic content: it works best for content-heavy pages. Highly interactive SPAs won’t produce useful Markdown.
  • Cache invalidation: with cacheTTL set, agents may see content up to that many seconds old (10 minutes in the config above). Set it low for fast-changing pages.

Try It

If you run a Next.js site with docs, a blog, or any content that agents should be able to consume cleanly:

npx site-md --title "Your Site" --description "Docs for humans and machines" --yes

Two new files and a wrapped config. Zero content duplication. Your site now speaks both languages.


site-md on GitHub — MIT licensed, by @yazinsai