Cloudflare's New /crawl Endpoint: Full Website Crawling in One API Call
Web crawling just got significantly easier. Cloudflare added a /crawl endpoint to their Browser Rendering API that lets you crawl entire websites with a single API call — no headless browser infrastructure required.
What It Does
Submit a starting URL, get back every page discovered, rendered and formatted. The endpoint handles:
- Automatic page discovery — follows links, parses sitemaps
- JavaScript rendering — runs a real browser in Cloudflare’s cloud
- Multiple output formats — HTML, Markdown, or structured JSON (via Workers AI)
- Crawl scope controls — depth limits, page limits, URL patterns to include/exclude
```shell
# Start a crawl
curl -X POST 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl' \
  -H 'Authorization: Bearer <apiToken>' \
  -H 'Content-Type: application/json' \
  -d '{"url": "https://docs.example.com/"}'

# Check results
curl -X GET 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl/{job_id}' \
  -H 'Authorization: Bearer <apiToken>'
```
Jobs run asynchronously — you get a job ID and poll for results as pages are processed.
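The submit-then-poll flow can be sketched in a few lines of Python. The endpoint paths match the curl examples above; the response field names checked while polling (`result`, `status`) are assumptions based on Cloudflare's usual API envelope, so verify them against the actual payload:

```python
# Minimal submit-then-poll sketch using only the standard library.
# Field names in the response ("result", "id", "status") are assumed.
import json
import time
import urllib.request

API = "https://api.cloudflare.com/client/v4/accounts/{account}/browser-rendering/crawl"

def crawl_request(url: str, token: str, account: str) -> urllib.request.Request:
    """Build the POST request that starts a crawl job."""
    body = json.dumps({"url": url}).encode()
    return urllib.request.Request(
        API.format(account=account),
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

def poll(job_id: str, token: str, account: str, every: float = 5.0) -> dict:
    """Poll the job endpoint until the crawl reports a terminal status."""
    url = API.format(account=account) + f"/{job_id}"
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    while True:
        with urllib.request.urlopen(req) as resp:
            data = json.load(resp)
        if data.get("result", {}).get("status") in ("completed", "failed"):
            return data
        time.sleep(every)
```

Because pages are processed as they are discovered, you can also read partial results out of each poll response instead of waiting for the terminal status.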
Why This Matters for AI
The obvious use case: RAG pipelines. Instead of:
- Running your own headless browser
- Managing browser pools and concurrency
- Handling JavaScript-heavy sites
- Converting HTML to markdown
- Chunking for embeddings
You now have:
- One API call
- Markdown output ready for chunking
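To go from crawled markdown to embedding-ready chunks, a minimal chunker might look like the sketch below. The fixed-size, paragraph-aware strategy with character overlap is one common choice, not something the endpoint prescribes:

```python
def chunk_markdown(text: str, max_chars: int = 1200, overlap: int = 200) -> list[str]:
    """Split markdown into chunks of at most max_chars, preferring
    paragraph boundaries and carrying a little trailing context forward."""
    paras = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for p in paras:
        if current and len(current) + len(p) + 2 > max_chars:
            chunks.append(current)
            current = current[-overlap:]  # overlap preserves context across chunks
        current = f"{current}\n\n{p}" if current else p
    if current:
        chunks.append(current)
    return chunks
```

Character counts are a crude proxy for tokens; for production use you would size chunks against your embedding model's tokenizer.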
Cloudflare explicitly calls out “training models” and “building RAG pipelines” as target use cases. They’re positioning this for AI workloads.
The Ethics Angle
Here’s what sets this apart from typical scraping tools: it’s a signed agent that respects robots.txt and Cloudflare’s AI Crawl Control by default.
From their announcement:
“…making it easy for developers to comply with website rules, and making it less likely for crawlers to ignore web-owner guidance.”
The crawler self-identifies as a bot and honors crawl-delay directives. It cannot bypass Cloudflare bot detection or captchas. This is intentional — they want to be the “well-behaved” option in a space full of aggressive scrapers.
Incremental Crawling
For recurring crawls (monitoring, keeping RAG indexes fresh), you can use:
- modifiedSince — skip pages that haven’t changed
- maxAge — skip recently fetched pages
This saves time and cost on repeated crawls. Smart for documentation sites that update incrementally.
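Combined in a request body, an incremental crawl might look like this. The parameter names come from the announcement, but the value formats shown (an ISO timestamp for modifiedSince, seconds for maxAge) are assumptions to check against the API reference:

```json
{
  "url": "https://docs.example.com/",
  "modifiedSince": "2025-01-01T00:00:00Z",
  "maxAge": 86400
}
```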
Static Mode
Not every site needs a full browser. Set render: false to fetch static HTML without spinning up a browser — faster and cheaper for static sites.
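A static-mode request is just the flag alongside the URL:

```json
{
  "url": "https://docs.example.com/",
  "render": false
}
```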
Pricing
Available on both Workers Free and Paid plans. The free tier has limits, but it’s enough to experiment.
When to Use This vs. Alternatives
Use Cloudflare /crawl when:
- You need to crawl many pages from a site
- You want markdown output for RAG
- You don’t want to run browser infrastructure
- You need to respect robots.txt (compliance matters)
Use local browser tools when:
- You need fine-grained control over page interactions
- You’re doing authenticated scraping
- You need to bypass bot detection (Cloudflare won’t help you here)
Use traditional HTTP scraping when:
- Sites are static and simple
- You need maximum speed
- You’re scraping at massive scale (cost considerations)
The Bigger Picture
Cloudflare is building an AI infrastructure stack: Workers AI for inference, Vectorize for embeddings, and now Browser Rendering for data ingestion. The pieces fit together — crawl sites, convert to markdown, chunk, embed, store in Vectorize, query with Workers AI.
They’re making the “build a RAG pipeline” path significantly shorter.
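That pipeline shape can be sketched as a small orchestrator. The three stage functions here are hypothetical stand-ins, not real client libraries — you would wire them to the /crawl endpoint, a Workers AI embedding model, and a Vectorize index respectively:

```python
from typing import Callable, Iterable

def build_rag_index(
    crawl: Callable[[str], Iterable[str]],      # start URL -> markdown pages
    embed: Callable[[str], list[float]],        # chunk -> embedding vector
    store: Callable[[str, list[float]], None],  # persist chunk + vector
    start_url: str,
    chunk_size: int = 1200,
) -> int:
    """Crawl, chunk, embed, and store; returns the number of chunks indexed."""
    n = 0
    for page in crawl(start_url):
        # naive fixed-size chunking; swap in a smarter splitter as needed
        for i in range(0, len(page), chunk_size):
            chunk = page[i : i + chunk_size]
            store(chunk, embed(chunk))
            n += 1
    return n
```

Keeping the stages as plain callables makes each one swappable — for instance, replacing the embedding backend without touching crawl or storage logic.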