Cloudflare's New /crawl Endpoint: Full Website Crawling in One API Call

By Prahlad Menon · 3 min read

Web crawling just got significantly easier. Cloudflare added a /crawl endpoint to their Browser Rendering API that lets you crawl entire websites with a single API call — no headless browser infrastructure required.

What It Does

Submit a starting URL, get back every page discovered, rendered and formatted. The endpoint handles:

  • Automatic page discovery — follows links, parses sitemaps
  • JavaScript rendering — runs a real browser in Cloudflare’s cloud
  • Multiple output formats — HTML, Markdown, or structured JSON (via Workers AI)
  • Crawl scope controls — depth limits, page limits, URL patterns to include/exclude

# Start a crawl
curl -X POST 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl' \
  -H 'Authorization: Bearer <apiToken>' \
  -H 'Content-Type: application/json' \
  -d '{"url": "https://docs.example.com/"}'

# Check results
curl -X GET 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl/{job_id}' \
  -H 'Authorization: Bearer <apiToken>'

Jobs run asynchronously — you get a job ID and poll for results as pages are processed.
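A minimal polling sketch. Note that the job-status field name (`status`) and terminal value (`completed`) are assumptions for illustration, not taken from the API docs; check the actual response shape before relying on them.

```shell
# Extract a "status" field from a crawl-job JSON response.
# NOTE: the field name and its values are assumptions, not documented here.
poll_status() {
  printf '%s' "$1" | sed -n 's/.*"status": *"\([^"]*\)".*/\1/p'
}

# Hypothetical loop (token, account ID, and job ID are placeholders):
# STATUS_URL="https://api.cloudflare.com/client/v4/accounts/${ACCOUNT_ID}/browser-rendering/crawl/${JOB_ID}"
# until [ "$(poll_status "$(curl -s -H "Authorization: Bearer $TOKEN" "$STATUS_URL")")" = "completed" ]; do
#   sleep 5
# done
```

In practice you would also want a timeout and backoff rather than a fixed sleep.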

Why This Matters for AI

The obvious use case: RAG pipelines. Instead of:

  1. Running your own headless browser
  2. Managing browser pools and concurrency
  3. Handling JavaScript-heavy sites
  4. Converting HTML to markdown
  5. Chunking for embeddings

You now have:

  1. One API call
  2. Markdown output ready for chunking

Cloudflare explicitly calls out “training models” and “building RAG pipelines” as target use cases. They’re positioning this for AI workloads.

The Ethics Angle

Here’s what sets this apart from typical scraping tools: it crawls as a signed agent that respects robots.txt and Cloudflare’s AI Crawl Control by default.

From their announcement:

“…making it easy for developers to comply with website rules, and making it less likely for crawlers to ignore web-owner guidance.”

The crawler self-identifies as a bot and honors crawl-delay directives. It cannot bypass Cloudflare bot detection or captchas. This is intentional — they want to be the “well-behaved” option in a space full of aggressive scrapers.
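Because the crawler honors robots.txt directives, site owners can throttle or scope it the same way they would any compliant bot. A generic example (the specific user-agent token Cloudflare's crawler announces isn't stated in this article, so this uses a wildcard):

```text
# Generic robots.txt example, wildcard user-agent for illustration.
User-agent: *
Crawl-delay: 10
Disallow: /internal/
```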

Incremental Crawling

For recurring crawls (monitoring, keeping RAG indexes fresh), you can use:

  • modifiedSince — skip pages that haven’t changed
  • maxAge — skip recently fetched pages

This saves time and cost on repeated crawls. Smart for documentation sites that update incrementally.
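A sketch of a request body using these two options. The parameter names come from the article, but the value formats (an RFC 3339 timestamp, a max age in seconds) are assumptions; verify them against the API reference.

```shell
# Write a crawl request body using the incremental options named above.
# Value formats (timestamp string, seconds) are assumptions for illustration.
cat > /tmp/crawl-request.json <<'EOF'
{
  "url": "https://docs.example.com/",
  "modifiedSince": "2025-01-01T00:00:00Z",
  "maxAge": 86400
}
EOF

# Then POST it (token and account ID are placeholders):
# curl -X POST 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl' \
#   -H 'Authorization: Bearer <apiToken>' -H 'Content-Type: application/json' \
#   -d @/tmp/crawl-request.json
```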

Static Mode

Not every site needs a full browser. Set render: false to fetch static HTML without spinning up a browser — faster and cheaper for static sites.
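For example, the same request as the earlier crawl, with rendering disabled (`render` is the flag named above; the rest of the request shape mirrors the first example):

```shell
# Request body with rendering disabled: fetch raw HTML, no headless browser.
BODY='{"url": "https://docs.example.com/", "render": false}'
printf '%s\n' "$BODY"

# curl -X POST 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl' \
#   -H 'Authorization: Bearer <apiToken>' -H 'Content-Type: application/json' \
#   -d "$BODY"
```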

Pricing

Available on both Workers Free and Paid plans. The free tier has limits, but it’s enough to experiment.

When to Use This vs. Alternatives

Use Cloudflare /crawl when:

  • You need to crawl many pages from a site
  • You want markdown output for RAG
  • You don’t want to run browser infrastructure
  • You need to respect robots.txt (compliance matters)

Use local browser tools when:

  • You need fine-grained control over page interactions
  • You’re doing authenticated scraping
  • You need to bypass bot detection (Cloudflare won’t help you here)

Use traditional HTTP scraping when:

  • Sites are static and simple
  • You need maximum speed
  • You’re scraping at massive scale (cost considerations)

The Bigger Picture

Cloudflare is building an AI infrastructure stack: Workers AI for inference, Vectorize for embeddings, and now Browser Rendering for data ingestion. The pieces fit together — crawl sites, convert to markdown, chunk, embed, store in Vectorize, query with Workers AI.

They’re making the “build a RAG pipeline” path significantly shorter.
