Cloudflare's New /crawl Endpoint: Full Website Crawling in One API Call
Web crawling just got significantly easier. Cloudflare added a /crawl endpoint to their Browser Rendering API that lets you crawl entire websites with a single API call — no headless browser infrastructure required.
What It Does
Submit a starting URL, get back every page discovered, rendered and formatted. The endpoint handles:
- Automatic page discovery — follows links, parses sitemaps
- JavaScript rendering — runs a real browser in Cloudflare’s cloud
- Multiple output formats — HTML, Markdown, or structured JSON (via Workers AI)
- Crawl scope controls — depth limits, page limits, URL patterns to include/exclude
```shell
# Start a crawl
curl -X POST 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl' \
  -H 'Authorization: Bearer <apiToken>' \
  -H 'Content-Type: application/json' \
  -d '{"url": "https://docs.example.com/"}'

# Check results
curl -X GET 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl/{job_id}' \
  -H 'Authorization: Bearer <apiToken>'
```
Jobs run asynchronously — you get a job ID and poll for results as pages are processed.
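The submit-then-poll flow can be sketched in a few lines of Python. The endpoint paths match the curl examples above; the response field names checked while polling (`result`, `status`) are assumptions based on Cloudflare's usual API envelope, so verify them against the actual payload:

```python
# Minimal submit-then-poll sketch using only the standard library.
# Field names in the response ("result", "id", "status") are assumed.
import json
import time
import urllib.request

API = "https://api.cloudflare.com/client/v4/accounts/{account}/browser-rendering/crawl"

def crawl_request(url: str, token: str, account: str) -> urllib.request.Request:
    """Build the POST request that starts a crawl job."""
    body = json.dumps({"url": url}).encode()
    return urllib.request.Request(
        API.format(account=account),
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

def poll(job_id: str, token: str, account: str, every: float = 5.0) -> dict:
    """Poll the job endpoint until the crawl reports a terminal status."""
    url = API.format(account=account) + f"/{job_id}"
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    while True:
        with urllib.request.urlopen(req) as resp:
            data = json.load(resp)
        if data.get("result", {}).get("status") in ("completed", "failed"):
            return data
        time.sleep(every)
```

Because pages are processed as they are discovered, you can also read partial results out of each poll response instead of waiting for the terminal status.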
Why This Matters for AI
The obvious use case: RAG pipelines. Instead of:
- Running your own headless browser
- Managing browser pools and concurrency
- Handling JavaScript-heavy sites
- Converting HTML to markdown
- Chunking for embeddings
You now have:
- One API call
- Markdown output ready for chunking
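To go from crawled markdown to embedding-ready chunks, a minimal chunker might look like the sketch below. The fixed-size, paragraph-aware strategy with character overlap is one common choice, not something the endpoint prescribes:

```python
def chunk_markdown(text: str, max_chars: int = 1200, overlap: int = 200) -> list[str]:
    """Split markdown into chunks of at most max_chars, preferring
    paragraph boundaries and carrying a little trailing context forward."""
    paras = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for p in paras:
        if current and len(current) + len(p) + 2 > max_chars:
            chunks.append(current)
            current = current[-overlap:]  # overlap preserves context across chunks
        current = f"{current}\n\n{p}" if current else p
    if current:
        chunks.append(current)
    return chunks
```

Character counts are a crude proxy for tokens; for production use you would size chunks against your embedding model's tokenizer.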
Cloudflare explicitly calls out “training models” and “building RAG pipelines” as target use cases. They’re positioning this for AI workloads.
The Ethics Angle
Here’s what sets this apart from typical scraping tools: it’s a signed agent that respects robots.txt and Cloudflare’s AI Crawl Control by default.
From their announcement:
“…making it easy for developers to comply with website rules, and making it less likely for crawlers to ignore web-owner guidance.”
The crawler self-identifies as a bot and honors crawl-delay directives. It cannot bypass Cloudflare bot detection or captchas. This is intentional — they want to be the “well-behaved” option in a space full of aggressive scrapers.
Incremental Crawling
For recurring crawls (monitoring, keeping RAG indexes fresh), you can use:
- modifiedSince — skip pages that haven’t changed
- maxAge — skip recently fetched pages
This saves time and cost on repeated crawls. Smart for documentation sites that update incrementally.
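Combined in a request body, an incremental crawl might look like this. The parameter names come from the announcement, but the value formats shown (an ISO timestamp for modifiedSince, seconds for maxAge) are assumptions to check against the API reference:

```json
{
  "url": "https://docs.example.com/",
  "modifiedSince": "2025-01-01T00:00:00Z",
  "maxAge": 86400
}
```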
Static Mode
Not every site needs a full browser. Set render: false to fetch static HTML without spinning up a browser — faster and cheaper for static sites.
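A static-mode request is just the flag alongside the URL:

```json
{
  "url": "https://docs.example.com/",
  "render": false
}
```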
Pricing
Available on both Workers Free and Paid plans. The free tier has limits, but it’s enough to experiment.
When to Use This vs. Alternatives
Use Cloudflare /crawl when:
- You need to crawl many pages from a site
- You want markdown output for RAG
- You don’t want to run browser infrastructure
- You need to respect robots.txt (compliance matters)
Use local browser tools when:
- You need fine-grained control over page interactions
- You’re doing authenticated scraping
- You need to bypass bot detection (Cloudflare won’t help you here)
Use traditional HTTP scraping when:
- Sites are static and simple
- You need maximum speed
- You’re scraping at massive scale (cost considerations)
The Bigger Picture
Cloudflare is building an AI infrastructure stack: Workers AI for inference, Vectorize for embeddings, and now Browser Rendering for data ingestion. The pieces fit together — crawl sites, convert to markdown, chunk, embed, store in Vectorize, query with Workers AI.
They’re making the “build a RAG pipeline” path significantly shorter.
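That pipeline shape can be sketched as a small orchestrator. The three stage functions here are hypothetical stand-ins, not real client libraries — you would wire them to the /crawl endpoint, a Workers AI embedding model, and a Vectorize index respectively:

```python
from typing import Callable, Iterable

def build_rag_index(
    crawl: Callable[[str], Iterable[str]],      # start URL -> markdown pages
    embed: Callable[[str], list[float]],        # chunk -> embedding vector
    store: Callable[[str, list[float]], None],  # persist chunk + vector
    start_url: str,
    chunk_size: int = 1200,
) -> int:
    """Crawl, chunk, embed, and store; returns the number of chunks indexed."""
    n = 0
    for page in crawl(start_url):
        # naive fixed-size chunking; swap in a smarter splitter as needed
        for i in range(0, len(page), chunk_size):
            chunk = page[i : i + chunk_size]
            store(chunk, embed(chunk))
            n += 1
    return n
```

Keeping the stages as plain callables makes each one swappable — for instance, replacing the embedding backend without touching crawl or storage logic.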