I remember reading a CF blog post about crawler separation and responsible AI bot principles where they argue every bot should have one distinct purpose. Now they're building crawling infrastructure themselves, and their own /crawl endpoint lists "training AI systems" as a use case alongside regular crawling. So not only are they in the crawling business now, they're not following the separation principle. To be fair, there's a business logic here. But it's hard not to notice the irony.
https://blog.cloudflare.com/uk-google-ai-crawler-policy/
jasongill
I'm surprised that Cloudflare hasn't started hosting a pre-scraped version of websites that use Cloudflare's proxy - something like https://www.example.com/cdn-cgi/cached-contents.json They already have the website content in their cache, so why not just cut out the middle man of scraping services and API's like this and publish it?
Obviously there's good reasons NOT to, but I am surprised they haven't started offering it (as an "on-by-default" option, naturally) yet.
RamblingCTO
Doesn't work for pages protected by cloudflare in my experience. What a shame, they could've produced the problem and sold the solution.
ljm
Is cloudflare becoming a mob outfit? Because they are selling scraping countermeasures but are now selling scraping too.
And they can pull it off because of their reach over the internet with the free DNS.
shubhamintech
IMO the under-discussed risk here is that sites will start serving different content to verified crawlers vs real users. You're already seeing it with known search bots getting sanitized views. If your agent's context comes from a crawl the site knows is going to an AI, you have no guarantee it matches what a human sees, and that data quality problem won't surface until your agent starts acting on selectively curated information.
This could go wrong on so many levels.
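A crude way to spot-check this kind of UA-based cloaking is to fetch the same URL as a browser and as a declared crawler and compare the responses. A minimal sketch (the URL and User-Agent strings are placeholders; a real check would also render JavaScript and test from verified-crawler IP ranges):

```python
# Spot-check sketch for UA-based cloaking: fetch the same URL with a
# browser-like and a declared-crawler User-Agent and compare the bodies.
import difflib
import urllib.request

def fetch(url: str, user_agent: str) -> str:
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")

def similarity(a: str, b: str) -> float:
    """0..1 ratio; near 1.0 means both clients saw the same content."""
    return difflib.SequenceMatcher(None, a, b).ratio()

def cloaking_check(url: str) -> float:
    human = fetch(url, "Mozilla/5.0 (X11; Linux x86_64) Firefox/128.0")
    bot = fetch(url, "ExampleAICrawler/1.0 (+https://example.com/bot)")
    return similarity(human, bot)
```

A low ratio doesn't prove cloaking (dynamic pages differ between any two fetches), but it's a cheap first signal.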
Lasang
The idea of exposing a structured crawl endpoint feels like a natural evolution of robots.txt and sitemaps.
If more sites provided explicit machine-readable entry points for crawlers, indexing could become a lot less wasteful. Right now crawlers spend a lot of effort rediscovering the same structure over and over.
It also raises interesting questions about whether sites will eventually provide different views for humans vs. automated agents in a more formalized way.
m3047
Seems like it was just hours ago they started reaching out to my edge servers from their address space (Me: why is a reverse proxy service banging my servers when I'm not a customer? did some miscreant sign me up somehow?) and it was for Apple, privacy, mom and pie (a VPN service, dressed in noble aspirations). It never quite smelled like pie to me.
If you're doing threat hunting / risk enumeration, Cloudflare is no longer a passive service that miscreants hide behind, they now actively reach out and grab your privates.
ramblurr
It seems like there's a missed use case: web archiving. I don't see any mention of WARC as an output format. This could be useful to journalists and academics if they had it.
kelvinjps10
Could they collaborate with creators whose websites are behind Cloudflare to allow their content to be accessed via an API in exchange for compensation?
This could be a way to compensate creators and let AI companies access content that's otherwise unreachable because it's protected by Cloudflare.
keeda
If anyone is taking feature requests, could you add an option to return the snapshot as an MHTML with all static assets embedded? (I know this could get inefficient from a storage perspective, but if it really matters you could dedupe assets on your end, which is what my janky homegrown crawler does.)
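The dedupe part is straightforward with a content-addressed store: hash each asset's bytes and store identical assets once, however many snapshots reference them. A sketch (the on-disk layout is illustrative, not any Cloudflare feature):

```python
# Content-addressed asset store: identical bytes are written only once,
# and snapshots reference assets by their hash key.
import hashlib
from pathlib import Path

def store_asset(store: Path, data: bytes) -> str:
    """Write `data` into the store under its SHA-256 key; return the key."""
    key = hashlib.sha256(data).hexdigest()
    path = store / key[:2] / key   # two-char fan-out to keep dirs small
    if not path.exists():          # dedupe: skip if these bytes exist
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)
    return key
```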
arjie
Oh man, I was hoping I could offer a nicely-crawled version of my site. It would be cool if they offered that for site admins. Then everyone who wanted to crawl would just get a thing they could get for pure transfer cost. I suppose I could build one by submitting a crawl job against myself and then offering a `static.` subdomain on each thing that people could access. Then it's pure HTML instant-load.
everfrustrated
Will this crawler be run behind or in front of their bot blocker logic?
devnotes77
Worth noting: origin owners can still detect and block CF Browser Rendering requests if needed.
Workers-originated requests include a CF-Worker header identifying the workers subdomain, which distinguishes them from regular CDN proxying. You can match on this in a WAF rule or origin middleware.
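At the origin, that check is a few lines of middleware. A WSGI sketch (the `CF-Worker` header name is from the point above; the 403 policy is just one choice):

```python
# Block requests carrying a CF-Worker header, which marks
# Workers-originated traffic as distinct from normal CDN proxying.
def block_cf_workers(app):
    def middleware(environ, start_response):
        # WSGI exposes the "CF-Worker" header as HTTP_CF_WORKER
        worker = environ.get("HTTP_CF_WORKER")
        if worker:
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [f"Workers requests from {worker} are not allowed\n".encode()]
        return app(environ, start_response)
    return middleware
```

Equivalently, a WAF rule matching the same header, or a flag-and-log variant if you'd rather measure before blocking.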
The trickier issue: rendered requests originate from Cloudflare ASN 13335 with a low bot score, so if you rely on CF bot scores for content protection, requests through their own crawl product will bypass that check. The practical defense is application-layer rate limiting and behavioral analysis rather than network-level scores -- which is better practice regardless.
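The application-layer limiter can be as simple as a token bucket keyed by whatever identity you trust (API key, session, account) rather than by network score. A minimal sketch with illustrative parameters:

```python
# Per-key token bucket: `rate` tokens/sec refill, up to `burst` tokens.
# Key one bucket per application-level identity, not per IP/bot score.
import time

class TokenBucket:
    def __init__(self, rate: float, burst: float):
        self.rate, self.burst = rate, burst
        self.tokens, self.last = burst, time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # refill proportionally to elapsed time, capped at burst
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```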
The structural conflict is real but similar to search engines offering webmaster tools while running the index. The incentives are misaligned, but the individual products have independent utility. The harder question is whether the combination makes it meaningfully harder to build effective bot protection on top of their platform.
andrethegiant
I tried to make exactly this a year ago. Built on Cloudflare using all of their primitives: https://crawlspace.dev -- It didn't work too well (so don't bother trying it).
stevenhubertron
Queue-It protected pages catch it as well and prevent crawling.
pupppet
Cloudflare getting all the cool toys. AWS, anyone awake over there?
jppope
This is actually really amazing. Cloudflare is just skating to where the puck is going to be on this one.
radicalriddler
Interesting... I built an MCP server for their initial browser-render-as-markdown, and I just tell the LLM to follow reasonable links to relevant content and recurse the tool.
binarymax
Really hard to understand costs here. What is a reasonable pages per second? Should I assume with politeness that I'm basically at 1 page per second == 3600 pages/hour? Seems painfully slow.
skybrian
If two customers crawl the same website and it uses crawl-delay, how does it handle that? Are they independent, or does each one run half as fast?
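One plausible answer on the crawler side is a single per-domain gate shared by all tenants, so the combined request rate honors crawl-delay no matter how many customers crawl the same site. A design sketch only, not how Cloudflare actually implements it:

```python
# Shared per-domain gate: all tenants' fetches to one domain pass
# through one schedule, spacing requests by the site's crawl-delay.
import threading
import time

class DomainGate:
    def __init__(self, crawl_delay: float):
        self.delay = crawl_delay
        self.lock = threading.Lock()
        self.next_slot = 0.0   # monotonic time of the next free fetch slot

    def wait_turn(self) -> None:
        with self.lock:
            now = time.monotonic()
            start = max(now, self.next_slot)
            self.next_slot = start + self.delay   # reserve the next slot
        time.sleep(max(0.0, start - now))
```

Under this design each customer runs "half as fast" when two crawl the same domain; the alternative (independent budgets) would let N customers multiply load by N.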
carloslfu
They have a Pay Per Crawl option for owners. This plus a /crawl endpoint is genius.
triwats
This could be cool: using Cloudflare's edge to monitor endpoints' actual content for synthetic monitoring.
fbrncci
Awesome, so I no longer have to use Firecrawl or my own crawler to scrape entire websites for an agent? Especially when needing residential proxies to do so on Cloudflare protected sites? Why though?
arjunchint
RIP @FireCrawl or at the very least they were the inspiration for this?
radium3d
Instead of "should have been an email" this is "should have been a prompt" and can be run locally instead. There are a number of ways to do this from a linux terminal.
```
write a custom crawler that will crawl every page on a site (internal links to the original domain only), scroll down to mimic a human, and save the output as a WebP screenshot, HTML, Markdown, and structured JSON. Make it designed to run locally in a terminal on a linux machine using headless Google Chrome and take advantage of multiple cores to run multiple pages simultaneously while keeping in mind that it might have to throttle if the server gets hit too fast from the same IP.
```
Might use available open source software such as python, playwright, beautifulsoup4, pillow, aiofiles, trafilatura
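The same-domain link-extraction core of that prompt fits in a stdlib-only sketch (no JS rendering here; a headless browser via Playwright would replace the raw fetch in a real version):

```python
# Extract absolute same-domain links from a page, the core step of a
# "internal links only" BFS crawl. Stdlib only; no JavaScript rendering.
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def internal_links(base_url: str, html: str) -> list[str]:
    """Absolute links in `html` that stay on base_url's host."""
    parser = LinkParser()
    parser.feed(html)
    host = urlparse(base_url).netloc
    out = []
    for href in parser.links:
        absolute = urljoin(base_url, href)
        if urlparse(absolute).netloc == host:
            out.append(absolute.split("#")[0])   # drop fragments
    return out
```

The crawl loop is then a queue seeded with the start URL, a visited set, and a politeness delay between fetches.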
superkuh
Cloudflare are mafiosos. They create the problem and then sell you the solution to themselves.
Normal_gaussian
"Well-behaved bot - Honors robots.txt directives, including crawl-delay"
From the behaviour of our peers, this seems to be the real headline news.
bobpaw
TIL about the Crawl-delay directive. Although it seems that most honest bots move slower and dishonest bots will learn to.
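Incidentally, Python's stdlib already understands the directive via `urllib.robotparser`. A quick sketch (the robots.txt content is invented for the example):

```python
# urllib.robotparser exposes Crawl-delay via crawl_delay().
import urllib.robotparser

rules = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
""".splitlines()

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)
rp.modified()  # mark rules as loaded (may be redundant, but harmless)

delay = rp.crawl_delay("ExampleBot")
allowed = rp.can_fetch("ExampleBot", "https://example.com/ok")
blocked = not rp.can_fetch("ExampleBot", "https://example.com/private/x")
```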
branoco
Interested how this unfolds
ed_mercer
> Honors robots.txt directives, including crawl-delay
Sounds pretty useless for any serious AI company
allixsenos
"Selling the wall and the ladder."
"Biggest betrayal in tech."
"Protection racket."
These hot takes sound smart but they're not.
The web was built to be open and available to everyone. Serving static HTML from disk back in the day, nobody could hurt you because there was nothing to hurt.
We need bot protection now because everything is dynamic, straight from the database with some light caching for hot content. When Facebook decides to recrawl your one million pages in the same instant, you're very much up shit creek without a paddle. A bot that crawls the full site doesn't steal anything, but it does take down the origin server. My clients never call me upset that a bot read their blog posts. They call because the bot knocked the site offline for paying customers.
Bot protection protects availability, not secrecy.
And the real bot problem isn't even crawling. It's automated signups. Fake accounts messaging your users. Bots buying out limited drops before a human can load the page. Like-farming. Credential stuffing. That's what bot protection is actually for: preventing fraud, not preventing someone from reading your public website.
Cloudflare's `/crawl` respects robots.txt. Don't want your content crawled, opt out. But if you want it indexed and can't handle the traffic spike, this gets your content out without hammering production.
As for the folks saying Cloudflare should keep blocking all crawlers forever: AI agents already drive real browsers. They click, scroll, render JavaScript. Go look at what browser automation frameworks can do today and then explain to me how you tell a bot from a person. That distinction is already gone. The hot takes are about a version of the internet that doesn't exist anymore.
babelfish
Didn't they just throw a (very public) fit over Perplexity doing the exact same thing?
8cvor6j844qw_d6
Does this bypass their own anti-AI crawl measures?
I'll need to test it out, especially with the labyrinth.
iranu
Honestly, it feels like Cloudflare is bullying other sites into using their anti-bot services. Great business model: charging owners and devs at the same time. And using AI per page to parse content is reckless.
coreq
The big question here: is this a verified bot on the Cloudflare WAF? Didn't Google get into trouble in Europe for using their search engine user agent and IPs to feed Gemini?
devnotes77
To clarify the two questions raised:
First, the Cloudflare Crawl endpoint does not require the target site to use Cloudflare. It spins up a headless Chrome instance (via the Browser Rendering API) that fetches and renders any publicly accessible URL. You could crawl a site hosted on Hetzner or a bare VPS with the same call.
Second on pricing: Browser Rendering is only available on the Workers Paid plan ($5/month). It is not part of the free tier. Usage is billed per invocation beyond the included quota - the exact limits are in the Cloudflare docs under Browser Rendering pricing, but for archival use cases with moderate crawl rates you are very unlikely to run into meaningful costs.
The practical gotcha for forum archival is pagination and authentication-gated content. If the forum requires a login to see older posts, a headless browser session with saved cookies would help, but that is more complex to orchestrate than a single-shot fetch.
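The cookie-reuse idea can be sketched with the stdlib: export a logged-in session from a browser (Mozilla/Netscape cookie format is an assumption here, as is the file name) and fetch gated pages with it:

```python
# Reuse a saved browser session for login-gated pages: load exported
# cookies and build an opener that sends them with every request.
import http.cookiejar
import urllib.request

def opener_with_cookies(jar: http.cookiejar.CookieJar) -> urllib.request.OpenerDirector:
    """Build an opener that attaches the jar's cookies to requests."""
    return urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

def load_session(cookie_file: str) -> urllib.request.OpenerDirector:
    jar = http.cookiejar.MozillaCookieJar(cookie_file)
    jar.load(ignore_discard=True, ignore_expires=True)  # keep session cookies
    return opener_with_cookies(jar)

# usage (hypothetical file): load_session("forum-cookies.txt").open(page_url)
```

A headless-browser session (e.g. Playwright with a saved storage state) does the same job when the forum needs JavaScript to render.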
1vuio0pswjnm7
Can a CDN be a "walled garden"?
memothon
I've used browser rendering at work and it's quite nice. Most solutions in the crawling space are kind of scummy and designed for side-stepping robots.txt and not being a good citizen. A crawl endpoint is a very necessary addition!
charcircuit
>Honors robots.txt
Is it possible to ignore robots.txt in the case the crawl was triggered by a human?
greatgib
It's all what you'd expect: first they run a huge campaign calling out evil scrapers. "Use our service to make sure your website blocks LLMs and bots from scraping it. Look how bad it is."
And once that's well set up and they have their walled garden, they can present their own API to scrape websites, all nicely packaged for your LLM. But as you know, they are the gatekeeper, so the mafia boss decides the "intermediary" fee it considers proper to let you do what you were doing without an intermediary before.
tjpnz
Do I have the option to fill it with junk for LLMs?
Imustaskforhelp
This might be really great!
After buying https://mirror.forum recently, I had the idea (which I discussed on Discord and in the ArchiveTeam IRC servers) of preserving/mirroring forums, especially tech-related ones [think TinyCoreLinux]. Archive.org is really, really great, but I'd prefer some other efforts within this space as well.
I didn't want to scrape/crawl them myself because it would feel like yet another AI scraping effort and strain developers' resources.
And even when you want to crawl, the issue is that you can't crawl through Cloudflare, sometimes for good reason.
So in my understanding, can I use Cloudflare Crawl to crawl the whole website of a forum, and does this only work for forums that use Cloudflare?
Also, what is the pricing? Is it a standard Cloudflare Worker, so I'd get the free 100k requests and the million-requests-for-a-few-cents (IIRC) pricing for crawling? Considering how scalable Cloudflare is, it might even make more sense than buying a group of cheap VPSes.
One more point: I'd previously thought the best approach would be for maintainers of these forums to give me a periodic backup archive, which my heart believes is the cleanest way. But after discussing it on Linux Discord servers and with archivers in that community, I couldn't find any maintainer of such tech forums who would subscribe to the idea of sharing the forum's public data as a backup for preservation purposes. So if anyone knows of or maintains such a forum, feel free to message me here in this thread about that too.
kordlessagain
Fuck Cloudflare.
rvz
Selling the cure (DDoS protection) and creating the poison (Authorized AI crawling) against their customers.
sourcecodeplz
I love this from CloudFlare!
pqdbr
Off-topic, but I'm having a terrible experience with Cloudflare and would love to know if someone could offer some help.
All of a sudden, about 1/3 of all traffic to our website is being routed via EWR (New York) - me included - even though all our users and our origin servers are in Brazil.
We pay for the Pro plan but support has been of no help: after 20 days of 'debugging' and asking for MTRs and traceroutes, they told us to contact Claro (which is the same as telling me to contact Verizon) because 'it's their fault'.