Enable Cloudflare's new AI Scrapers and Crawlers block


Steve Krause

Although I can't find a mention of it on the official Cloudflare blog, a new setting just showed up for all Cloudflare plans that allows you to block known AI scrapers and crawlers. These are bots that crawl your site to gather training data for LLMs.

Cloudflare > Security > Bots > Configure Super Bot Fight Mode > AI Scrapers and Crawlers > On/Off Toggle

[Screenshot: the AI Scrapers and Crawlers toggle in Super Bot Fight Mode settings]

Looking a bit deeper, it appears that all it's actually doing is adding a new WAF rule configured to block known AI scrapers.

[Screenshot: the auto-generated WAF rule blocking AI scrapers]

Looking at the WAF rule, it's blocking all Verified Bots that have a category of AI Crawler. Pretty straightforward if you want to customize the rule or add it to an existing one.
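For reference, the generated rule appears to boil down to a single expression in Cloudflare's Rules language. A minimal sketch, assuming the cf.verified_bot_category field shown in the rule builder is what it matches on:

(cf.verified_bot_category eq "AI Crawler")

with the action set to Block. If you'd rather manage everything in one custom rule, you could append that clause to an existing expression with "or".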

[Screenshot: the WAF rule expression matching Verified Bots with the AI Crawler category]

Before enabling it, my next question was, "What exactly is Cloudflare blocking?"

Big thanks to Cloudflare for documenting how each Verified Bot is categorized on its Verified Bots page. The page also has a search box that makes it easy to narrow down which bots are AI Crawlers (as of 6/30/2024).

[Screenshot: the Verified Bots page filtered to the AI Crawler category]

Unfortunately, the list only includes three of the big platforms. Although this will help prevent the continued theft of copyrighted content by Amazon, Google, and OpenAI, it doesn't appear to block other AI crawlers yet.

Anyway, it's a curious finding this morning while drinking my coffee....


19 hours ago, shockersh said:

Never know how Google may punish you from an organic standpoint.

Google claims it doesn't... but I'm not sure how much I trust that claim. That being said, it can't really punish a site much more than it already does by scraping all of its content and then feeding it to users without ever sending the traffic to our pages.
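Worth noting, per Google's own documentation: Google-Extended isn't a separate crawler but a control token that Googlebot honors, so disallowing it opts your content out of Gemini/Vertex AI training without affecting search crawling or indexing. In robots.txt terms, a minimal sketch:

User-agent: Google-Extended
Disallow: /

User-agent: Googlebot
Allow: /

The Googlebot stanza is only there to make the contrast explicit; omitting it has the same effect, since Googlebot is unaffected by the Google-Extended rule.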


All good points. The concern... I get it: block Google from training on your content, and you could risk being penalized in rankings down the line.

This is something everyone will need to consider when it comes to allowing or blocking search engines and AI companies from crawling your content. Much of this can also be done in the robots.txt file:

# Opt out of AI training and scraping for known crawlers
User-agent: Google-Extended   # Google AI (Gemini/Vertex) training opt-out
Disallow: /

User-agent: Amazonbot         # Amazon
Disallow: /

User-agent: Applebot-Extended # Apple AI training opt-out
Disallow: /

User-agent: anthropic-ai      # Anthropic (older token)
Disallow: /

User-agent: CCBot             # Common Crawl
Disallow: /

User-agent: ChatGPT-User      # OpenAI (ChatGPT browsing)
Disallow: /

User-agent: ClaudeBot         # Anthropic
Disallow: /

User-agent: Claude-Web        # Anthropic (older token)
Disallow: /

User-agent: FacebookBot       # Meta
Disallow: /

User-agent: Omgilibot         # webz.io
Disallow: /

User-agent: Omgili            # webz.io
Disallow: /

User-agent: PerplexityBot     # Perplexity
Disallow: /

Unfortunately, what I'm finding as I dig into my logs is that there are many, many, many bots from around the globe that just scrape away with no regard for your robots.txt file, so hard-blocking them is the only option. For example, a bot from Singapore scraped several terabytes from the blog before I chopped it. It's a whack-a-mole game....
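For anyone who wants to do the same, here's a rough sketch of a Cloudflare custom WAF rule for that kind of hard block. The user agent string below is a hypothetical placeholder; substitute whatever actually shows up in your logs. The http.user_agent, ip.geoip.country, and cf.client.bot fields are standard in Cloudflare's Rules language, but verify availability on your plan:

(http.user_agent contains "SomeScraperBot") or (ip.geoip.country eq "SG" and not cf.client.bot)

Action: Block. The "not cf.client.bot" clause spares known good crawlers from the country-level match, which is otherwise a blunt instrument.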

