groovyPost Forums

Enable Cloudflare's new AI Scrapers and Crawlers block



Although I can't find any mention of it on the official Cloudflare blog, a new setting just showed up on all Cloudflare plans that lets you block known AI Scrapers and Crawlers. These are bots that crawl your site to train LLMs.

Cloudflare > Security > Bots > Configure Super Bot Fight Mode > AI Scrapers and Crawlers > On/Off Toggle

[Screenshot: the AI Scrapers and Crawlers toggle under Super Bot Fight Mode]

Looking a bit deeper, it appears what it's actually doing is just adding a new WAF rule configured to block known AI Scrapers.

[Screenshot: the new WAF rule]

Looking at the WAF rule, it's blocking all Verified Bots that have a category of AI Crawler. Pretty straightforward if you want to customize the rule or add it to an existing rule.

[Screenshot: the WAF rule matching Verified Bots in the AI Crawler category]
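
For anyone who wants to fold this into an existing rule, here's a rough Python sketch that only builds the expression string. It assumes the rule filters on the `cf.verified_bot_category` field with the value "AI Crawler", which is how I read the rule in the dashboard. The helper name `build_bot_category_expression` and the `/blog` path condition are made up for illustration, so compare the output against the expression in your own dashboard before using it.

```python
def build_bot_category_expression(categories, extra_condition=None):
    """Build a rule expression that matches verified bots by category.

    Assumption: the field is named cf.verified_bot_category. Verify this
    against the expression shown in your own Cloudflare dashboard.
    """
    if not categories:
        raise ValueError("at least one category is required")
    # One clause per category, joined with "or".
    clauses = [f'(cf.verified_bot_category eq "{name}")' for name in categories]
    expression = " or ".join(clauses)
    # Optionally narrow the match, e.g. to one section of the site.
    if extra_condition:
        expression = f"({expression}) and ({extra_condition})"
    return expression


ai_only = build_bot_category_expression(["AI Crawler"])
blog_only = build_bot_category_expression(
    ["AI Crawler"], 'http.request.uri.path contains "/blog"'
)
print(ai_only)
print(blog_only)
```

The second call shows the idea of narrowing the block to part of the site instead of blocking everywhere.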

Before enabling it, my next question was, "What exactly is Cloudflare blocking?"

Big thanks to Cloudflare for documenting how each Verified Bot is categorized on its Verified Bots page.

The page also has a search box that makes it easy to narrow down which bots are AI Crawlers (as of 6/30/2024).

[Screenshot: the Verified Bots page filtered to AI Crawlers]

Unfortunately, the list only includes the top 3 platforms. Although this will help prevent the continued theft of copyrighted content from Amazon, Google, and OpenAI, it doesn't appear to be blocking other AI crawlers yet.

Anyway, it was a curious find this morning while drinking my coffee...


19 hours ago, shockersh said:

Never know how Google may punish you from an Organic standpoint

Google claims it doesn't... but I'm not sure how much I trust that claim. That being said, it can't really punish a site much more than it already does by scraping all of its content and then feeding it to users without ever sending the traffic to our pages.


All good points. The concern... I get it. Block Google from training on your content, and you could risk being penalized in rankings down the line.

This is something everyone will need to consider when it comes to allowing or blocking search engines and AI companies from crawling your content. Much of this can also be done in the robots.txt file:

User-agent: Google-Extended
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: Omgilibot
Disallow: /

User-agent: Omgili
Disallow: /

User-agent: PerplexityBot
Disallow: /
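
If you want to sanity-check your rules before deploying them, here's a minimal sketch using Python's standard `urllib.robotparser`. It uses a shortened copy of the list above, and the `example.com` URL and the `is_allowed` helper are placeholders I made up. Keep in mind this only tells you what a well-behaved bot should do with your file, not what a bot actually does.

```python
from urllib.robotparser import RobotFileParser

# Shortened copy of the rules above, for illustration only.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url="https://example.com/some-post/"):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for agent in ("ClaudeBot", "PerplexityBot", "Googlebot"):
    verdict = "allowed" if is_allowed(ROBOTS_TXT, agent) else "blocked"
    print(f"{agent}: {verdict}")
```

Googlebot comes back as allowed here because nothing in the file targets it. Only the AI-specific tokens are disallowed.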

Unfortunately, what I'm finding as I dig into my logs is that there are many, many bots from around the globe that just scrape with no regard for your robots.txt file, so hard blocking them is the only option. For example, a bot from Singapore scraped several T from the blog before I chopped it. It's a game of whack-a-mole...
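
To give an idea of the kind of log digging I mean, here's a rough Python sketch that counts hits per bot and IP. It assumes your server writes the common "combined" access log format, where the User-Agent is the last quoted field. The watch list, the sample lines, and the `count_watched_hits` name are all hypothetical. Also remember that user agents are trivially spoofed, so treat the output as a starting point for blocking, not proof.

```python
import re
from collections import Counter

# Hypothetical watch list; matched case-insensitively as substrings.
WATCHED_AGENTS = ("ClaudeBot", "PerplexityBot", "CCBot", "Amazonbot")

# Combined log format: client IP first, User-Agent is the last quoted field.
LINE_RE = re.compile(r'^(?P<ip>\S+) .*"(?P<agent>[^"]*)"\s*$')


def count_watched_hits(log_lines, watched=WATCHED_AGENTS):
    """Count requests per (bot name, client IP) for watched user agents."""
    hits = Counter()
    for line in log_lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # skip lines that don't look like combined-format entries
        agent = match.group("agent").lower()
        for name in watched:
            if name.lower() in agent:
                hits[(name, match.group("ip"))] += 1
                break
    return hits


# Made-up sample lines using documentation-only IP ranges.
SAMPLE_LOG = [
    '203.0.113.7 - - [30/Jun/2024:08:00:01 +0000] "GET /post-1/ HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '203.0.113.7 - - [30/Jun/2024:08:00:02 +0000] "GET /post-2/ HTTP/1.1" 200 4096 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '198.51.100.2 - - [30/Jun/2024:08:00:03 +0000] "GET / HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
    '203.0.113.9 - - [30/Jun/2024:08:00:04 +0000] "GET /robots.txt HTTP/1.1" 200 300 "-" "CCBot/2.0"',
]

sample_hits = count_watched_hits(SAMPLE_LOG)
for (name, ip), count in sample_hits.most_common():
    print(f"{name} from {ip}: {count} request(s)")
```

From there, the noisiest IPs or user agents are the ones to feed into a hard block at the firewall.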


I've been letting this run for a few days now. Unfortunately, it only blocks Google, Amazon, and Apple. It would be nice if it also blocked Perplexity, Claude, etc. Granted, those are the only bots categorized (for now), as I showed in the screenshot.

We'll see how well the others honor the robots.txt file. Unfortunately, I think we know the answer to that...
