To AI or not to AI - that is a question that many marketers are asking themselves today. Not only are they wondering if they should use GenAI tools themselves, but they are also confronting the question "Should we allow another company's AI models to be trained on our content?"

Some have argued that traditional search experiences will either be replaced by large language model (LLM)-based information retrieval or at least merge with it. Others argue that traditional search won't be replaced by LLMs (at least for now).

That said, some of the arguments presented by Algolia in their blog post Why ChatGPT won’t replace search engines any time soon have already been countered by technologies like retrieval augmented generation (RAG).

OK, let's assume that traditional SEO-supported search will continue to co-exist with LLM experiences like ChatGPT. If customers are using both, marketers will want their message to appear in both. That means ensuring ChatGPT knows something about your company is a good thing!

It's just like a site that doesn't optimize for SEO and ends up ranking very low in search results. If you're not part of the LLM's model then no one using it will know about you, right?

Well, some content owners feel they need to push back, like the NY Times when they sued OpenAI and Microsoft in late 2023 and some authors who sued Meta, Microsoft, and Bloomberg (yes, Bloomberg is working on its own commercial LLM product).

These groups either want to be compensated for the use of their content or want to prevent LLMs from accessing it entirely. The NY Times argues that OpenAI's and Microsoft's LLM-based services return content that is nearly identical to what the publication hosts on its own website - why would people visit their site and pay for a subscription if they can get the answers from a free version of ChatGPT? Is OpenAI profiting off copyright infringement?

Additionally, the "bots" that scrape websites for the content these LLMs are trained on cost website owners bandwidth and resources. When Google's search crawler visits your site, you assume you'll be compensated in traffic and therefore revenue (assuming you use a modern DXP like Xperience by Kentico so you can capitalize on that engagement 😉). But this isn't guaranteed with LLM services that might just "give" the answer to users - potentially without attribution - meaning there's no reason for those users to visit your site or subscribe to your emails.

These are thorny questions, and they might lead some marketers to question whether they should allow their content to be used for LLM training at all, or at least to limit which AI bot web crawlers can scrape their content.
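Before reaching for a firewall-level block, the simplest opt-out is the robots.txt exclusion protocol. As a minimal sketch - using the GPTBot and ClaudeBot user-agent tokens that OpenAI and Anthropic publish for their crawlers - a site could ask specific AI crawlers to stay out while leaving traditional search crawlers untouched:

```txt
# robots.txt - ask specific AI crawlers not to fetch anything
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

# Everyone else (including traditional search crawlers) remains welcome
User-agent: *
Allow: /
```

Keep in mind that robots.txt is purely advisory - compliance is voluntary, which is exactly the gap firewall-level tools aim to close.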

Cloudflare has recently announced a tool to help website owners block AI bot traffic in their blog post Declare your AIndependence: block AI bots, scrapers and crawlers with a single click.

To help preserve a safe Internet for content creators, we’ve just launched a brand new “easy button” to block all AI bots.

It's interesting to note this feature wasn't only enabled for Cloudflare's enterprise tier customers!

It’s available for all customers, including those on our free tier.

In the feature announcement post, Cloudflare details the data they've gathered about AI bot traffic across their giant network, noting which AI bots make the most requests.

We looked at common AI crawler user agents and aggregated the number of requests on our platform from these AI user agents over the last year [...] When looking at the number of requests made to Cloudflare sites, we see that Bytespider, Amazonbot, ClaudeBot, and GPTBot are the top four AI crawlers. Operated by ByteDance, the Chinese company that owns TikTok, Bytespider is reportedly used to gather training data for its large language models (LLMs), including those that support its ChatGPT rival, Doubao.

They also note which customers receive the most AI bot traffic and which block the most AI bots.

Moreover, the higher-ranked (more popular) an Internet property is, the more likely it is to be targeted by AI bots, and correspondingly, the more likely it is to block such requests.

This seems to be in line with the lawsuit from the NY Times - sites focused on content (which top sites tend to be) potentially have the most to lose and are taking a careful approach.

Cloudflare also notes that some AI bots are not respecting the robots.txt exclusion protocol and are even pretending to be real browsers, so they plan to continue updating their AI bot blocking capabilities to deal with this.
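To see why this matters, consider the naive alternative: matching the User-Agent header against a list of known AI crawler names. As a rough, hypothetical sketch (the signature list here is illustrative, drawn from the crawler names mentioned above):

```python
# Naive AI-bot detection by User-Agent substring matching.
# A bot that spoofs a browser User-Agent sails right past this check,
# which is why behavioral and machine-learning detection is needed.

AI_BOT_SIGNATURES = ("gptbot", "claudebot", "bytespider", "amazonbot")

def is_known_ai_bot(user_agent: str) -> bool:
    """Return True if the User-Agent matches a known AI crawler name."""
    ua = user_agent.lower()
    return any(sig in ua for sig in AI_BOT_SIGNATURES)

# An honestly identified crawler is caught...
print(is_known_ai_bot("Mozilla/5.0; compatible; GPTBot/1.0"))
# ...but one masquerading as a desktop browser is not.
print(is_known_ai_bot("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))
```

A check like this only ever catches bots that identify themselves honestly - the same limitation that makes robots.txt insufficient on its own.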

We fear that some AI companies intent on circumventing rules to access content will persistently adapt to evade bot detection. We will continue to keep watch and add more bot blocks to our AI Scrapers and Crawlers rule and evolve our machine learning models to help keep the Internet a place where content creators can thrive and keep full control over which models their content is used to train or run inference on.

Kentico uses Cloudflare as a web application firewall and content delivery network for Xperience by Kentico solutions running in our SaaS environment. If you are using or are interested in our SaaS service and Cloudflare's AI bot blocking, send us some feedback!

Wrap up

So, where do you stand in all of this? I'm sure, like me, you are using services like ChatGPT and Claude on a regular (if not daily) basis.

Maybe there needs to be a clearer agreement between sites and AI service providers, or a better understanding of the payoff for content owners - the benefit for companies like OpenAI is very clear.

Go read the original Cloudflare feature announcement and decide if you are going to opt out of the AI revolution... at least for your own website.