15% Of the World’s Top Sites Block ChatGPT Data Collection

Ivana Shteriova
In early August, OpenAI released GPTBot, a web crawler that collects data from the web for training purposes. Two weeks after its launch, 14.9% of the world’s 100 most popular websites and 9.2% of the top 1000 have blocked it.

According to Originality.ai, an AI and plagiarism checker, the most popular websites are the most likely to restrict GPTBot. The number of prominent websites blocking OpenAI’s web crawler is rising by approximately 5% every week. The first popular website to block GPTBot was Reuters. Among the 100 most popular websites, the top three that have followed in Reuters’ footsteps are Amazon, Quora, and Indeed.

Interestingly, websites are far more likely to block OpenAI’s GPTBot than other web crawlers like Common Crawl Bot (CCBot) and Anthropic AI. In fact, none of the top 1000 websites has blocked Anthropic AI, while only 6.77% have disabled CCBot.
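For readers who want to check a site themselves: crawlers of this kind are typically blocked through a site’s robots.txt file. The sketch below uses Python’s standard urllib.robotparser to list which crawlers a given robots.txt disallows. The sample robots.txt content is hypothetical, and the user-agent tokens "GPTBot", "CCBot", and "anthropic-ai" are assumptions about how each crawler identifies itself, not details taken from this article.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

# Assumed user-agent tokens for the crawlers named in the article.
CRAWLERS = ["GPTBot", "CCBot", "anthropic-ai"]


def blocked_crawlers(robots_txt, crawlers=CRAWLERS, path="/"):
    """Return the crawlers that robots_txt forbids from fetching path."""
    parser = RobotFileParser()
    # parse() accepts an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return [ua for ua in crawlers if not parser.can_fetch(ua, path)]


if __name__ == "__main__":
    print(blocked_crawlers(SAMPLE_ROBOTS_TXT))
```

To check a live site rather than a pasted file, the same parser can be pointed at a URL with set_url() followed by read(). A robots.txt rule is only a request, so this shows what a site asks of crawlers, not what crawlers actually do.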

The increasing number of the world’s most popular websites disabling OpenAI’s data collection has sparked a debate about the potential accuracy and quality of future ChatGPT versions.

It seems that OpenAI is a bit late to the party, launching its proprietary web crawler only after using much of the internet’s data to train ChatGPT. Previously, OpenAI relied on Common Crawl’s datasets to train its AI models. It’s unclear why the startup decided to launch its own web crawler now and give sites the option to block data collection.

One reason could be that having a proprietary web crawler would allow it to build its own datasets for training purposes. Some sources speculate that OpenAI has already been doing this without asking sites for permission. However, the increasing number of lawsuits against the company related to unauthorized data usage might have prompted it to handle data collection with greater caution.

Blocking OpenAI’s data collection is a double-edged sword for websites. Currently, they might not be keen on the idea of OpenAI using their data to train ChatGPT and other AI models without compensation.

But if ChatGPT is once again connected to the internet as a browser plugin, it could opt to drive traffic only to GPTBot-friendly sites. In that case, sites that have blocked GPTBot’s data collection could easily be overshadowed by those that have allowed it.
