In early August, OpenAI released GPTBot, a web crawler that collects data from the web for training purposes. Two weeks after its launch, 14.9% of the world’s 100 most popular websites and 9.2% of the top 1000 had blocked it.
According to Originality.ai, an AI and plagiarism checker, the most popular websites are more likely to restrict GPTBot. The number of prominent websites blocking OpenAI’s web crawler is growing by approximately 5% every week. The first popular website to block GPTBot was Reuters. Among the 100 most popular websites, the top three that have followed in Reuters’ footsteps are Amazon, Quora, and Indeed.
Interestingly, websites are far more likely to block OpenAI’s GPTBot than other web crawlers like Common Crawl Bot (CCBot) and Anthropic AI. In fact, none of the top 1000 websites has blocked Anthropic AI, while only 6.77% have disabled CCBot.
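The article does not say how these blocks were detected. OpenAI documents that GPTBot can be turned away with a robots.txt rule, so a check of this kind might look roughly like the sketch below. It is an illustration only, not Originality.ai’s method: the robots.txt contents are made up, `blocked_agents` is a hypothetical helper, and the user-agent tokens `CCBot` and `anthropic-ai` are assumed to be the ones those crawlers announce.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, for illustration only:
# GPTBot is disallowed everywhere, all other crawlers are allowed.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def blocked_agents(robots_txt, agents, url="https://example.com/"):
    """Return the user agents that the given robots.txt forbids from fetching url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is needed.
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


# Assumed user-agent tokens for the three crawlers named in the article.
print(blocked_agents(SAMPLE_ROBOTS_TXT, ["GPTBot", "CCBot", "anthropic-ai"]))
```

Running the same function over the robots.txt files of a list of top sites would give per-crawler block rates like the ones quoted above, assuming every site expresses its block this way.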
The increasing number of the world’s most popular websites disabling OpenAI’s data collection has sparked a debate about the potential accuracy and quality of future ChatGPT versions.
It seems that OpenAI is a bit late to the party with the launch of its proprietary web crawler after using much of the internet’s data to train ChatGPT. Previously, OpenAI relied on Common Crawl Bot’s datasets to train its AI models. It’s unclear why the startup decided to launch its own web crawler now and present sites with a choice to ban data collection.
One reason could be that having a proprietary web crawler would allow it to build its own datasets for training purposes. Some sources speculate that OpenAI has already been doing this without asking sites for permission. However, the increasing number of lawsuits against the company related to unauthorized data usage might have prompted it to handle data collection with greater caution.
Blocking OpenAI’s data collection is a double-edged sword for websites. Currently, they might not be keen on the idea of OpenAI using their data to train ChatGPT and other AI models without compensation.
But if ChatGPT is once again connected to the internet as a browser plugin, it could opt to drive traffic only to GPTBot-friendly sites. In that case, sites that have blocked GPTBot’s data collection could easily be overshadowed by those that have allowed it.