Copyright | 26 June 2024 | Liz Hockley

AI licensor Reddit blocks web ‘crawlers’

Social platform to make changes to robots.txt file | Company says it will take measures against unauthorised ‘large-scale’ access to content.

Reddit has said it intends to change how the platform can be crawled by third parties, telling users it had seen an “uptick in obviously commercial entities who scrape Reddit and argue they are not bound by our terms or policies”.

In an announcement shared on Monday, June 25, Reddit said it would update its Robots Exclusion Protocol (robots.txt) file, which contains instructions telling bots which parts of a website they can and cannot access.
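For readers unfamiliar with the format, a robots.txt file is a plain-text file served at a site's root. A minimal sketch of the kind of policy Reddit describes might allow only named, approved crawlers and disallow everything else (the user-agent name below is hypothetical, not taken from Reddit's actual file):

```
# Hypothetical robots.txt sketch - not Reddit's actual file
# Served at https://example.com/robots.txt

# A named, approved crawler may access everything
User-agent: ApprovedArchiveBot
Allow: /

# All other bots are disallowed site-wide
User-agent: *
Disallow: /
```

Compliance with robots.txt is voluntary on the crawler's part, which is why the file is typically paired with enforcement measures such as rate-limiting and blocking.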

The social media platform said it was selective about who it worked with and trusted with “large-scale access” to its content.

Reddit has entered into several licensing deals with artificial intelligence (AI) companies this year, including a deal with Google, announced in February and worth a reported $60 million a year, and a partnership with OpenAI announced last month.

However, the company told users in May that it was seeing more and more commercial entities using unauthorised access or misusing authorised access to collect public data in bulk, including its own content.

A recent Wired investigation suggested that AI startup Perplexity was “ignoring” the Robots Exclusion Standard to “surreptitiously scrape areas of websites that operators do not want accessed by bots, despite claiming that it won’t”.

“While we will continue our efforts to block known bad actors, we need to do more to restrict access to Reddit public content at scale to trusted actors who have agreed to abide by our policies,” said the social media forum.

It said the change to the robots.txt file would help it “enforce this policy”.

Along with the updated file, Reddit said it would continue rate-limiting—which restricts the number of requests a person or bot can make within a certain timeframe—and blocking unknown bots and crawlers from accessing its website.
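Rate-limiting of the kind described above is commonly implemented as a counter over a fixed time window. The following is a minimal Python sketch of that general technique, not Reddit's actual implementation; the class name and limits are illustrative:

```python
import time

class FixedWindowRateLimiter:
    """Allow at most `limit` requests per `window` seconds.

    Illustrative sketch of a fixed-window rate limiter; real services
    often use variants such as sliding windows or token buckets.
    """

    def __init__(self, limit: int, window: float = 60.0):
        self.limit = limit
        self.window = window
        self.window_start = time.monotonic()
        self.count = 0

    def allow(self) -> bool:
        now = time.monotonic()
        # If the current window has elapsed, start a new one
        if now - self.window_start >= self.window:
            self.window_start = now
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        # Over the limit: the caller should back off
        # (a web server would typically respond with HTTP 429)
        return False

# Usage: permit 3 requests per minute, then refuse further ones
limiter = FixedWindowRateLimiter(limit=3, window=60.0)
results = [limiter.allow() for _ in range(5)]
print(results)  # [True, True, True, False, False]
```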

“Good faith actors” including researchers and organisations such as the Internet Archive would still have access to Reddit content for non-commercial use, the platform said.

Licensing AI training data ‘big business’

Giles Parson, a partner at Browne Jacobson, noted the value of Reddit content for AI companies seeking to train large language models (LLMs) in a recent article for WIPR.

“Reddit is full of content itself but, importantly, it also contains signposts about a substantial amount of good content elsewhere on the internet,” he wrote.

Reddit has warned AI companies that they will face legal action if they are found extracting data from the website without official permission.

“Licensing AI training data has gone from nothing to big business as AI has taken off,” said Parson.

A representative for Reddit said on Tuesday that the social forum would be updating its robots.txt instructions “to be as clear as possible”.

“If you are using an automated agent to access Reddit, you need to abide by our terms and policies, and you need to talk to us.

“We believe in the open internet, but we do not believe in the misuse of public content,” they said.
