Date: 2024-06-26

Reddit is updating its policies in an apparent effort to crack down on AI companies scraping the site for content to train AI models.

Reddit is a popular place for AI companies to scrape, thanks to the large quantity of user-generated content on a vast array of subjects. Reddit has signed a deal with Google allowing the company to use the site’s content, but other companies appear to be continuing their efforts to scrape the site.

The company says it will make changes to address the issue.

“In the coming weeks, we’ll update our Robots Exclusion Protocol (robots.txt file), which gives high-level instructions about how we do and don’t allow Reddit to be crawled by third parties. Along with our updated robots.txt file, we will continue rate-limiting and/or blocking unknown bots and crawlers from accessing reddit.com. This update shouldn’t impact the vast majority of folks who use and enjoy Reddit. Good faith actors – like researchers and organizations such as the Internet Archive – will continue to have access to Reddit content for non-commercial use.”
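For readers unfamiliar with how a robots.txt file works in practice, the following is a minimal Python sketch of how a well-behaved crawler might check the rules before fetching a page. The rules, the bot names (`ExampleArchiveBot`, `UnknownScraper`), and the `can_crawl` helper are hypothetical and chosen purely for illustration; they are not Reddit's actual robots.txt, which is not reproduced here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; NOT Reddit's actual robots.txt.
# One named crawler is allowed everywhere, every other crawler is disallowed.
EXAMPLE_ROBOTS_TXT = """\
User-agent: ExampleArchiveBot
Allow: /

User-agent: *
Disallow: /
"""


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://www.example.com/r/test/"
    for agent in ("ExampleArchiveBot", "UnknownScraper"):
        print(agent, can_crawl(EXAMPLE_ROBOTS_TXT, agent, page))
```

Note that robots.txt is advisory: it only affects crawlers that choose to check it, which is presumably why Reddit pairs the update with rate-limiting and blocking of unknown bots.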

Mark Graham, Director, Wayback Machine at Internet Archive, praised Reddit’s position.

“The Internet Archive is grateful that Reddit appreciates the importance of helping to ensure the digital records of our times are archived and preserved for future generations to enjoy and learn from,” said Graham. “Working in collaboration with Reddit we will continue to record and make available archives of Reddit, along with the hundreds of millions of URLs from other sites we archive every day.”

Reddit emphasized that organizations must abide by its policies.