This protocol will make AI companies pay for training data

By Dwaipayan Roy

Sep 10, 2025

07:27 pm

What's the story

In the wake of Anthropic's $1.5 billion copyright settlement, the artificial intelligence (AI) sector is grappling with its training data dilemma. The issue stems from numerous lawsuits over unlicensed data usage, including one against Midjourney for generating copyrighted images. To head off lawsuits that could cripple the industry, a group of technologists and web publishers has introduced a system called Really Simple Licensing (RSL).

Licensing solution

RSL co-founder discusses the need for licensing

RSL, co-founded by Eckart Walther (co-creator of the RSS standard), aims to create a scalable training data licensing system across the internet.

"We need to have machine-readable licensing agreements for the internet," Walther told TechCrunch.

The protocol lets publishers set specific licensing terms for their content, such as requiring AI firms to obtain a custom license or allowing use under Creative Commons provisions.

Implementation

How the RSL protocol works

The RSL Protocol allows participating websites to include terms in their "robots.txt" file in a prearranged format.

This makes it easy to identify which data falls under which terms.
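As a rough illustration of how a crawler might pick up such terms, the Python sketch below scans a robots.txt body for a line that points to a machine-readable license file. The directive name "License", the sample URL, and the helper find_license_urls are assumptions made for this example; the article does not describe the exact format the RSL Protocol uses.

```python
def find_license_urls(robots_txt: str, directive: str = "license") -> list[str]:
    """Return the value of every `<directive>: <value>` line in a robots.txt body.

    The directive name is an assumption for illustration, not taken from the article.
    """
    urls = []
    for raw_line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        # (Simplification: a '#' inside a URL would also be cut off.)
        line = raw_line.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        # Split only on the first colon so "https://..." stays intact.
        key, value = line.split(":", 1)
        if key.strip().lower() == directive and value.strip():
            urls.append(value.strip())
    return urls


# Hypothetical robots.txt content for a participating site.
SAMPLE = """\
User-agent: *
Allow: /
# hypothetical pointer to machine-readable licensing terms
License: https://example.com/license.xml
"""

print(find_license_urls(SAMPLE))
```

A crawler built this way would fetch the referenced file before ingesting any pages and then decide whether its intended use is covered by the stated terms.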

On the legal side, the RSL team has created a collective licensing organization, the RSL Collective, that can negotiate terms and collect royalties.

This is akin to ASCAP for musicians or MPLC for films.

Collective participation

Major web publishers join RSL Collective

A number of major web publishers have joined the RSL Collective, including Yahoo, Reddit, Medium, O'Reilly Media, Ziff Davis (Mashable and CNET's owner), Internet Brands (WebMD's owner), People Inc, and The Daily Beast.

Some firms like Fastly, Quora, and Adweek are backing the standard without joining the collective.

Notably, some members already have licensing deals in place. Reddit, for example, receives an estimated $60 million a year from Google for its training data.

Adoption hurdles

Challenges in determining royalties for training data

AI models present unique challenges in determining when royalties are due for specific training data.

The issue is simplest for products like Google's AI Search Abstracts, which draw data from the web in real time and maintain strict attribution for each fact.

However, if training isn't logged when it takes place, confirming that a given document was ingested into an LLM can be nearly impossible.
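To make the logging point concrete, here is a minimal Python sketch of what recording ingestion at training time could look like: each document is fingerprinted with SHA-256 as it is consumed, so a later royalty query can check whether a given text was seen. The IngestionLog class and its methods are hypothetical and are not part of RSL or of any AI company's pipeline described in the article.

```python
import hashlib


class IngestionLog:
    """Hypothetical in-memory record of documents consumed during training."""

    def __init__(self) -> None:
        # Maps a content fingerprint to the URL it was fetched from.
        self._seen: dict[str, str] = {}

    @staticmethod
    def _fingerprint(text: str) -> str:
        # Exact-match fingerprint: any change to the text yields a different hash.
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def record(self, url: str, text: str) -> str:
        """Log a document at the moment it is ingested; return its fingerprint."""
        digest = self._fingerprint(text)
        self._seen[digest] = url
        return digest

    def was_ingested(self, text: str) -> bool:
        """Check afterwards whether this exact text was logged."""
        return self._fingerprint(text) in self._seen


log = IngestionLog()
log.record("https://example.com/article", "Full text of a licensed article.")
print(log.was_ingested("Full text of a licensed article."))
print(log.was_ingested("A document that was never logged."))
```

Without a record of this kind, the only evidence of ingestion is the model's behavior, which is the situation the article describes as nearly impossible to verify.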