
This protocol will make AI companies pay for training data
What's the story
In the wake of Anthropic's $1.5 billion copyright settlement, the artificial intelligence (AI) sector is grappling with its training data dilemma. The issue stems from numerous lawsuits over unlicensed data usage, including one against Midjourney for generating copyrighted images. To head off lawsuits that could cripple the industry, a group of technologists and web publishers has introduced a system called Really Simple Licensing (RSL).
Licensing solution
RSL co-founder discusses the need for licensing
RSL, co-founded by Eckart Walther (co-creator of the RSS standard), aims to create a scalable training data licensing system across the internet. "We need to have machine-readable licensing agreements for the internet," Walther told TechCrunch. The protocol defines specific licensing terms that publishers can set for their content, whether that means requiring AI firms to obtain a custom license or adopting Creative Commons provisions.
Implementation
How the RSL protocol works
The RSL Protocol allows participating websites to include terms in their "robots.txt" file in a prearranged format. This makes it easy to identify which data falls under which terms. On the legal side, the RSL team has created a collective licensing organization, the RSL Collective, that can negotiate terms and collect royalties. This is akin to ASCAP for musicians or MPLC for films.
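As a rough illustration of what machine-readable terms in "robots.txt" could look like to a crawler, here is a minimal Python sketch. It assumes a "License:" directive that points to a separate file of terms; the directive name, the sample URL, and the helper names are assumptions made for this example and are not details given in this article.

```python
from dataclasses import dataclass


@dataclass
class LicensePointer:
    """One licensing reference found in a robots.txt file."""
    url: str
    line_number: int


def find_license_pointers(robots_txt: str, directive: str = "License") -> list[LicensePointer]:
    """Collect every line of the form '<directive>: <url>'.

    The directive name is an assumption for this sketch; adjust it to
    whatever the published specification actually defines.
    """
    pointers = []
    for number, raw in enumerate(robots_txt.splitlines(), start=1):
        # Drop trailing comments and surrounding whitespace.
        # (Simplification: this also cuts off URL fragments after '#'.)
        line = raw.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        # Split on the first colon only, so the URL's own colons survive.
        key, value = line.split(":", 1)
        if key.strip().lower() == directive.lower() and value.strip():
            pointers.append(LicensePointer(value.strip(), number))
    return pointers


# Hypothetical robots.txt content, for illustration only.
SAMPLE = """\
User-agent: *
Allow: /
License: https://example.com/license.xml
"""

if __name__ == "__main__":
    for pointer in find_license_pointers(SAMPLE):
        print(pointer.line_number, pointer.url)
```

A crawler built along these lines would fetch the referenced file and read the terms before deciding whether it may use the site's content.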
Collective participation
Major web publishers join RSL Collective
A number of major web publishers have joined the RSL Collective, including Yahoo, Reddit, Medium, O'Reilly Media, Ziff Davis (owner of Mashable and CNET), Internet Brands (owner of WebMD), People Inc., and The Daily Beast. Some firms, such as Fastly, Quora, and Adweek, are backing the standard without joining the collective. Notably, some members already have licensing deals in place: Reddit, for example, receives an estimated $60 million annually from Google for its training data.
Adoption hurdles
Challenges in determining royalties for training data
AI models present unique challenges in determining when royalties are due for specific training data. The issue is simplest for products like Google's AI Search Abstracts, which draw data from the web in real time and maintain strict attribution for each fact. However, if training isn't logged when it takes place, confirming after the fact that a given document was ingested into a large language model (LLM) can be nearly impossible.
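To make the logging point concrete, here is a minimal Python sketch of what recording ingestion at training time could look like. It is an illustrative assumption, not something the RSL protocol is described as requiring: the class and method names are hypothetical, and matching on an exact SHA-256 hash will miss documents that were altered in any way before training.

```python
import hashlib
import time


def fingerprint(text: str) -> str:
    """Return a SHA-256 hex digest identifying the exact document text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class IngestionLog:
    """Hypothetical in-memory record of documents fed into a training run."""

    def __init__(self) -> None:
        self._entries: dict[str, dict] = {}

    def record(self, url: str, text: str) -> str:
        """Log a document at the moment it is ingested and return its digest."""
        digest = fingerprint(text)
        self._entries[digest] = {"url": url, "logged_at": time.time()}
        return digest

    def was_ingested(self, text: str) -> bool:
        """Check whether this exact text was logged during training."""
        return fingerprint(text) in self._entries


if __name__ == "__main__":
    log = IngestionLog()
    log.record("https://example.com/article", "Example article body.")
    print(log.was_ingested("Example article body."))
    print(log.was_ingested("A different document."))
```

Without a record of this kind made during training, there is nothing to check a publisher's document against later.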