Google can train AI using web content despite publisher opt-out

By Dwaipayan Roy

May 04, 2025

10:27 am

What's the story

Google's ability to use web content to train its AI models, even when publishers have opted out, has come under scrutiny. Recent court testimony revealed that the tech giant can train its search-specific AI products on data from across the internet, including content from publishers that have opted out of Google's AI training. Eli Collins, a Vice President at Google DeepMind, confirmed this in court.

Legal proceedings

Google's AI training practices under legal scrutiny

During a recent court case examining Google's dominance of the search market, Collins testified that Google's search division can further train its models using data from publishers that have opted out. The revelation came after Diana Aguilar, an attorney for the Department of Justice (DoJ), asked whether the search division could train on data that publishers had chosen not to provide. Collins confirmed that this was correct for use in search.

Publisher concerns

Impact on publishers and Google's response

The use of website data to generate AI responses has raised concerns among publishers. They argue that Google's AI summaries of search results could discourage users from visiting independent websites, hurting their revenue. Addressing these concerns, Google clarified that publishers can keep their data out of search AI only by opting out of search indexing entirely.

DoJ's position

US Department of Justice's stance on Google's practices

The DoJ is pushing for measures to restore competition in online search. These include possible restrictions on Google's AI practices, along with proposals that Google sell its popular Chrome browser and share critical data used for generating search results. The DoJ has also proposed barring Google from paying to be the default search engine on other apps and devices, a restriction that would also apply to its AI offerings such as Gemini.

Data disclosure

Google's AI training data revealed in court

During the court proceedings, Aguilar submitted a document called "Search GenAI Gemini v3." It showed that Google had removed 80 billion of 160 billion "tokens," or content snippets, after filtering out material that publishers had opted out of allowing Google to use for AI training. The document also listed search "sessions data" and YouTube videos as other data sources that could improve Google's AI models. Collins confirmed this when Judge Amit Mehta asked him.

AI development

Google's exploration of AI training using search data

Collins also testified that Google has considered how its AI models could be greatly improved by the data it has accumulated over years of running the world's most popular search engine. This came to light during cross-examination, when Aguilar presented a briefing document prepared for Demis Hassabis, CEO of Google DeepMind. The document proposed training an unidentified Google AI model on extensive search data, including search rankings, to measure how much it improved compared with a model not trained on that data.