Is your YouTube content powering AI? Apple, NVIDIA's practices exposed 
Apple used YouTube transcripts to train its OpenELM model

Jul 17, 2024
09:58 am

What's the story

Several leading technology companies, including Apple, NVIDIA, and Anthropic, have reportedly used transcripts from more than 170,000 YouTube videos to train their artificial intelligence (AI) models without the creators' permission. The transcripts come from a dataset created by EleutherAI, a non-profit organization that downloaded subtitle files from more than 48,000 YouTube channels. This dataset is part of a larger compilation known as "The Pile," intended primarily for small developers and academics.

Dataset details

"The Pile" dataset: A resource for AI training

According to a research paper published by EleutherAI, most of the datasets in "The Pile" are open to anyone on the internet with sufficient storage space and computing power. Apple reportedly used this dataset to train OpenELM, a high-profile model released in April, weeks before the company announced it would add new AI capabilities to iPhones and MacBooks. NVIDIA and Salesforce also stated in their research papers that they used "The Pile" to train their AI models.

Creators impacted

Content creators and publishers affected by unauthorized use

The creators whose content was used without consent include tech reviewer Marques Brownlee, MrBeast, PewDiePie, Stephen Colbert, John Oliver, and Jimmy Kimmel, as well as large news publishers like The New York Times, BBC, ABC News, and Engadget. Brownlee commented on the issue, stating: "Apple has sourced data for their AI from several companies. One of them scraped tons of data/transcripts from YouTube videos, including mine."

Legal concerns

Potential violation of YouTube's Terms and Conditions

While Apple and other companies likely used this publicly available dataset in good faith, EleutherAI may have violated YouTube's terms and conditions by downloading the data. A Google spokesperson reiterated previous comments by YouTube CEO Neal Mohan that companies using YouTube's data to train AI models would violate the platform's terms of service. The use of third-party datasets to train AI systems has raised legal and ethical concerns, particularly when the material is used without permission.

Transparency issues

AI companies criticized for lack of transparency

AI companies have generally not been transparent about the data used to train their models. Earlier this month, artists and photographers criticized Apple for failing to reveal the source of training data for Apple Intelligence, the company's own spin on generative AI. Proof News, which conducted the investigation, has released a lookup tool for users to check if subtitles from their YouTube videos or from their favorite channels are part of the dataset.