A major AI research lab and a major tech company have been accused of using captions from tens of thousands of YouTube videos, without permission, to train artificial intelligence models.
Google has strict rules against harvesting unauthorized material from YouTube, yet a new investigation by Proof News claims that several major tech companies have used captions from more than 170,000 videos.
The subtitles were part of "the Pile," a massive dataset compiled by the nonprofit EleutherAI. The Pile was originally intended to give small businesses and individuals a quick way to train their own models, but major tech and AI companies have also adopted this vast storehouse of information.
Apple was initially on the list of companies said to have used the Pile, but it has since denied those claims. Apple has stated that it is committed to respecting the rights of creators and publishers and that it offers an opt-out from having content used for Apple Intelligence training.
Several studies now show that two things are essential for more advanced AI models: data and computing power.
Increasing either or both leads to better responses, improved performance, and greater scale. However, data is an increasingly scarce and expensive commodity.
Companies like OpenAI and Google combine their own massive data repositories with licensing deals with major publishers and platforms such as Reddit.
Meta can draw on Facebook, Instagram, Threads, and WhatsApp, but it faces user backlash over doing so. Apple has a huge amount of user data, but its own privacy policies make that data less useful for initial model training.
This lack of available data leads companies to look for new sources of information to train their next-generation models, but not all of those sources are willing to provide data, or are even aware that the information they create is being used to train AI.
Several lawsuits are currently pending against AI image and music generation companies over whether the use of copyrighted material as training data qualifies as fair use.
While the AI companies did not compile these YouTube captions into the training dataset themselves, questions are being raised about where the data comes from and how rigorously big tech companies check whether they have the rights to use it.
It was not just videos from smaller creators that were included; videos from the BBC, NPR, the Wall Street Journal, MrBeast, and Marques Brownlee were also in the dataset.
A total of 48,000 channels and 173,536 videos were included in the YouTube subtitle dataset. Some of the videos contained conspiracy theories and parodies that could affect the integrity of the final model.
This is not the first time YouTube has been at the center of an AI training data controversy: OpenAI CTO Mira Murati was unable to confirm or deny whether YouTube videos had been used to train the company's advanced (but still unreleased) AI video model, Sora.
According to Wired, Nebula CEO Dave Wiskus called the use of data without consent "theft" and "disrespectful."
In a statement to Ars Technica, Anthropic said that the Pile includes only a small subset of YouTube's subtitles and that YouTube's terms and conditions cover only direct use of its platform, which is distinct from use of the Pile dataset.
"As for possible violations of YouTube's terms of service, we would have to refer you to the authors of The Pile."
Google has stated that it has taken steps over the years to prevent abuse, but it has not provided details about what those measures are or said whether this use of captions violates its terms.
However, Google is not entirely blameless: it has been reported that Gemini scans user documents stored on Google Drive even when users have not given permission.
While creators are angry about these findings, the questions of where training data comes from and who holds the rights to it remain up for debate.
This potential case of data misuse will likely be folded into the broader conversation about whether training data falls under fair use or requires a specific license, and a final decision on that question will not be reached for years.