A major AI research lab and a major tech company have been accused of using captions from tens of thousands of YouTube videos, without permission, to train artificial intelligence models.
Google has strict rules against harvesting unauthorized material from YouTube, yet a new investigation by Proof News claims that several major tech companies have used captions from more than 170,000 videos.
The subtitles were part of "the Pile," a massive dataset compiled by the nonprofit EleutherAI. The Pile was originally intended to give small businesses and individuals a quick way to train their own models, but major tech and AI companies have also adopted this vast storehouse of information.
Apple was initially on the list of companies said to have used the Pile, but it has since denied those claims. Apple has stated that it is committed to respecting the rights of creators and publishers and that it offers an opt-out from having content used for Apple Intelligence training.
Several studies now show that two things are essential for more advanced AI models: data and computing power.
Increasing either or both leads to better responses, improved performance, and greater scale. However, data is an increasingly scarce and expensive commodity.
Companies like OpenAI and Google combine their own massive data repositories with licensing deals with major publishers and platforms such as Reddit.
Meta can draw on Facebook, Instagram, Threads, and WhatsApp, but it faces user backlash over doing so. Apple has a huge amount of user data, but its own privacy policies make that data less useful for initial model training.
This lack of available data leads companies to look for new sources of information to train their next-generation models, but not all of those sources are willing to provide data, or are even aware that the information they create is being used to train AI.
Several lawsuits are currently pending against AI image and music generation companies over whether the use of copyrighted material as training data qualifies as fair use.
While the AI companies did not compile these YouTube captions into the training dataset themselves, questions are being raised about where the data comes from and how rigorously big tech companies check whether they have the rights to use it.
It was not just videos from smaller creators that were included; videos from the BBC, NPR, the Wall Street Journal, MrBeast, and Marques Brownlee were also in the dataset.
A total of 48,000 channels and 173,536 videos were included in the YouTube subtitle dataset. Some of the videos contained conspiracy theories and parodies that could affect the integrity of the final model.
This is not the first time YouTube has been at the center of an AI training data controversy: OpenAI CTO Mira Murati was unable to confirm or deny whether YouTube videos had been used to train the company's advanced (but still unreleased) AI video model, Sora.
According to Wired, Nebula CEO Dave Wiskus called the use of data without consent "theft" and "disrespectful."
In a statement to Ars Technica, Anthropic said that the Pile includes only a small subset of YouTube's subtitles and that YouTube's terms and conditions cover only direct use of its platform, which is distinct from use of the Pile dataset.
"As for possible violations of YouTube's terms of service, we would have to refer you to the authors of The Pile."
Google has stated that it has taken steps over the years to prevent abuse, but it has not provided details about what those measures are or said whether this use of captions violates its terms.
However, Google is not entirely blameless: it has been reported that Gemini scans user documents stored on Google Drive even when users have not given permission.
While creators are angry about these findings, the questions of where training data comes from and who holds the rights to it remain up for debate.
This potential case of data misuse will likely be folded into the broader conversation about whether training data falls under fair use or requires a specific license, and a final decision on that question will not be reached for years.