The more we learn about how AI is built, the more we are bombarded with reports of companies using copyrighted content to train AI without permission.
NVIDIA has been accused of downloading videos from YouTube, Netflix, and other sources to train its commercial AI projects. According to 404 Media, the company used the downloaded videos to train its 3D world generator, Omniverse, as well as AI models for "digital humans," such as its embodied AI "Gr00t" project.
When contacted via email, NVIDIA told Tom's Guide that it "respects all rights of content creators" and that its research efforts "fully comply with the letter and spirit of copyright law."
"Copyright law protects certain expressions, but not facts, ideas, data, or information Everyone is free to learn facts, ideas, data, and information from other sources and use them to express themselves"
The company also argued that AI model training is an example of fair use of content for transformative purposes.
Netflix declined to comment, but YouTube disagreed with NVIDIA's assessment. YouTube policy and communications manager Jack Marron pointed to comments made by CEO Neal Mohan to Bloomberg in April, saying, "Our earlier comments are still valid."
At the time, Mohan was responding to reports that OpenAI had trained its Sora AI video generator on YouTube videos without permission. He said, "You cannot download things like transcripts or video bits. These are the rules regarding content on our platform."
He added that YouTube has "a very strict policy" regarding the content on its platform.
This is not the first time this summer that NVIDIA has been accused of scraping YouTube. Several large companies, including Apple and Anthropic, have reportedly been pulling information from a huge dataset called "the Pile," which features material from thousands of YouTube videos, including those of popular creators like Marques Brownlee and PewDiePie.
According to 404 Media, employees who raised ethical or legal concerns were told by management that they had permission from "the highest levels of the company."
"This is an executive decision," replied Ming-Yu Liu, NVIDIA's vice president of research We have blanket approval for all data"
Apparently, some managers kicked the can down the road, saying that scraping was an open legal issue that the company would address later.
The datasets allegedly scraped by NVIDIA are not limited to YouTube and Netflix videos. The company is also said to have pulled from MovieNet, a database of movie trailers, a library of video game footage, and WebVid, a video dataset hosted on GitHub.
Scraping also raises the risk that low-quality data ends up being used to train models.
Bruno Kurtic, CEO of Bedrock Security, points out that "given the sheer size of the data used, trying to do this manually will always result in incomplete answers, and as a result the models may not stand up to regulatory scrutiny."
He further suggested that companies building AI should provide an auditable "data bill of materials," detailing where their training data came from and whether it was ethically sourced.
That is one way companies could address the problem, but which data is clean when everyone is scraping everyone else?
Allegedly, some of the videos NVIDIA used came from a vast library of YouTube videos whose license clearly states they are for academic research purposes only. NVIDIA apparently argued that such academic libraries are fair game for commercial AI products.
YouTube's parent company, Alphabet, is not immune from criticism for scraping the Internet for AI models either. Last summer, Google announced plans to "use all publicly available information to help train Google's AI models and build products and features such as Google Translate, Bard, and Cloud AI capabilities."
Everything posted on Google platforms like YouTube was considered fair game, and it is safe to assume that everything posted elsewhere on the Internet was considered fair game as well.
At the time, a Google spokesperson told Tom's Guide, "Google has long made clear in its privacy policy that it uses publicly available information from the open web to train the language models for services like Google Translate. This latest update clarifies that this includes newer services like Bard. We will incorporate privacy principles and safeguards into the development of our AI technology, in line with our AI Principles."
This means that any public post made at any point in time could fall prey to Google's AI ambitions.
The full 404 Media report has more details and is worth a read.