When Sora (OpenAI’s text-to-video model) launched in February, the company’s CTO, Mira Murati, claimed that it was trained using "publicly available data and licensed data."
But when probed on where this ‘public and licensed’ data came from, she could not say whether it included videos from YouTube, Instagram, or Facebook.
The lack of clarity about where Sora’s training data came from clearly struck a nerve with YouTube CEO Neal Mohan. He has openly stated that, although there is no concrete evidence that it happened, if OpenAI used YouTube data to train Sora, it would be a “clear violation” of YouTube’s terms of service.
In his view, people spend hours creating content for YouTube, so when they upload the fruits of their labor, they expect YouTube’s terms of service (one of which is “to prohibit unauthorized scraping or downloading of YouTube content”) to be upheld. Therefore, according to Mohan, having content scraped and used by a third party is “a clear violation of our terms of service.”
No one knows whether OpenAI used YouTube data to train Sora and, at the time of writing, the company has yet to comment either way.
But the evidence doesn’t look good.
It’s well known that OpenAI scrapes vast amounts of data from the internet, some of it copyrighted, to build AI models that understand context and human nuance and better replicate human intelligence. The company recently admitted as much: in a filing submitted to the British House of Lords while the UK Government was considering a new law that would limit how AI companies could use copyrighted material, it acknowledged using copyrighted data to train its models and claimed it was “impossible” to build the technology without it.
And, with well-publicized copyright infringement battles against the New York Times and others, we have to agree that it doesn’t look good for OpenAI at this point.
But is copyright infringement the only reason YouTube is outraged over this potential violation, or is something else fueling the fire?
Google (which bought YouTube in 2006) is busy developing its own suite of AI tools, and if it wants to keep up with the likes of OpenAI in the AI race, it also needs vast amounts of data to train its models. So, although Mohan is quick to assure us that “Google adheres to individual contracts with creators before using any YouTube videos”, isn’t Google planning to do the same thing it has just slammed OpenAI for?
Google and YouTube seem very protective of their creators when it comes to competitors using that data, but when it comes to using the data to advance their own work, they don't seem to give content violations much thought…