The $40 Trillion Question: Who Owns the Data Behind AI?


As AI companies face 50+ copyright lawsuits, the battle over training data could reshape how artificial intelligence is built and who profits from it.

40 trillion tokens. That's how much text Meta used to train just one of its AI models—an amount that would take a human millions of years to read. But here's the catch: nobody asked permission from the millions of creators whose work fed this digital beast.
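The scale claim can be sanity-checked with a quick back-of-envelope calculation. The tokens-per-word ratio and reading speeds below are illustrative assumptions, not figures from the article:

```python
# Back-of-envelope: how long would it take one human to read
# 40 trillion tokens? All constants are rough assumptions.

TOKENS = 40e12
WORDS_PER_TOKEN = 0.75       # common rule of thumb for English text
WORDS_PER_MINUTE = 250       # typical adult reading speed
READING_HOURS_PER_DAY = 2    # a dedicated daily reading habit

words = TOKENS * WORDS_PER_TOKEN
minutes = words / WORDS_PER_MINUTE
years_nonstop = minutes / 60 / 24 / 365
years_daily_habit = minutes / 60 / READING_HOURS_PER_DAY / 365

print(f"Reading nonstop, 24/7: ~{years_nonstop:,.0f} years")
print(f"At {READING_HOURS_PER_DAY} hours/day: ~{years_daily_habit:,.0f} years")
```

Even reading around the clock, the corpus would take a couple of hundred thousand years; at a realistic daily pace, the figure runs into the millions of years.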

The transparency that once defined AI development has vanished. When OpenAI launched GPT-3 in 2020, it published a detailed "reading list" showing exactly what data trained the model. Today, that same company treats such information as a trade secret, even as its technology powers everything from school assignments to medical diagnoses.

This shift from openness to opacity has triggered what may become the defining legal battle of the AI era. More than 50 copyright lawsuits are already winding through U.S. courts, with several major cases set to advance in 2026. The outcome could fundamentally reshape how AI is built—and who gets paid for it.

The Creative Industries Fight Back

The numbers tell the story of what's at stake. Creative industries—music, film, publishing, and software—account for 8% of U.S. GDP and support nearly 12 million jobs. These sectors are now watching AI companies build billion-dollar businesses on the backs of their copyrighted work.

Björn Ulvaeus of ABBA captured the frustration perfectly: "You cannot avoid the fact that its sheer existence is because of the songs that I wrote in the past. I should be remunerated for that."

The legal battlefield is sprawling. Music publishers are suing Anthropic over song lyrics used to train Claude. Visual artists are challenging Google's image-generation tools. In a particularly aggressive move, Disney and Universal Pictures recently accused the AI image generator Midjourney of being a "bottomless pit of plagiarism."

"Piracy is piracy, and the fact that it's done by an AI company does not make it any less infringing," declared Disney's chief legal officer Horacio Gutierrez.

The Courts Are Split

AI companies aren't backing down. They argue that training models on vast collections of existing material is fundamentally different from traditional copying—it's more like teaching a student to write by showing them examples of good writing.

Some judges are buying this argument. In a case brought by book authors against Anthropic, U.S. District Judge William Alsup described AI training as "quintessentially transformative," comparing it to "training schoolchildren to write well."

But other judges are more skeptical. U.S. District Judge Vince Chhabria warned that AI training could fail fair-use tests if the technology risks "flooding the market" with content that undermines incentives for human creators—a core principle of copyright law.

The legal uncertainty reflects a deeper challenge: applying copyright laws written for the analog age to technology that operates at digital scale. When a single AI model ingests more text than humanity could read in millennia, traditional concepts of "copying" and "fair use" strain to their breaking point.

The Licensing Gold Rush

While courts deliberate, some companies aren't waiting for legal clarity. Disney invested $1 billion in OpenAI and agreed to let the company use Disney characters in its video generator. Warner Music has settled lawsuits with AI music startups and announced plans to build licensed tools together.

But this licensing approach creates a two-tier system. Major entertainment companies have the leverage to negotiate lucrative deals with AI firms. Independent creators and smaller rights holders don't. If courts ultimately decide that licensing isn't legally required, even these negotiated deals could evaporate.

The Trump administration appears unlikely to side with creators. At the launch of his AI Action Plan, Trump dismissed the idea of paying for training data: "You can't be expected to have a successful AI program when every single article, book, or anything else that you've read or studied, you're supposed to pay for."

The copyright battles, significant as they are, represent just one facet of a broader transparency crisis. In 2023, researchers discovered more than 1,000 images of child sexual abuse in a public dataset used to train popular AI image generators. The material had been widely shared and embedded in multiple systems before anyone noticed.

Studies have also revealed that AI systems are disproportionately trained on English-language content and Western cultural perspectives, potentially encoding these biases into tools used worldwide. When training data remains hidden, these problems become nearly impossible to identify or address.

The Global Regulatory Response

The European Union's AI Act already requires providers of general-purpose AI models to publish summaries of the content used to train them. No comparable rules exist in the United States, leaving courts and private licensing agreements to fill the regulatory void.

This patchwork approach is becoming increasingly untenable as AI systems become more powerful and pervasive. The question isn't just about money—it's about who gets to shape the technology that's reshaping society.

This content is AI-generated based on source articles. While we strive for accuracy, errors may occur. We recommend verifying with the original source.
