Meta Trained Its AI on Pirated Books. Now It's in Court.

Five major publishers and author Scott Turow have filed a class action lawsuit against Meta, alleging the company used pirate sites such as LibGen to train its Llama AI models without permission.

Meta didn't go to a library. It allegedly went to the internet's black market.

Five of the world's largest book publishers—Macmillan, McGraw Hill, Elsevier, Hachette, and Cengage—along with bestselling author Scott Turow have filed a class action lawsuit against Meta, accusing the company of committing "one of the most massive infringements of copyrighted materials in history." The alleged vehicle: notorious pirate sites including LibGen, Anna's Archive, and Sci-Hub, used to harvest books and academic journals as training data for Meta's Llama AI models—without a single licensing agreement.

What the Lawsuit Actually Claims

The complaint doesn't just allege careless scraping. It accuses Meta of knowingly pulling copyrighted material from sites that exist specifically to circumvent copyright law. LibGen hosts millions of academic texts and textbooks for free download. Sci-Hub has been the subject of publisher lawsuits for years for distributing paywalled research without authorization. These aren't gray-area platforms—they've been labeled piracy operations by courts in multiple jurisdictions.

The publishers argue that Meta didn't stumble onto this content. It chose it. And then it built a commercially deployed AI product—Llama, now embedded across Meta's platforms and licensed to third-party developers—on top of that allegedly stolen foundation. The plaintiffs are seeking damages and, crucially, a legal precedent that would require AI companies to license content before training on it.

Why This Case Cuts Deeper Than the Others

AI copyright litigation isn't new. The New York Times is suing OpenAI and Microsoft. Authors including George R.R. Martin and John Grisham filed suit against OpenAI in 2023. But most of those cases hinge on whether training AI on publicly accessible web content constitutes fair use—a genuinely unsettled legal question.

This lawsuit introduces a harder-edged element: intentionality. If Meta knowingly sourced data from illegal repositories, the fair use defense becomes considerably more difficult to sustain. Fair use analysis under U.S. copyright law weighs factors including the commercial nature of the use and the effect on the market for the original work. Using pirated sources to build a product that competes with—or reduces demand for—the original content cuts against both.


For Meta, the legal exposure here isn't just financial. A ruling against the company could force a fundamental rethink of how Llama models were built and what liabilities attach to products trained on contested data.

Three Stakeholders, Three Very Different Problems

For publishers and authors, this is existential arithmetic. A textbook that sells for $200 generates royalties for its author and revenue for its publisher. If that same textbook is scraped into an AI that can answer any question from it on demand, the market for the original collapses. Scott Turow, a longtime advocate for authors' rights, lends the lawsuit both legal credibility and moral weight.

For AI developers and startups, a ruling in the publishers' favor creates a structural problem that goes beyond Meta. High-quality, curated text data—the kind found in books and peer-reviewed journals—is precisely what separates capable AI models from noise. If licensing becomes legally mandatory, the cost of training frontier models rises sharply. That's manageable for Meta or Google. For open-source projects and smaller labs, it could be prohibitive. The irony: a ruling designed to protect creators might further consolidate AI power among the largest incumbents.

For regulators and policymakers, this lawsuit may accomplish faster what legislation has struggled to do. The EU AI Act already mandates transparency around training data. In the U.S., Congress has moved slowly on AI-specific copyright reform. Courts, however, move on their own schedule—and a landmark ruling here could effectively set data licensing policy before any bill passes.

The Fair Use Gamble That Built an Industry

Silicon Valley's approach to training data has always carried a quiet legal bet: that using copyrighted material to train AI would be deemed transformative enough to qualify as fair use. That bet has never been fully tested at the appellate level in the context of generative AI. Multiple cases are working their way through the courts simultaneously, and the outcomes are genuinely uncertain.

What makes the Meta case a potential inflection point is the specificity of the piracy allegation. Courts evaluating fair use have at times weighed whether the defendant acted in good faith. Deliberately sourcing data from platforms that courts in other cases have already ruled illegal is a difficult posture to defend as good faith.

The publishers aren't just seeking damages. They're trying to establish that the economics of AI training must include the people who created the content that made the training valuable in the first place.

This content is AI-generated based on source articles. While we strive for accuracy, errors may occur. We recommend verifying with the original source.
