Nearly 400 Publishers Accuse Microsoft and OpenAI of Stripping Copyright Data for AI Training

Nearly 400 newspaper publishers filed a federal lawsuit against Microsoft and OpenAI in New York on June 24, 2026, accusing the tech giants of systematically scraping paywalled articles, stripping copyright management information, and using the copyrighted content to train generative AI models without permission or compensation. The suit, lodged in the Southern District of New York, represents one of the largest collective actions by the news industry against artificial intelligence developers to date. It alleges that Microsoft and OpenAI knowingly copied millions of articles from local, regional, and national newspapers, including material locked behind strict paywalls, to feed models that now power tools like Microsoft Copilot and ChatGPT. The publishers argue that this unauthorized use undermines their business models, depriving them of subscription revenue and licensing fees while enriching the tech companies.

The complaint details a pattern of willful copyright infringement, claiming that the defendants employed crawlers to bypass technical protections on news websites, scrubbed metadata identifying copyright holders, and incorporated the text into training datasets without consent. For Windows users who have grown accustomed to Copilot’s AI-driven summaries and content generation, the lawsuit raises urgent questions about the provenance of the information popping up in their taskbar and Office apps. If the court sides with the publishers, Microsoft could face billions in damages and be forced to alter how its AI models are trained, potentially limiting the capabilities of Copilot across Windows 11 and future versions.

The Allegations at a Glance

The plaintiffs, a coalition of nearly 400 newspaper publishers ranging from small community weeklies to major metropolitan dailies, outline three core violations. First, they accuse Microsoft and OpenAI of direct copyright infringement by reproducing entire articles in training corpora without a license. Second, they argue that the activity violates Section 1202 of the Digital Millennium Copyright Act (DMCA) through the removal of copyright management information: digital tags that identify the owner, terms of use, and other rights data. Third, the suit points to terms-of-service breaches and tortious interference, claiming the scraping contravened explicit prohibitions on automated access to publishers’ websites.
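To make the second allegation concrete, the sketch below shows how a text-extraction step can keep or discard copyright management information. It is an illustration only, not a description of the defendants’ pipeline: the sample markup, the meta tag names, and the `extract` helper are all hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical article markup; the meta tag names are illustrative only.
SAMPLE_HTML = """<html><head>
<meta name="author" content="Jane Doe">
<meta name="copyright" content="(c) 2026 Example Gazette">
</head><body><p>City council approves budget.</p></body></html>"""

# Meta fields treated as copyright management information in this sketch.
CMI_FIELDS = {"author", "copyright"}


class ArticleParser(HTMLParser):
    """Collects visible text and, separately, CMI-like meta tags."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.cmi = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attr = dict(attrs)
            name = (attr.get("name") or "").lower()
            if name in CMI_FIELDS:
                self.cmi[name] = attr.get("content", "")

    def handle_data(self, data):
        # Skip whitespace-only runs between tags.
        if data.strip():
            self.text_parts.append(data.strip())


def extract(html, keep_cmi):
    """Return a training record, with or without the rights metadata."""
    parser = ArticleParser()
    parser.feed(html)
    record = {"text": " ".join(parser.text_parts)}
    if keep_cmi:
        record.update(parser.cmi)
    return record


print(extract(SAMPLE_HTML, keep_cmi=False))
print(extract(SAMPLE_HTML, keep_cmi=True))
```

With `keep_cmi=False` the record carries only the article text, so nothing downstream can tell who owns it. That loss of ownership data is the kind of removal the publishers describe.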

The timing is significant: by 2026, generative AI has become deeply embedded in productivity software. Microsoft’s Copilot, once a sidebar experiment, now autocompletes paragraphs in Word, generates slides in PowerPoint, and answers complex queries directly in Windows Search. The publishers assert that these features rely heavily on the body of journalism they produced, yet Microsoft and OpenAI have paid nothing for that raw material. They seek statutory damages under the Copyright Act, which can reach $150,000 per infringed work where the infringement is found to be willful, along with injunctive relief to halt further use of their content.
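A rough calculation shows how per-work statutory damages scale into the billions. The per-work amounts below are the ranges set out in 17 U.S.C. 504(c); the count of one million works is a hypothetical round number, since the article reports only that the complaint alleges “millions” of copied articles. Any actual award would depend on what the court finds and how many works qualify.

```python
# Per-work statutory damages under 17 U.S.C. 504(c), in US dollars.
STATUTORY_MIN = 750      # ordinary minimum per work
STATUTORY_MAX = 30_000   # ordinary maximum per work
WILLFUL_MAX = 150_000    # ceiling per work for willful infringement


def statutory_exposure(num_works):
    """Return the theoretical damages range for a given number of works."""
    return {
        "minimum": num_works * STATUTORY_MIN,
        "ordinary_max": num_works * STATUTORY_MAX,
        "willful_max": num_works * WILLFUL_MAX,
    }


# Hypothetical count, chosen only to illustrate the scale.
exposure = statutory_exposure(1_000_000)
for label, amount in exposure.items():
    print(f"{label}: ${amount:,}")
```

Even at the statutory minimum, one million works would come to $750 million. At the willful ceiling the same count reaches $150 billion.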

How AI Training Relies on News Content

Large language models (LLMs) like those behind Copilot and ChatGPT learn from vast troves of text scraped from the open web. News articles are particularly prized because they are factual, well-written, and cover current events—qualities that help models generate accurate, context-aware responses. The publishers’ complaint highlights specific examples: a local crime report from the Sarasota Herald-Tribune, an investigative piece from The Seattle Times, and a restaurant review from The Philadelphia Inquirer all emerged almost verbatim in output from Microsoft’s AI tools, according to the filing.

Microsoft has previously argued that training on public web data constitutes fair use, a defense it has invoked in similar cases. However, the new suit emphasizes that a substantial portion of the ingested content sat behind paywalls and was never “publicly available” in the traditional sense. The publishers allege that Microsoft and OpenAI exploited authentication loopholes or used tools to bypass gateways, treating subscriber-only journalism as free raw material. This could weaken the fair use argument, as courts often consider whether the original work is unpublished or restricted when evaluating the “nature of the copyrighted work” factor.

The Fair Use Question

Fair use under U.S. copyright law rests on four factors: the purpose and character of the use, the nature of the copyrighted work, the amount and substantiality of the portion used, and the effect on the market for the original. Microsoft and OpenAI will likely argue that their AI training is transformative: it doesn’t merely reproduce articles but uses them to teach models to understand language, which then generate new, original content. They may also point to the public benefit of AI tools that help users draft emails, research topics, and create presentations.

The publishers counter that the verbatim or near-verbatim reproduction of their articles in response to user prompts proves that the models are not truly transformative; they are instead sophisticated retrieval-and-remix engines that compete directly with the original sources. If a Windows user can ask Copilot to summarize today’s top news and receive a digest lifted from paywalled articles without ever visiting the newspaper’s site, the publishers argue, the AI product becomes a market substitute. This market harm factor is often the most decisive in fair use cases, and the complaint goes to great lengths to document lost advertising dollars, declining subscriptions, and the erosion of local journalism.

Previous high-profile lawsuits, such as The New York Times’ action against OpenAI and Microsoft in December 2023, have not yet produced a final ruling on whether AI training falls under fair use. However, several of those cases have survived motions to dismiss, signaling that courts are willing to entertain the publishers’ arguments. The sheer number of plaintiffs in this 2026 filing may amplify its influence, as it points to industry-wide harm rather than a single plaintiff’s grievance.

Microsoft’s Deepening AI Integration and the Stakes for Windows

For Microsoft, the lawsuit is far more than a legal nuisance. Since its multi-billion-dollar partnership with OpenAI, the company has staked its future on AI. Copilot is now a core part of Windows 11 and the upcoming iterations of the operating system. It summarizes documents, writes code, manages notifications, and even controls device settings. Under the hood, these capabilities draw on models that the publishers claim were illicitly trained on their copyrighted works.

Should an injunction force Microsoft to retrain its models without the offending data, the process could take months and degrade performance, particularly in areas requiring up-to-date knowledge and nuanced language. Copilot’s ability to craft a news roundup, for instance, might become noticeably weaker or rely on fewer sources. For the hundreds of millions of Windows users, the impact could be subtle—fewer relevant suggestions, less coherent text generation—but for Microsoft, the financial and reputational costs would be substantial.

Moreover, Microsoft’s enterprise customers, many of whom have adopted Copilot for Microsoft 365, might face legal uncertainty if the underlying models are found to be infringing. The company could be forced to indemnify users against copyright claims, or worse, suspend certain features until licensing agreements are in place. Neither outcome is appealing for an organization that has bet billions on AI-driven productivity.

A Growing Wave of AI Copyright Battles

This lawsuit joins a rapidly expanding docket of AI-related copyright cases. Since 2023, authors, artists, music labels, and media conglomerates have filed suit against developers of generative AI. Among publishers and media companies, The New York Times, Getty Images, and a group of eight newspapers owned by Alden Global Capital have led the charge. The coalition of nearly 400 publishers, however, dwarfs previous efforts in scale. It represents a collective realization that piecemeal licensing deals, such as those struck between OpenAI and a handful of large publishers like Axel Springer, do little to protect the broader industry.

A key development came in early 2026 when a federal appeals court ruled that the Copyright Office’s refusal to register AI-generated artwork did not shield the developers of the underlying models from liability for using copyrighted training data. That decision, though not directly on point, emboldened copyright holders by affirming that using protected works without permission is not excused simply because the end product was not authored by a human. The publishers’ legal team is likely to cite that ruling extensively.

Internationally, regulators are also tightening. The European Union’s AI Act, most of whose provisions apply from August 2026, imposes transparency obligations on providers of general-purpose AI models, including a published summary of the content used for training. While the U.S. lacks a comparable federal framework, the mounting lawsuits are pushing Washington to consider legislation that would give publishers a clearer path to compensation.

Publisher Perspectives: Journalism at Risk

Behind the legal jargon, the publishers paint a bleak picture of an industry in crisis. Local newspapers, in particular, have been decimated by the digital shift, losing ad revenue to platforms like Google and Meta. The rise of AI-generated content summaries threatens to sever the remaining direct relationship between readers and newsrooms. When Copilot can answer “what happened at the city council meeting last night?” by synthesizing a local paper’s reporting without ever crediting or linking to the source, the paper loses the traffic that sustains its operations.

“This isn’t just about copyright,” said an attorney representing the publishers, speaking on condition of anonymity because the case is pending. “It’s about the survival of independent journalism. If the tech giants can vacuum up our work without paying, there won’t be any work left to take.” The complaint includes testimonials from editors and publishers describing staff layoffs and shuttered bureaus, directly attributing the decline to unfair competition from AI-powered aggregation.

Some publishers indicate they would have been willing to negotiate licensing deals, but Microsoft and OpenAI never approached them. Instead, the companies reportedly responded to earlier infringement notices by claiming their activities were protected by fair use and by pointing to the “opt-out” protocols they eventually implemented. However, the publishers argue that such opt-outs are ineffective, as they only apply to future scraping and do nothing to remedy the past ingestion of years’ worth of archives.
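The limitation the publishers describe can be seen in how a typical opt-out works. The article does not name the protocol, so the sketch below assumes it is a robots.txt rule aimed at GPTBot, the user agent OpenAI publishes for its crawler. The site URL and the `can_crawl` helper are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that opts out of one AI crawler only.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def can_crawl(robots_txt, user_agent, url):
    """Return True if robots.txt permits this user agent to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


url = "https://example-gazette.test/news/council-vote"  # placeholder domain
print(can_crawl(ROBOTS_TXT, "GPTBot", url))       # crawler is opted out
print(can_crawl(ROBOTS_TXT, "Mozilla/5.0", url))  # other agents still allowed
```

A rule like this is only consulted when a crawler decides whether to fetch a page. It has no effect on copies collected before the rule was published, which is why the publishers say it does not address archives that were already ingested.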

Potential Outcomes and Industry Ramifications

Legal experts watching the case say it could take several years to resolve, given the complexity of the legal questions and the likelihood of appeals. A few possible scenarios emerge. First, Microsoft and OpenAI could settle, perhaps by establishing a substantial licensing fund that compensates publishers for training data, loosely comparable to the way YouTube’s Content ID lets rights holders earn revenue from matched uploads. Second, the court could issue a partial ruling, finding that some uses infringe while others are fair, leading to a messy, fact-specific patchwork of liability. Third, the court could side squarely with the publishers, sending shockwaves through the AI industry and forcing a fundamental rethinking of how models are trained.

For Windows users, a settlement or licensing arrangement would be the most seamless outcome, allowing Copilot to continue humming along without disruption—though perhaps with more attribution and links to source journalism. A ruling against Microsoft, however, might force the company to redesign Copilot’s retrieval-augmented generation layers, reducing its reliance on external web content and emphasizing licensed or proprietary data. This could make the AI assistant less useful for real-time news queries but more defensible legally.

Beyond the courtroom, the lawsuit accelerates a broader reckoning over the economics of AI. Content creators—whether journalists, photographers, or musicians—are increasingly demanding a stake in the value their work creates for tech platforms. The outcome here could set the template for how those demands are met. Windows users, whether they realize it or not, have a stake in the fight: the quality, trustworthiness, and legality of the AI tools woven into their daily computing experience hang in the balance.