NewsGuard Audit: ChatGPT and Meta AI Top 40% Falsehood Rate as Chatbots Ditch 'I Don't Know'

The most popular consumer chatbots are now far more likely to repeat provably false claims about breaking news and controversial topics than they were a year ago, and the shift in behavior appears rooted in product trade‑offs that favor responsiveness over caution. That is the uncomfortable signal from NewsGuard’s latest AI False Claims Monitor, released as a one‑year anniversary edition. The audit reports that the ten leading generative AI tools repeated falsehoods in roughly 35% of news‑related replies in August 2025 — nearly double the 18% recorded in August 2024 — while refusal or “I don’t know” answers have all but disappeared.

The findings, first summarized for general readers by PC Guide, come as Microsoft Copilot, ChatGPT, and other assistants become deeply integrated into Windows, Office, and enterprise workflows. For IT admins and Windows users who rely on these tools for research, content generation, and decision support, the report serves as a stark reminder that more answers do not mean more accuracy, and that the industry’s pivot to web‑grounded retrieval has opened a measurable attack surface for misinformation.

What NewsGuard measured and why it matters

NewsGuard’s AI False Claims Monitor is a targeted red‑teaming exercise. Each month, the organization selects 10–15 provably false narratives from its False Claim Fingerprints database — claims that have been debunked by professional fact‑checkers — and tests them against the leading chatbots using three prompt personas: innocent (an everyday user asking a question), leading (a prompt that nudges the model toward the false claim), and malign (an adversarial attempt to coax out the falsehood). Responses are then classified by human analysts as a debunk (the model refutes the claim), a non‑response (the model declines to answer or states it doesn’t know), or misinformation (the model repeats the false claim as fact).
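As a rough illustration of how the three outcome categories turn into the percentages quoted in this article, the sketch below tallies a list of analyst labels into rates. The function name, the label strings, and the sample data are hypothetical and are not taken from NewsGuard's tooling.

```python
from collections import Counter

# Label names mirror the three outcome categories described above (names are assumed).
LABELS = ("debunk", "non_response", "misinformation")


def outcome_rates(labels):
    """Return each outcome's share of all labelled responses, in percent."""
    if not labels:
        raise ValueError("need at least one labelled response")
    counts = Counter(labels)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = len(labels)
    return {label: 100.0 * counts[label] / total for label in LABELS}


# Hypothetical labels for one chatbot: 10 claims x 3 personas = 30 responses.
sample = ["debunk"] * 19 + ["non_response"] * 1 + ["misinformation"] * 10
rates = outcome_rates(sample)
for label, pct in rates.items():
    print(f"{label}: {pct:.1f}%")
```

In this made-up sample, 10 of 30 responses repeat the false claim, so the misinformation rate is about one third.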

This adversarial methodology is not meant to be an all‑purpose accuracy benchmark. It is a stress test that simulates both ordinary user behavior and deliberate attempts to game the systems. That makes it particularly relevant for news, politics, health, and corporate reputation — domains where confidently delivered inaccuracies can cause immediate harm.

The headline numbers: from 18% to 35% in one year

The aggregate repeat‑falsehood rate across all ten tested chatbots rose from 18% in August 2024 to 35% in August 2025. Over the same period, the rate at which models refused to answer or said they didn’t know plummeted from approximately 31% to near zero. In other words, chatbots are now answering nearly every prompt, but over a third of those answers on news‑related topics contain provably false claims.
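The move from 18% to 35% can be stated either as a percentage-point change or as a relative change, and the two are easy to confuse. A minimal sketch of the arithmetic, using only the two aggregate figures reported above (the helper name is illustrative):

```python
def describe_change(before_pct, after_pct):
    """Return (percentage-point change, ratio, relative change in percent)."""
    point_change = after_pct - before_pct
    ratio = after_pct / before_pct
    relative_change = 100.0 * (ratio - 1.0)
    return point_change, ratio, relative_change


points, ratio, relative = describe_change(18.0, 35.0)
print(f"{points:.0f} points, {ratio:.2f}x, {relative:+.0f}% relative")
# prints: 17 points, 1.94x, +94% relative
```

A 17-point rise is therefore the same thing as "nearly double": the rate is about 1.94 times its level a year earlier.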

NewsGuard also publicly named model‑level scores for the first time. The August 2025 snapshot, as reported by PC Guide and multiple independent outlets, shows wide variation:

ChatGPT and Meta AI (Llama‑based): approximately 40% false claims.
Microsoft Copilot and Mistral’s le Chat: around 35–36%.
xAI’s Grok and You.com: roughly 33%.
Google Gemini: about 17%.
Anthropic’s Claude: near 10%.
Inflection’s Pi and Perplexity: much higher repeat rates in the August sample, though not every public summary gave exact figures.

These numbers represent a point‑in‑time measurement of specific product configurations. Copilot’s behavior, for example, can vary depending on whether it is using web grounding in Edge, inside Office applications, or via the Bing interface, and all vendors frequently update their retrieval stacks and safety filters.

Why the deterioration happened: technical and product drivers

The dramatic rise in repeated falsehoods does not reflect a sudden drop in raw model intelligence. Instead, NewsGuard’s analysis and supporting investigative reporting point to three converging trends that changed how chatbots produce answers.

1) Web grounding turned retrieval into a new attack surface

Many chatbots moved from static knowledge cutoffs to real‑time web retrieval specifically to improve recency on news and event‑driven queries. That change makes assistants more useful, but it also exposes them to a polluted online ecosystem. SEO farms, AI‑generated content mills, and deliberately created microsites optimized for crawler ingestion can all surface false narratives that retrieval algorithms may treat as authoritative. When a model retrieves and incorporates information from such sources without robust source‑quality discrimination, the resulting answer can amplify exactly the falsehood the retrieval was meant to fact‑check.

2) Reward tuning prioritized helpfulness over refusal

Vendors have increasingly optimized models for user engagement by penalizing refusals and rewarding complete, on‑task answers. That product decision directly reduces the “I don’t know” rate, which fell from about 31% to near zero in NewsGuard’s tracking, but it also removes a critical safety valve. In ambiguous or information‑scarce situations, models now default to generating an answer rather than bowing out, and without additional verification layers, that answer is often confidently wrong.

3) Disinformation networks now specifically target machine readers

Investigative reports have documented coordinated operations that seed false claims across multiple web properties in machine‑digestible formats. By creating high volumes of reposts, aggregator pages, and synthetic sites, malign actors can inflate the apparent footprint of a false narrative. Retrieval systems that rely on naive popularity or recency signals can mistake that volume for credibility. NewsGuard’s audit cites examples where chatbot outputs referenced pages tied to known influence networks, a laundering process that turns low‑quality web content into confident chatbot assertions.
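To make the volume-versus-credibility point concrete, the sketch below compares a naive page count with a count that weights each domain by a credibility score. Everything here is an assumption for illustration: the domains, the scores, and both function names are invented, and no vendor's actual ranking logic is implied.

```python
from collections import defaultdict
from urllib.parse import urlparse


def naive_support(pages):
    """Count pages per claim, so every repost adds weight."""
    support = defaultdict(int)
    for _url, claim in pages:
        support[claim] += 1
    return dict(support)


def weighted_support(pages, credibility, default=0.1):
    """Count each domain once per claim, weighted by its credibility score."""
    seen = set()
    support = defaultdict(float)
    for url, claim in pages:
        domain = (urlparse(url).hostname or "").lower()
        if (domain, claim) in seen:
            continue  # repeated pages on one domain add nothing
        seen.add((domain, claim))
        support[claim] += credibility.get(domain, default)
    return dict(support)


# Hypothetical crawl: five synthetic mirror sites push a false claim,
# two established outlets carry the accurate account.
pages = [(f"https://mirror{i}.example/story", "false") for i in range(5)]
pages += [
    ("https://outlet-a.example/report", "true"),
    ("https://outlet-b.example/report", "true"),
]
credibility = {"outlet-a.example": 0.9, "outlet-b.example": 0.9}

naive = naive_support(pages)
weighted = weighted_support(pages, credibility)
print(naive)
print(weighted)
```

Under the naive count the false claim wins five pages to two. Once unknown domains receive only a small default weight, the two established outlets outweigh the five mirrors.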

Comparative model performance: a snapshot, not a final grade

While the August numbers show Anthropic’s Claude and Google Gemini in a relatively better light and place ChatGPT and Meta AI among the weaker performers, it is important to understand what these figures actually mean. The test is domain‑specific: it measures susceptibility to a rotating set of circulating falsehoods in news, health, and political topics. A model that scores poorly on this adversarial probe may still perform well on code generation, mathematical reasoning, or summarization of private documents.

Moreover, the sample size is deliberately small — 10–15 claims per month, each prompted in three ways. A single month’s ranking can be sensitive to which specific false narratives were included. Longitudinal aggregation over several months smooths out some of that noise. Vendors also tune their products continuously; the version of Copilot or ChatGPT tested in August 2025 may already behave differently today.
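A quick calculation shows why a single month's ranking is sensitive to claim selection. The sketch assumes each claim is put to a given chatbot once per persona, so 10–15 claims yield 30–45 responses per chatbot. The function name is illustrative.

```python
def claim_granularity(num_claims, prompts_per_claim=3):
    """Return (responses per chatbot, max percentage-point swing from one claim)."""
    responses = num_claims * prompts_per_claim
    # If all prompts for one claim flip outcome, the rate moves by this much.
    max_shift = 100.0 * prompts_per_claim / responses
    return responses, max_shift


for n in (10, 15):
    responses, shift = claim_granularity(n)
    print(f"{n} claims -> {responses} responses, one claim moves the rate up to {shift:.1f} points")
```

With 10 claims, a single narrative accounts for up to 10 percentage points of a chatbot's monthly score; with 15 claims, about 6.7 points. Gaps of a few points between products in one month are therefore within the noise.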

What the results objectively show is that the architectural decision to rely on live web retrieval without equivalent investment in source verification has made chatbots more willing to repeat falsehoods than they were a year ago, and that some vendors have managed that trade‑off more conservatively than others.

What this means for Windows users, IT admins, and content professionals

For anyone using Microsoft Copilot in Edge, Windows, or the Office suite, the practical implications are immediate. The assistant that drafts emails, summarizes meetings, and answers questions about current events is sampling from the same noisy web ecosystem that NewsGuard tested. Without guardrails, the output can carry fabricated facts with an authoritative tone.

First, treat AI outputs as drafts, not evidence. The probabilistic nature of large language models means confident language often masks fragile sourcing. This is especially critical for any content that will be published, forwarded, or used to make business decisions.

Second, demand provenance. If a Copilot response or ChatGPT answer includes citations, click through to the original sources. Verify that they are credible, clearly dated, and corroborated by established news outlets. If the model cannot provide a verifiable source, treat the information as unverified.

Third, build verification into workflows. For enterprise environments, configure retrieval policies to whitelist curated news and fact‑check domains where possible. Use conservative default modes for news and political queries, and require human sign‑off on any AI‑generated content that references third‑party claims.
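Where a retrieval pipeline or proxy layer is under your control, restricting sources to curated domains can be as simple as a hostname check. The following is a generic sketch, not a Copilot or vendor setting: the domain names are placeholders, and whether a given product exposes an equivalent control is something to confirm in its own documentation.

```python
from urllib.parse import urlparse


def is_allowed(url, allowlist):
    """True if the URL's host is an allowlisted domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowlist)


def filter_sources(urls, allowlist):
    """Keep only URLs whose host passes the allowlist check."""
    return [u for u in urls if is_allowed(u, allowlist)]


# Placeholder domains for illustration only.
ALLOWLIST = {"trusted-news.example", "factcheck.example"}
retrieved = [
    "https://www.trusted-news.example/world/story",
    "https://factcheck.example/claims/123",
    "https://trusted-news.example.mirror.example/world/story",  # lookalike host
    "https://content-mill.example/viral",
]
kept = filter_sources(retrieved, ALLOWLIST)
print(kept)
```

Matching on the exact host or a dot-prefixed suffix, rather than a plain substring, keeps lookalike hosts such as the third URL out.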

Practical checklist: steps you can take today

When an assistant answers a current‑events query, do not accept the first answer. Ask “What are your sources?” and independently check them.
For business or public communications, implement a mandatory human review step for any output that concerns news, legal, or health topics.
If you manage Copilot for an organization, use whatever administrative controls are available to restrict web grounding to trusted domains and to surface source information alongside answers.
Use multi‑source verification: if an AI cites a single obscure website, consult at least two established outlets or fact‑checkers before acting on the content.
Monitor the monthly NewsGuard updates and independent reporting to track whether your preferred vendor’s falsehood rate is improving or regressing.
Educate end users: make it clear that “confident” does not equal “correct,” and that the disappearance of “I don’t know” is a feature of the product, not a guarantee of accuracy.

Vendor responsibilities and plausible technical fixes

The NewsGuard audit puts a spotlight on design choices that vendors can adjust. Short‑term fixes include strengthening retrieval stacks to prefer high‑quality, curated domains for news queries, surfacing provenance metadata more prominently, and reintroducing conservative refusal behaviors for high‑risk topics. The trade‑off is reduced immediate helpfulness — a calibration NewsGuard argues is necessary.

Medium‑term approaches involve layered verification: retrieving multiple sources and computing a credibility score before answering, deploying a separate verifier model to fact‑check draft responses, and defaulting to an explanation of uncertainty when verification fails. These techniques increase compute cost and engineering complexity but directly address the failure modes exposed by the audit.
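One way to picture the score-then-answer step is a simple gate in front of the draft response. The sketch below is an assumption-laden illustration, not any vendor's design: the `Source` fields, the thresholds, and the fallback wording are all hypothetical, and the separate verifier model is left out.

```python
from dataclasses import dataclass


@dataclass
class Source:
    domain: str
    supports_claim: bool
    credibility: float  # hypothetical rating between 0.0 and 1.0


UNCERTAIN = "I could not verify this claim against reliable sources."


def credibility_score(sources):
    """Credibility-weighted share of sources that support the draft's claim."""
    total = sum(s.credibility for s in sources)
    if total == 0:
        return 0.0
    return sum(s.credibility for s in sources if s.supports_claim) / total


def answer_or_abstain(draft, sources, min_sources=2, threshold=0.7, min_cred=0.5):
    """Return the draft only if enough credible sources back it; else abstain."""
    backers = [s for s in sources if s.supports_claim and s.credibility >= min_cred]
    if len(backers) < min_sources:
        return UNCERTAIN
    if credibility_score(sources) < threshold:
        return UNCERTAIN
    return draft


well_sourced = [Source("outlet-a.example", True, 0.9), Source("outlet-b.example", True, 0.8)]
laundered = [Source(f"mirror{i}.example", True, 0.1) for i in range(5)]
laundered += [Source("outlet-a.example", False, 0.9), Source("outlet-b.example", False, 0.9)]

print(answer_or_abstain("Draft answer.", well_sourced))
print(answer_or_abstain("Draft answer.", laundered))
```

The second call abstains because the only sources backing the claim are low-credibility mirrors. The cost is the one noted above: more retrieval and scoring work per answer, and more refusals.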

In the long term, the problem cannot be solved inside the model alone. Because disinformation actors exploit the open web ecosystem, technical mitigation requires industry coordination — search engines, hosting providers, and content platforms working together to identify and deprioritize machine‑groomed content. Independent audits like NewsGuard’s will remain essential to track whether those efforts succeed.

The audit’s strengths — and its constraints

NewsGuard’s methodology has notable strengths. Its use of realistic adversarial personas models real user behavior and abuse patterns, and its reliance on human analysts to adjudicate responses reduces the noise that pure automated benchmarks can introduce. De‑anonymizing model‑level results adds accountability and forces vendors to explain their web‑grounding choices.

However, the audit is intentionally narrow. It is a red‑team stress test, not a comprehensive accuracy score. The small, rotating sample size means single‑month rankings should not be treated as permanent report cards. And because vendor configurations change frequently, any number represents a tested instance at a moment in time. Extrapolating the 35% aggregate rate to all types of queries — code, math, creative writing — would be a misinterpretation.

Where the audit truly shines is in illuminating specific, operational failure modes: the all‑but‑eliminated refusal rate, the link between web retrieval and repeated falsehoods, and the variation across products that suggests better design choices are possible.

Where this story could go next

Watch for vendor responses. Already, the publication of named rankings has prompted statements from several AI companies, and we can expect product updates that either tighten retrieval quality or reintroduce cautious refusal modes for news‑sensitive queries. Microsoft, OpenAI, and Google all maintain safety blogs and changelogs that will reflect any such changes.

Broader industry audits are likely to follow. NewsGuard’s de‑anonymized numbers may encourage other organizations to publish their own transparency projects, which would sharpen the picture and apply additional pressure on vendors to invest in verification.

At the web ecosystem level, pressure is growing on search engines and hosting providers to combat AI‑targeted disinformation. If platforms begin to deprioritize content that is clearly designed for machine ingestion, retrieval‑based falsehoods could decline. Conversely, unmitigated proliferation of machine‑targeted propaganda will keep error rates elevated.

Conclusion: a sober, pragmatic posture

The core takeaway from NewsGuard’s audit is clear: making AI more human‑like in its willingness to answer has also made it more convincingly wrong. For Windows users, IT administrators, and content professionals, the responsible posture is straightforward. Assume that any output concerning current events may be incorrect, require provenance for news‑sensitive claims, and never let AI‑generated content bypass human review when accuracy matters.

At the product level, the path forward is equally clear. Vendors must invest in provenance, conservative defaults for high‑risk queries, and layered verification. The web ecosystem must reduce the supply of machine‑targeted disinformation. And independent audits must become a permanent fixture of the accountability landscape. Until systems are engineered to resist the easiest forms of manipulation, the golden rule of generative AI holds: trust, but verify.