USA TODAY Tests Microsoft Copilot for NFL Picks: 8-8 Record, Stale Data, and What It Means for AI in Sports

Microsoft Copilot went 8-8 picking NFL games in Week 1, and USA TODAY is back for Week 2 with a fresh slate of AI-generated forecasts—revealing both the promise and the pitfalls of using conversational AI in high-stakes sports prediction. The experiment, which fed every matchup to Copilot with an identical prompt and published the chatbot’s winner, score, and rationale, is one of the most transparent public tests of how a large language model handles a fast-moving, data-intensive domain. The results offer a case study in what works, what breaks, and why human oversight remains non-negotiable.

Background: How USA TODAY Ran the Experiment

For Week 2 of the 2025 NFL season, USA TODAY Sports reran its simple, repeatable workflow. For each of the 16 games, a reporter prompted Microsoft Copilot with: “Can you predict the winner and the score of the X vs. Y NFL Week 2 game?” The chatbot returned a winning team, a final score, and a brief explanation of its reasoning. The newsroom then published the full set of picks alongside human analysis that evaluated Copilot’s logic, flagged outdated information, and re-prompted the model when necessary.

This approach matters because the NFL operates in an environment where last-minute injury reports, practice participation, and roster moves can swing win probabilities dramatically. A conversational assistant used as a forecasting tool for publishers or bettors must cope with stale knowledge, ambiguous data, and the need to convey calibrated uncertainty—not just a single deterministic score.

Week 2 Picks at a Glance

Copilot’s Week 2 projections came with full scores and short rationales. Highlights include:

Green Bay Packers 27, Washington Commanders 20 – Copilot emphasized Lambeau Field’s home-field advantage and the Packers’ balanced offense.
Cincinnati Bengals 30, Jacksonville Jaguars 23 – A projected shootout driven by Joe Burrow’s passing upside against a vulnerable Jaguars defense.
Dallas Cowboys 27, New York Giants 16 – The AI flagged New York’s offensive inefficiencies, Russell Wilson’s low completion rate in Week 1, and left tackle Andrew Thomas’s injury.
San Francisco 49ers 20, New Orleans Saints 19 – A projection weakened significantly by Brock Purdy’s uncertain status; the 49ers’ defense and Christian McCaffrey’s versatility gave Copilot enough confidence to still pick San Francisco.
Buffalo Bills 30, New York Jets 24 – Josh Allen’s dominant Week 1 performance (424 total yards, four touchdowns) convinced the model that he would be the difference.

The full slate—16 games—appears in the USA TODAY write‑up, each accompanied by the model’s reasoning and a human assessment of that reasoning.

The AI’s Heuristics: How Copilot Reached Its Picks

Across the published outputs, several consistent heuristics drive Copilot’s decisions:

Favor established, track-record quarterbacks. The model repeatedly leans on QB pedigree as a high-signal input, backing names like Burrow, Allen, and Rodgers.
Reward defensive strength and pass-rush advantages. Copilot frequently cites a team’s front seven or high pressure rate as a decisive matchup lever.
Weight venue and historical home advantage. Lambeau Field and Hard Rock Stadium get specific mentions as meaningful context.
Use round, prototypical scoring anchors. Winning teams are often placed in the mid‑to‑high 20s, suggesting the model defaults to plausible averages rather than calibrated variance.

These heuristics are sensible and mirror how many human analysts reason at a glance, but they are not a substitute for high-frequency, validated updates about injuries, practice status, and short‑term roster changes.
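The scoring-anchor pattern can be checked against the five highlighted picks listed earlier. The short Python sketch below is illustrative only: it uses just those five published scores, not the full 16-game slate, so it says nothing about the other eleven picks.

```python
from statistics import mean

# The five highlighted Week 2 picks quoted in this article:
# (winner, winner points, loser, loser points)
picks = [
    ("Packers", 27, "Commanders", 20),
    ("Bengals", 30, "Jaguars", 23),
    ("Cowboys", 27, "Giants", 16),
    ("49ers", 20, "Saints", 19),
    ("Bills", 30, "Jets", 24),
]

winner_scores = [w_pts for _, w_pts, _, _ in picks]
margins = [w_pts - l_pts for _, w_pts, _, l_pts in picks]

avg_winner = mean(winner_scores)   # average points assigned to the winner
avg_margin = mean(margins)         # average predicted margin of victory

print(f"Average winning score: {avg_winner:.1f}")  # 26.8
print(f"Average margin: {avg_margin:.1f}")         # 6.4
print(f"Winner score range: {min(winner_scores)}-{max(winner_scores)}")  # 20-30
```

Four of the five winners land at 27 or 30 points, which is consistent with the anchoring described above, though five games are far too few to draw a firm conclusion.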

Verification: The Facts Behind the Picks

Because conversational assistants can hallucinate or rely on stale data, USA TODAY re-prompted Copilot whenever it produced outdated facts, and the write-up included human checks of several claims. Independent verification remains essential. The most consequential claims checked for this feature, cross-referenced with independent reporting, are listed below:

Brock Purdy’s Week 2 status: Multiple outlets reported the 49ers’ quarterback was a “long shot” to play due to toe and shoulder injuries; Reuters and the NFL’s own reporting both used that phrase. Copilot’s projection acknowledged the uncertainty but still produced a single score without quantifying the range of outcomes.
Josh Allen’s Week 1 explosion: Buffalo’s official recap confirmed 424 total yards and four total touchdowns in a 41‑40 victory. This verified stat supported Copilot’s bullish outlook on the Bills.
Washington’s Lambeau drought: The claim that Washington hadn’t beaten Green Bay at Lambeau since 1988 is accurate. Public game logs and the Commanders’ own historical features confirm the last road win there was October 23, 1988.
Titans QB Cam Ward’s sack total: Reporting shows the rookie was sacked a league‑high six times in his debut, tying a dubious record for a No. 1 overall pick. This validated Copilot’s concern about Tennessee’s offensive line.
Patriots’ Miami struggles: The assertion that New England is 2‑10 at Hard Rock Stadium since 2013 and hasn’t won there since 2019 aligns with historical head‑to‑head data.

One statistic in USA TODAY’s story, “home teams went 13-5 in games played on Thursday during the 2024 NFL season,” could not be confirmed against a single authoritative public source in the time available. The trend is plausible, but the figure should be re-derived from official box scores before it is treated as confirmed.

Strengths of a Copilot‑Driven Forecast Workflow

Speed and repeatability. Copilot produces a complete slate instantly when fed identical prompts, allowing newsrooms to generate consistent, explainable outputs fast.
Transparent, interrogable rationales. Because Copilot is conversational, editors can ask follow‑ups—“Why this pick?”—and get a structured heuristic answer. That supports editorial oversight and rapid revision.
Pattern consistency. The assistant reliably favors low‑variance priors—QB pedigree, trench play, coaching experience—making its reasoning predictable and often aligned with conventional wisdom.

Pitfalls and Limitations

Stale data and hallucinated roster facts. Copilot occasionally produced outdated injury or roster information; USA TODAY had to re‑prompt to correct errors. This manual verification step is non‑negotiable for responsible use.
Overconfidence in single numbers. Returning one score (e.g., “27‑20”) implies precise confidence when actual outcome distributions are wide. Probabilistic calibration or ensemble simulation would be far more informative.
Sensitivity to prompt framing. Small changes—asking for a winner only, a probability, or a three‑scenario forecast—materially change the output. That’s a usability hazard when publishers want standardized templates.
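One way to act on the overconfidence point is to wrap a single predicted score in a simple simulation. The sketch below is a minimal illustration, not USA TODAY's or Microsoft's method. It assumes final margins are normally distributed around the predicted margin with a standard deviation of about 13 points (a rule-of-thumb figure chosen for this example, not a number from the article), and the function names are hypothetical.

```python
import random

# Assumption (not from the article): actual margins scatter around the
# predicted margin with a standard deviation of roughly 13 points.
MARGIN_SD = 13.0


def simulate_margins(pred_winner_pts, pred_loser_pts, n_sims=20_000, seed=0):
    """Draw simulated final margins around one predicted score."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    predicted_margin = pred_winner_pts - pred_loser_pts
    return [rng.gauss(predicted_margin, MARGIN_SD) for _ in range(n_sims)]


def summarize(margins):
    """Return the win probability and an 80% interval for the margin."""
    ordered = sorted(margins)
    n = len(ordered)
    win_prob = sum(m > 0 for m in ordered) / n
    return {
        "win_prob": win_prob,
        "margin_low": ordered[int(0.10 * n)],   # 10th percentile
        "margin_high": ordered[int(0.90 * n)],  # 90th percentile
    }


# Copilot's Packers 27, Commanders 20 pick
summary = summarize(simulate_margins(27, 20))
print(summary)
```

Under that assumed spread, a 27-20 pick corresponds to a win probability of roughly 70 percent, with an 80 percent margin interval that still includes a Commanders win. A different spread assumption would change those numbers, which is the point: the single score alone carries none of this information.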

Editorial Best Practices When Publishing AI‑Assisted Picks

Always disclose model identity and data‑cutoff timestamps. Readers must know whether the assistant had access to week‑of injury reports.
Present calibrated outputs. Convert single‑score predictions into probability ranges (win probability, expected points distribution) or show alternate scenarios (best case, worst case, most likely).
Human‑in‑the‑loop verification. Validate any roster‑level or injury claim against team releases, beat reporting, or the NFL’s official injury report before publication. This was an explicit corrective step in USA TODAY’s workflow.
Avoid amplifying unverified model claims into betting markets without explicit caveats. Widely republished AI picks can influence market behavior.
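The disclosure and verification practices above can be enforced with a small record that travels with each pick. The following is a sketch under stated assumptions: the class names, fields, and the example timestamp are hypothetical and are not drawn from USA TODAY's actual tooling.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class Claim:
    """A factual claim taken from the model's rationale."""
    text: str
    verified_against: Optional[str] = None  # source an editor checked, if any


@dataclass
class PublishedPick:
    """One AI-assisted pick plus the metadata readers should see."""
    model_name: str
    data_cutoff: datetime
    matchup: str
    predicted_score: str
    claims: List[Claim] = field(default_factory=list)

    def unverified_claims(self):
        return [c for c in self.claims if not c.verified_against]

    def ready_to_publish(self):
        # Publish only when every roster or injury claim has a named source.
        return not self.unverified_claims()

    def disclosure(self):
        return (f"Pick generated by {self.model_name}; "
                f"data as of {self.data_cutoff:%Y-%m-%d %H:%M} UTC.")


pick = PublishedPick(
    model_name="Microsoft Copilot",
    data_cutoff=datetime(2025, 9, 11, 15, 0, tzinfo=timezone.utc),  # hypothetical
    matchup="Cowboys vs. Giants",
    predicted_score="Cowboys 27, Giants 16",
    claims=[Claim("Giants left tackle Andrew Thomas is injured")],
)
print(pick.ready_to_publish())  # False: the injury claim has no source yet

pick.claims[0].verified_against = "NFL official injury report"
print(pick.ready_to_publish())  # True
print(pick.disclosure())
```

Keeping the gate in code rather than in a style guide means an unverified injury claim blocks publication by default instead of relying on an editor remembering to check.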

Technical Underpinnings: Why Copilot Behaves This Way

Copilot is a conversational large language model layered on retrieval and knowledge sources. Its behavior in this experiment reflects three technical realities:

Retrieval latency: If a fast-moving roster update was missing from Copilot’s retrieval index or prompt context, the prediction fell back on older priors. That is why USA TODAY sometimes had to re-prompt to correct errors.
Heuristic synthesis: The model converts textual priors (coach reputation, QB history, press reports) into crisp rationales; it is not inherently probabilistic unless prompted to simulate distributions. This leads to plausible but overconfident single‑point forecasts.
Natural tendency to default to prototypical scores: Without an explicit instruction to model variance or run Monte Carlo simulations, Copilot supplies round, “typical” football scores (mid‑to‑high 20s for winners) rather than a calibrated interval.

Practical Implications for Readers, Bettors, Teams, and Editors

Readers: Treat Copilot’s single numbers as hypotheses, not certainties. Use them as a conversation starter rather than a final predictive authority.
Bettors: Don’t rely on one AI’s single‑score output for wagering. Compare AI picks with market odds, injury reports, and probabilistic models that explicitly model variance.
Teams and coaches: Copilot‑style assistants may prove valuable as rapid evidence aggregators on the sideline (clip pulls, personnel matchups). But operational controls, provenance metadata, and human oversight are essential. The NFL and Microsoft are already planning such guardrails as they expand sideline use of the technology.
Editors: Standardize prompts, automatically append the latest injury/practice reports, and always require human confirmation before publishing any claim that could move markets.
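The advice to editors can be sketched as a small prompt builder. The base wording below is the prompt USA TODAY reported using. Everything else is an assumption made for illustration: the function name, the way injury notes are appended, and the example date are all hypothetical.

```python
from datetime import date

# Base wording matches the prompt quoted in USA TODAY's write-up.
BASE_PROMPT = (
    "Can you predict the winner and the score of the "
    "{team_x} vs. {team_y} NFL Week {week} game?"
)


def build_prompt(team_x, team_y, week, injury_lines, as_of):
    """Return the standard prompt with dated injury notes appended."""
    prompt = BASE_PROMPT.format(team_x=team_x, team_y=team_y, week=week)
    if not injury_lines:
        # Say so explicitly rather than silently sending a bare prompt.
        return (f"{prompt}\n\nNo injury report was supplied "
                f"(checked {as_of.isoformat()}).")
    notes = "\n".join(f"- {line}" for line in injury_lines)
    return (f"{prompt}\n\nInjury and practice notes as of "
            f"{as_of.isoformat()}:\n{notes}")


prompt = build_prompt(
    "49ers",
    "Saints",
    2,
    ["49ers QB Brock Purdy: long shot to play (toe, shoulder)"],
    as_of=date(2025, 9, 12),  # hypothetical date for the example
)
print(prompt)
```

Because every matchup goes through the same template, differences between picks reflect the games and the supplied notes rather than incidental changes in wording.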

Ethical and Operational Risks

Market impact and feedback loops. Widely published, deterministic AI picks could shift betting markets in predictable ways, potentially altering future model inputs and creating reinforcement loops. Disclosing uncertainty can mitigate this risk.
Reputation risk from factual errors. If an assistant asserts a player will play when they are inactive, outlets face legal exposure and reputational damage. Manual verification is an editorial imperative.
Vendor lock‑in and governance. As leagues embed a single vendor’s copilots into mission‑critical workflows, governance processes are needed for provenance, privacy, and data‑use agreements with players and teams.

What This Experiment Teaches Us

USA TODAY’s Week 1 and Week 2 Copilot experiments are disciplined demonstrations of both the promise and limitations of conversational AI in sports journalism. The assistant consistently reasons in ways that mirror intuitive human analysis—valuing quarterback pedigree, defensive strength, and home‑field effects—and it does so at scale with transparent rationales. That makes it a powerful editorial tool for scenario generation and content velocity.

But the work also underlines a blunt truth: in fast‑moving, high‑variance domains like the NFL, data freshness, probabilistic calibration, and human verification are non‑negotiable. The single‑score outputs that read confidently in print hide the uncertainty that bettors, teams, and readers need to make responsible decisions. When used properly—with provenance metadata, cross‑checks against official injury reports, and probabilistic framing—Copilot and tools like it can accelerate coverage and surface useful insights. Left unchecked, they risk amplifying stale facts and overstating confidence in inherently uncertain contests.

Key verifications in this piece—from Brock Purdy’s Week 2 long‑shot status to Josh Allen’s Week 1 totals and Cam Ward’s sack‑heavy debut—were cross‑checked against contemporary reporting and team recaps to ensure readers get not just the AI’s picks, but also a fact‑checked assessment of why those picks make sense (or don’t). If there’s one practical lesson from USA TODAY’s rollout, it’s this: treat generative assistants as scenario engines—fast, explainable hypothesis generators—and not as single‑line authorities. With the right human processes layered on top, Copilot can help editors cover more ground faster; without those processes, it’s simply an eloquent oracle that can confidently state yesterday’s facts as today’s certainties.