When Microsoft Copilot writes Python code to crunch your spreadsheet numbers, it delivers a computation, not a guarantee. That distinction, underscored by a recent spate of public demo errors and model limitations, is reshaping how enterprises and journalists approach AI-assisted data analysis. The technology promises speed and democratization, but it also introduces subtle risks that demand a new discipline: read the code, verify the output, and govern the data.

Microsoft’s Copilot brand has ballooned into a family of products spanning GitHub, Microsoft 365, Windows, and Edge. Quietly, and sometimes loudly, Copilot has gained the ability to generate and execute code—mostly Python—as part of answering natural-language prompts. This change is monumental. When a large language model (LLM) produces text, you’re dealing with a probabilistic sequence of words. When it writes and runs code, you’re staring at the result of an actual computation. But that computation isn’t inherently correct. As the Online Journalism Blog notes, generative AI tools “are not calculators,” yet they increasingly impersonate them by generating, executing, and surfacing Python code directly within Excel workbooks and other Office apps.

This article dives into how Copilot performs data analysis today, where it excels, where it fails, and what every user—from the code-literate analyst to the non-technical reporter—must do to turn AI-assisted analysis into a trustworthy asset.

The Copilot Conundrum: Code-Generated Analysis vs. Numerical Accuracy

The core tension is simple: Copilot can now run code, so its outputs look more authoritative than ever, but the underlying model can still produce flawed logic. In February 2025, Microsoft pushed Python-driven Copilot features to a broad set of Windows and web users, explicitly marketing the capability as a way to “gain deeper insights without needing to be a Python expert.” The pitch is seductive: ask a plain-English question, get a chart or a forecast, skip the Python learning curve.

But public demonstrations of the latest general-purpose models, including OpenAI’s GPT-5, launched in August 2025, showed chart errors and arithmetic slips. Those incidents weren’t just about language prediction; they involved code that ran incorrectly and produced visuals that looked plausible but were mathematically wrong. The lesson isn’t that the models are useless. It’s that the addition of code execution raises the stakes: a mistake can propagate into a business decision, a published article, or a financial filing.

How Copilot Crunches Numbers: The Two-Step Dance

Under the hood, a typical Copilot data-analysis workflow has two steps:

  1. A user submits a natural-language prompt asking for an analysis, chart, or transformation.
  2. Copilot generates Python code (or a sequence of Excel formulas), often leaning on libraries like pandas, matplotlib, or Altair, then executes that code in a controlled runtime and returns the output—numbers, tables, or charts—plus an explanation of what it did.

Microsoft documents this explicitly for Copilot in Excel: ask for advanced analytics, and Copilot will “automatically generate, explain, and insert Python code into your Excel spreadsheet.” That visibility matters. Unlike earlier pure-text LLM responses, you can inspect the exact logic that produced the result. For the first time, AI-assisted analysis offers a path to auditability—if you choose to take it.
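
To make that inspection concrete, here is a minimal sketch of the kind of pandas code an assistant might insert for a prompt such as "total revenue by region, highest first." It is not actual Copilot output; the `sales` table and its `region` and `revenue` columns are hypothetical stand-ins for a worksheet range.

```python
import pandas as pd

# Hypothetical data standing in for a worksheet range (names are assumptions).
sales = pd.DataFrame(
    {
        "region": ["North", "North", "South", "South", "West"],
        "revenue": [1200.0, 800.0, 950.0, 1150.0, 400.0],
    }
)

# The lines a reviewer should read closely: which column is the grouping key,
# which column is aggregated, and which aggregation (sum, not mean) is used.
revenue_by_region = (
    sales.groupby("region", as_index=False)["revenue"]
    .sum()
    .sort_values("revenue", ascending=False)
    .reset_index(drop=True)
)

print(revenue_by_region)
```

Even in a snippet this short, three decisions are visible and checkable: the grouping key, the aggregation, and the sort order. A prose answer alone would hide all three.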

Strengths: Speed, Democratization, and Transparency

Copilot genuinely accelerates data work in several ways:

  • Rapid iteration: It can scaffold an analysis in seconds—load data, clean it, run an aggregate, and return a plot. For exploratory analysis, the speed is transformative.
  • Democratization of complex tools: Embedding Python in Excel and wiring Copilot to generate that code brings powerful libraries to analysts without formal coding training.
  • Transparency (when used correctly): Because Copilot often shows the code it executed, technically literate users can audit the logic, verify calculations, and spot errors—a stark improvement over opaque LLM text outputs.
  • Cross-product consistency: The same mental model applies across GitHub Copilot and Office Copilot, letting organizations leverage shared governance for both coding and spreadsheet work.

These strengths explain why adoption is surging. GitHub’s free Copilot tier and Microsoft’s integrated Office Copilots are lowering barriers. Yet the very feature that makes Copilot powerful—code generation—is also its Achilles’ heel.

The Hard Truth: Where Copilot Stumbles

Not a Calculator—and Not Infallible Math

LLMs remain probabilistic language models, not symbolic math engines. They can and do produce arithmetic or logical errors. Even sophisticated models have been caught making basic decimal or visualization mistakes in public demos. The Online Journalism Blog recalls that ChatGPT “used to be notoriously bad at maths. Then it got worse at maths. And the recent launch of its newest model, GPT-5, showed that it’s still bad at maths.” When a model writes code, those errors can become embedded in the runtime output, yielding numbers that look authoritative but are wrong.

Hidden Assumptions and Context Sensitivity

A Copilot-generated Python snippet may silently assume a specific data schema, default treatment of missing values, particular aggregation windows, or internal data sampling. Without explicit declarations, slightly different prompts or updated model versions can produce divergent numeric answers. This sensitivity means the same Copilot prompt across different tenants or locales might yield inconsistent results.
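
A small illustration of how default missing-value handling changes an answer, assuming pandas is the library in use. The `scores` table is hypothetical; the point is that the same data gives different averages and group counts depending on defaults that generated code may never state.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing score and one missing team label.
scores = pd.DataFrame(
    {
        "team": ["A", "A", "B", "B", None],
        "score": [10.0, np.nan, 20.0, 40.0, 30.0],
    }
)

# pandas skips NaN by default, so this averages 4 values, not 5.
mean_skipping_missing = scores["score"].mean()            # (10 + 20 + 40 + 30) / 4 = 25.0

# Treating the missing score as zero is a different, equally silent choice.
mean_missing_as_zero = scores["score"].fillna(0).mean()   # 100 / 5 = 20.0

# groupby drops rows whose key is missing unless dropna=False is passed.
groups_default = scores.groupby("team")["score"].sum()
groups_keep_missing = scores.groupby("team", dropna=False)["score"].sum()

print(mean_skipping_missing, mean_missing_as_zero)
print(len(groups_default), len(groups_keep_missing))
```

Neither result is wrong in itself. The risk is that the generated code picks one without saying so, which is why explicit `fillna`, `dropna`, or `skipna` choices are worth insisting on.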

Hallucinations and Grounding Failures

When Copilot “grounds” answers in tenant data or web sources, it can still hallucinate supporting facts or misattribute numbers, so users must treat Copilot outputs as drafts, not verdicts. Grounding also means prompts and data pass through Microsoft’s services: the company stores interaction records for Copilot activity history, while stating that this data isn’t used to retrain foundation models. Two related data-handling and licensing risks deserve attention:

  • Copilot interactions are recorded and stored (subject to tenant controls), and uploaded files may be retained briefly for processing. Microsoft encrypts stored prompts and responses under contractual commitments, but organizations with sensitive data must apply governance as they would for any external compute service.
  • Code suggestions from GitHub Copilot derive from training on public code, raising intellectual property and licensing questions for outputs that resemble training data. GitHub and Microsoft have published mitigating guidance, but downstream reuse still carries risk.

Lessons from the GPT-5 Era: More Capable, Still Error-Prone

Recent model launches teach two critical lessons for professionals who use AI for numbers:

  • Independent, multi-source verification is essential. Public demo slip-ups—and the post-hoc corrections that followed—prove that claims about model accuracy must be tested using independent datasets and methods. Relying on a single model output without cross-validation is risky.
  • Model improvements do not eliminate the need for human oversight. Reports praising reasoning gains often arrive alongside examples of arithmetic or visual mistakes. The net effect: improvements reduce but don’t remove error modes. Treat Copilot outputs as assisted analysis, not final decisions.

For high-stakes numbers, run at least two independent checks: re-run the analysis with a different prompt or tool, and reproduce the results manually if feasible.
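
One way to run such a cross-check is to compute the same aggregate twice by unrelated routes and compare the results within a tolerance. The sketch below assumes pandas for the first route and a plain Python loop for the second; the data and the `totals_agree` helper are hypothetical.

```python
import math
from collections import defaultdict

import pandas as pd

# Hypothetical input rows: (region, revenue).
rows = [
    ("North", 1200.0), ("North", 800.0),
    ("South", 950.0), ("South", 1150.0),
    ("West", 400.0),
]

# Method 1: the pandas route an assistant would typically generate.
frame = pd.DataFrame(rows, columns=["region", "revenue"])
pandas_totals = frame.groupby("region")["revenue"].sum().to_dict()

# Method 2: an independent plain-Python accumulation.
manual_totals = defaultdict(float)
for region, revenue in rows:
    manual_totals[region] += revenue


def totals_agree(a, b, rel_tol=1e-9):
    """True if both mappings have the same keys and near-equal values."""
    return a.keys() == b.keys() and all(
        math.isclose(a[key], b[key], rel_tol=rel_tol) for key in a
    )


assert totals_agree(pandas_totals, dict(manual_totals)), "methods disagree"
```

Agreement between two routes does not prove the analysis answers the right question, but disagreement is a cheap and reliable signal that something needs a closer look.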

Practical Rules for Safe Data Analysis with Copilot

  • Read the generated code before accepting results. Step through pandas operations, check group-bys, and ensure missing-value handling is explicit.
  • Use versioned runtimes and pinned libraries. Differences in library versions can alter results; pin versions or note the runtime used.
  • Write unit tests and assertions inside the notebook. Add sanity checks: totals should sum correctly, percentages should be bounded, row counts should match expectations.
  • Run a separate, auditable script. Export the Copilot-generated code into an orchestrated script with logging so the process is repeatable and auditable.
  • Ground outputs in raw data snapshots. Save input dataset versions with timestamps and checksums to prove what Copilot computed and when.
  • Prefer deterministic methods when possible. Seed random operations or avoid stochastic approaches in exploratory summaries.
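
Several of these rules can be sketched with the standard library alone. The example below is illustrative rather than prescriptive: the figures, `check_summary`, and `installed_version` are hypothetical names, and the thresholds are assumptions to adapt to your own data.

```python
import platform
import random
from importlib import metadata

# Hypothetical output of an AI-generated aggregation.
category_totals = {"rent": 1200.0, "food": 450.0, "travel": 350.0}
reported_grand_total = 2000.0
expected_row_count = 3


def check_summary(totals, grand_total, expected_rows):
    """Raise AssertionError if the summary violates basic invariants."""
    assert len(totals) == expected_rows, "row count changed unexpectedly"
    assert abs(sum(totals.values()) - grand_total) < 1e-9, "totals do not sum"
    shares = {name: value / grand_total for name, value in totals.items()}
    assert all(0.0 <= share <= 1.0 for share in shares.values()), "share out of bounds"
    return shares


def installed_version(package):
    """Return the installed version of a package, or a marker if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return "not installed"


shares = check_summary(category_totals, reported_grand_total, expected_row_count)

# Note the runtime alongside the results so a rerun can match it.
runtime_note = {
    "python": platform.python_version(),
    "pandas": installed_version("pandas"),
}

# Seeding makes any sampling step repeatable.
sample_a = random.Random(42).sample(range(1000), 5)
sample_b = random.Random(42).sample(range(1000), 5)
assert sample_a == sample_b
```

The assertions fail loudly rather than letting a bad total flow into a chart, and the runtime note gives a later reviewer enough detail to reproduce the environment.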

If You Don’t Understand Code (or Don’t Want To)

  • Don’t use Copilot alone for critical numbers. If you can’t read and verify the code, treat Copilot as an ideation or visualization helper only; get a code-literate teammate to audit any computation.
  • Request verbose, step-by-step explanations and ask Copilot to “show the code and intermediate results.” But don’t rely on those explanations as proof—they are generated text and must be verified.
  • Limit inputs to non-sensitive, public datasets or sanitized extracts. Avoid uploading proprietary or personal data unless governance controls are in place.
  • Use built-in Excel formulas for final, auditable figures when feasible. Native Excel formulas are easier for non-coders to inspect than a Python block.

An Audit Checklist Before You Publish Numbers Derived from Copilot

  1. Confirm the raw data snapshot used by Copilot (file name, checksum, row count).
  2. Export the generated Python code and run it in a controlled environment you can inspect.
  3. Add sanity checks:
    - Totals and subtotals equal expected sums.
    - Date windows and time-zone assumptions are explicit.
    - No silent type coercions (strings parsed as numbers).
  4. Recompute the same result with a second method or tool (e.g., Excel pivot, SQL query).
  5. Verify any charts: axis scales match reported numbers, labels correspond to data columns.
  6. Ask Copilot to show its intermediate results (group-by tables, aggregated frames) and compare them to your independent run.
  7. Store the entire analysis package (data, code, outputs) in version control or a secure archive for later auditing.
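
Two of the checklist items, fingerprinting the raw snapshot and catching values that would be silently coerced, can be sketched as follows. The inline CSV bytes and the `snapshot_fingerprint` and `non_numeric_rows` helpers are hypothetical; in practice the bytes would come from the saved input file.

```python
import csv
import hashlib
import io

# Hypothetical raw export; in practice, read these bytes from the saved file.
raw_bytes = b"date,amount\n2025-01-01,100\n2025-01-02,250\n2025-01-03,n/a\n"


def snapshot_fingerprint(data):
    """Return the parsed rows plus a checksum and row count for the audit trail."""
    rows = list(csv.DictReader(io.StringIO(data.decode("utf-8"))))
    summary = {"sha256": hashlib.sha256(data).hexdigest(), "row_count": len(rows)}
    return rows, summary


def non_numeric_rows(rows, column):
    """List (row index, raw value) pairs that cannot be parsed as numbers."""
    problems = []
    for index, row in enumerate(rows):
        try:
            float(row[column])
        except ValueError:
            problems.append((index, row[column]))
    return problems


rows, snapshot = snapshot_fingerprint(raw_bytes)
bad_amounts = non_numeric_rows(rows, "amount")

print(snapshot["row_count"], snapshot["sha256"][:12])
print(bad_amounts)
```

Recording the checksum next to the published figure lets anyone confirm later that a rerun used the same input, and the list of unparseable values shows exactly which rows a lenient parser would have dropped or converted.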

Governance and Policy: What Organizations Should Demand

  • Explicit retention and deletion rules. Confirm how long Copilot stores prompts and outputs. Microsoft documents retention policies and provides controls to delete conversations. For sensitive datasets, require contractual guarantees and process isolation (e.g., on-prem or VNET-isolated runtimes) before allowing uploads.
  • Audit logs and exportability. Ensure the platform provides complete activity logs that can be exported for compliance reviews.
  • Model training assurances. Obtain written confirmation of whether organization data will be used to improve foundation models. Microsoft states user-uploaded content is not used to train its foundation LLMs, though prompts and interactions are stored for activity history.
  • IP and licensing controls. For code produced by GitHub Copilot, review licensing guidance and legal opinion before republishing or shipping it in products.

When to Use Copilot—and When to Avoid It

Copilot is well suited to:
  • Exploratory analysis, chart prototyping, and quick statistical summaries.
  • Bridging a skill gap: helping analysts learn Python idioms by example.
  • Generating complex visualizations that would otherwise require significant developer time.

Copilot is not suited to:
  • Producing final, auditable numbers for regulatory filings or financial statements without strict verification.
  • Handling highly sensitive PII/PHI without specialized privacy controls.
  • Replacing a domain expert’s judgment in edge cases where context and nuance affect interpretation.

Augmented Judgment, Not Replacement

Generative AI assistants that create and execute code are transformative for data work, but they shift the locus of responsibility rather than eliminating it. When Copilot writes Python and returns a chart, you’ve been handed a calculation, not a guarantee. The value arises when the user treats that calculation as a product to be audited: inspect the code, rerun it under known conditions, and cross-verify with independent methods.

For journalists, analysts, and enterprises, the rules should be explicit: use Copilot to accelerate analysis, but maintain human oversight, reproducibility, and governance. The technology reduces friction—it does not remove the need for skepticism. The public incidents and model reviews of the last year prove the point: models have dramatically advanced, yet they still make avoidable numerical and representation errors. Those errors are no longer opaque hallucinations; they are code-level mistakes you can detect—if you know where to look.

As AI becomes embedded in the tools we use daily, the real differentiator won’t be which assistant can generate the slickest chart, but which users and organizations have built the discipline to verify what the assistant produces. Copilot is not the last word—it’s the first draft of an analysis that must be rigorously audited. That’s not a limitation of the technology; it’s a foundational requirement for responsible use.