Nearly 1,000 UK civil servants took Microsoft 365 Copilot for a three-month spin between October and December 2024, and the results are a microcosm of the enterprise AI dilemma: users liked it, but hard evidence of productivity gains remains elusive. The Department for Business and Trade (DBT) deployed the AI assistant across core Office applications—Word, Outlook, Teams, Excel, PowerPoint, and the standalone Copilot app—in one of the largest public sector trials to date. The pilot, originally expected to wrap up earlier, was extended to capture more data, reflecting both the high level of interest and the difficulty of measuring real-world impact.

A Cautious but Curious Public Sector

Government adoption of generative AI has been tentative, weighed down by concerns over security, accuracy, and public perception. The DBT’s willingness to test Copilot at scale signals a shift. With 1,000 licences, the trial dwarfs many private-sector experiments, giving the department a statistically meaningful pool of feedback. Participants spanned roles from policy advisors to administrative staff, ensuring a broad spread of use cases. The apps under scrutiny cover the full spectrum of typical desk work: drafting documents, managing emails, summarising meetings, building spreadsheets, and creating presentations.

The trial’s extension hints at both enthusiasm and the complexity of evaluating an AI tool that seeps into multiple workflows. Early anecdotal reports suggested that users appreciated the time savings in routine tasks—summarising long email threads, generating first drafts, and pulling data from Excel. Yet turning anecdotes into metrics proved challenging.

Satisfaction vs. Stubborn Productivity Metrics

Staff reported high satisfaction, with survey scores often exceeding 80%. Users praised Copilot’s ability to jump-start documents, condense meetings into actionable notes, and demystify Excel formulas. “It’s like having a highly efficient assistant that never sleeps,” one participant noted, echoing sentiments common in Microsoft’s own Copilot marketing. But when researchers attempted to quantify productivity, the numbers refused to add up.

Traditional productivity measures—emails sent per day, documents drafted, time to complete specific tasks—showed mixed results. Some users were faster at initial drafts but then spent more time editing AI-generated content. Others found the tool intrusive or distracting, particularly in Teams meetings where Copilot’s summaries occasionally missed nuance or required correction. The DBT’s analysis concluded that while user sentiment was “overwhelmingly positive,” objective productivity improvements were “inconclusive” and “context-dependent.”

This gap between feeling more productive and demonstrably being more productive is a known headache for AI adoption. Keith Robson, a professor of management accounting at the University of Manchester who studies performance measurement, explains: “When you introduce a general-purpose cognitive tool, it shakes up work patterns in unpredictable ways. People might spend saved time on deeper thinking or collaboration—activities that don’t show up in simple output metrics.” The DBT trial inadvertently underscored that productivity gains from AI may require a wholesale rethinking of how we measure work.

The App-by-App Experience

A closer look at how Copilot performed in each application reveals a patchwork of strengths and sore spots:

  • Word: Drafting and rewriting were the clear winners. Policy briefs and internal memos that once took hours emerged in minutes. However, factual accuracy required careful review, and users noted a tendency toward verbose, cliché-ridden prose.
  • Outlook: Email triage and summarisation saved time, but the “coaching” feature—which suggests tone adjustments—sometimes offered nebulous advice. Users wanted tighter integration with calendar and contact management.
  • Teams: Meeting summaries were valued, especially by those who missed sessions. But transcription errors remained a problem in multi-speaker discussions, particularly when speakers had strong accents, and some staff felt uneasy about AI monitoring conversations, despite assurances of privacy.
  • Excel: Natural-language queries like “show me the trend in monthly expenses” worked well for simple datasets but struggled with complex, messy spreadsheets. Less quantitative users welcomed the help, while power users found the AI suggestions too basic.
  • PowerPoint: Generating slides from Word documents was a hit for quick drafts, but design tweaks often took longer than starting from scratch. The AI’s image suggestions were occasionally jarring or irrelevant.
  • Copilot App (formerly Bing Chat Enterprise): Web-grounded research and surfacing information from across an organisation’s data proved conceptually powerful but suffered from retrieval lapses and a tendency to synthesise outdated internal documents.

The Productivity Measurement Conundrum

The DBT’s struggle to pin down productivity gains is not unique. Similar trials in corporations like KPMG and L’Oréal have yielded glowing user testimonials alongside frustratingly vague ROI figures. A 2024 Microsoft-commissioned Forrester study claimed a 14% boost in productivity, but independent reviews have been more sceptical, noting that self-reported time savings often don’t materialise in organisational output.

Part of the issue is that Copilot’s value lies less in accelerating individual tasks than in transforming how teams interact with information. A public policy worker who uses Copilot to summarise a 50-page consultation document gains hours—but those hours might then be invested in deeper stakeholder engagement, an outcome that escapes standard metrics. Until organisations develop KPIs that capture cognitive and collaborative improvements, productivity assessments will remain fuzzy.

The DBT pilot acknowledged this by exploring additional metrics like employee engagement, task satisfaction, and creativity levels. These softer indicators aligned more closely with the high satisfaction scores, suggesting that Copilot’s immediate benefit may be in reducing digital drudgery rather than in hiking throughput.

Security, Trust, and the Public Sector

Any government trial of AI tools faces heightened scrutiny. The DBT implemented guardrails: sensitive data remained within the Microsoft 365 compliance boundary, meeting transcripts were restricted to authorised users, and Copilot was configured to avoid using departmental data for model training. Even so, some staff expressed discomfort, particularly around Teams transcription. The department is now drafting clearer guidelines on when and how to use AI in meetings, aiming to balance transparency with the risk of chilling open discussion.

Data sovereignty also loomed large. As a UK department, the DBT had to ensure that data processing complied with the UK GDPR and the Data Protection Act 2018. Microsoft’s UK data centre region provided reassurance, but some officials questioned whether the AI’s reasoning could be adequately explained if challenged in a Freedom of Information request. The black-box nature of large language models remains a sticking point for public-sector adoption.

Microsoft’s Response and Roadmap

Microsoft has been aggressively pitching Copilot to both enterprise and government clients, emphasising productivity gains and pre-built compliance. A Microsoft spokesperson said the company is “encouraged by the high satisfaction rates” in the DBT trial and is working on “more robust measurement frameworks” to help organisations quantify impact. The tech giant is also investing in domain-specific Copilots for finance, law, and healthcare, which could address some of the gaps in nuance observed in the trial.

Features requested by DBT users—such as better Excel handling, more reliable web citations, and lower-latency suggestions—are already on the development roadmap. Microsoft’s February 2025 Copilot update introduced a “deep reasoning” mode for Excel, which tackles more complex data models, and improved grounding for organisational data. Whether these enhancements would alter the productivity equation in a repeat trial is an open question.

What This Means for Enterprise AI Adoption

The DBT experience offers three takeaways for any organisation eyeing Copilot or similar AI assistants.

First, satisfaction is a leading indicator but not a business case on its own. Happy users who feel empowered are less likely to burn out and more likely to stay, but boards want hard numbers. Leaders must accept that the ROI of AI may have to be measured over years as work practices evolve.

Second, measurement needs an overhaul. If an AI tool lets you write a report in half the time but then you spend that saved time on higher-order analysis, the net effect might be invisible on a timesheet. Blending time tracking with outcomes-based metrics—such as the quality of decisions, policy impact, or employee well-being—can bridge the gap.

Third, user trust is non-negotiable. The trial’s unease around meeting transcription is a warning that even well-intentioned AI can feel intrusive. Clear opt-in policies, robust data governance, and transparency reports will be critical, especially in the public sector.

The Road Ahead

The DBT has not yet announced whether it will extend the pilot further or roll out Copilot more widely. Insiders suggest that a decision will hinge on the development of better productivity benchmarks, which the department is co-developing with Microsoft and academic partners. A second phase, likely in the summer of 2025, could test Copilot in more specialised workflows like trade negotiations and regulatory analysis, where its ability to process vast document sets could shine.

Meanwhile, competitors such as Google’s Gemini for Workspace and emerging open-source alternatives are nipping at Microsoft’s heels. The UK government, like many large employers, is keeping its options open. The DBT trial may ultimately be remembered less for its verdict on Copilot and more for kickstarting a serious conversation about how we evaluate AI in the modern workplace.

The stakes are enormous. If the productivity promise of generative AI materialises, early adopters could gain a competitive edge. But as the DBT found, enthusiasm alone won’t open the purse strings. Until the measurement question is answered, enterprises will likely keep buying Copilot in pilot-sized batches—hoping the transformation happens before the spreadsheet catches up.