AI delivery metrics that actually matter

Every PMO measuring AI usage in 2026 is measuring the wrong things. Prompts per user. Hours saved (self-reported). Tool licence utilisation. These are vanity metrics dressed up in delivery clothing. They tell you nothing about whether AI is making the work better, only that people are clicking the icon.

Here's what we'd measure instead, and why.

The vanity metrics, and what's wrong with them

  • Prompts per user per week. Measures activity, not outcomes. The PM doing 200 throwaway prompts looks more "engaged" than the one doing five careful ones for the steerco pack.
  • Self-reported hours saved. Inflated by social desirability, deflated by people who don't want extra work assigned to them, and impossible to validate. Useless for decisions.
  • Tool licence utilisation. Tells you who logged in, not what changed. Plenty of people log in daily and produce nothing different from six months ago.
  • NPS-style "AI helpfulness" scores. Measures mood, not impact. The same PM might rate it 9/10 in May and quietly stop using it by August.

The dashboard trap

PMO leaders ask for AI metrics because they need to report up. Vendors supply the easy ones because they're easy to capture. The resulting dashboard looks like progress, reports like progress, and is mostly noise.

What to measure instead

1. Cycle time on specific recurring artifacts

Pick three artifacts your PMs produce repeatedly — status pack, steerco brief, RAID review. Measure end-to-end time from "start" to "shipped" before AI was in the workflow, then after. Track quarterly.

This is harder than self-report, but it's the only honest answer to "is AI making this faster?" Where it didn't, ask why. Often the bottleneck wasn't drafting — it was approval cycles or stakeholder availability, which AI doesn't touch.
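
As a rough illustration of the before-and-after comparison, here is a minimal Python sketch. The record layout, the `baseline` and `with_ai` labels, the sample dates, and both function names are assumptions made for this example. It uses the median rather than the mean so one stalled pack doesn't swamp the figure, and it counts calendar days so that approval waits show up in the number.

```python
from datetime import date
from statistics import median

# Hypothetical log rows: (artifact, period, started, shipped).
# "baseline" = before AI was in the workflow, "with_ai" = after.
records = [
    ("status pack", "baseline", date(2025, 3, 3), date(2025, 3, 7)),
    ("status pack", "baseline", date(2025, 4, 1), date(2025, 4, 7)),
    ("status pack", "with_ai", date(2026, 3, 2), date(2026, 3, 5)),
    ("status pack", "with_ai", date(2026, 4, 1), date(2026, 4, 3)),
]

def median_cycle_days(rows):
    """Median calendar days from start to shipped, per (artifact, period)."""
    buckets = {}
    for artifact, period, started, shipped in rows:
        buckets.setdefault((artifact, period), []).append((shipped - started).days)
    return {key: median(days) for key, days in buckets.items()}

def cycle_time_change(rows):
    """Change in median days per artifact versus baseline (negative = faster)."""
    medians = median_cycle_days(rows)
    return {
        artifact: medians[(artifact, "with_ai")] - medians[(artifact, "baseline")]
        for artifact, period in medians
        if period == "baseline" and (artifact, "with_ai") in medians
    }

print(cycle_time_change(records))
```

With the sample rows, the status pack's median falls from 5 days to 2.5. An artifact whose number barely moves is the cue to look at approvals and stakeholder availability rather than drafting.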

2. Quality drift on those same artifacts

Faster doesn't matter if quality dropped. Have someone independent — ideally a senior PM or the head of delivery — rate a sample of those artifacts blind. Pre-AI vs post-AI. Look for: specificity, signal density, presence of awkward truths the org needs to hear.

If quality went up, AI is genuinely helping. If quality stayed flat and speed went up, you've found leverage. If quality went down at all, you have a problem regardless of what the time-saved number says.
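
The three readings above can be written down as a small decision rule. This is a sketch under assumptions: the function name, the 1-to-5 scoring scale in the usage lines, and the `flat_band` threshold for what counts as "flat" are all hypothetical. The fourth outcome, where neither quality nor speed moved, is not covered by the rule in the text and is added here only so the function always returns something.

```python
from statistics import mean

def classify_quality_drift(pre_scores, post_scores, cycle_time_change_days,
                           flat_band=0.25):
    """Combine blind quality scores with the cycle-time change.

    pre_scores / post_scores: blind ratings for pre-AI and post-AI samples.
    cycle_time_change_days: negative means the artifact now ships faster.
    flat_band: assumed threshold; a rise at or below it is treated as flat.
    """
    delta = mean(post_scores) - mean(pre_scores)
    if delta < 0:
        # Any drop counts as a problem, whatever the time saving.
        return "problem"
    if delta > flat_band:
        return "helping"
    if cycle_time_change_days < 0:
        # Quality held steady while speed improved.
        return "leverage"
    return "no measurable change"

print(classify_quality_drift([3, 3, 4], [4, 4, 4], -2.5))
print(classify_quality_drift([3, 4], [3, 4], -2.5))
print(classify_quality_drift([4, 4], [4, 3], -2.5))
```

The rule is deliberately asymmetric: a small rise can be written off as noise, but a fall of any size is reported.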

3. Decision velocity on flagged risks

How long does it take, on average, for a newly flagged risk to get to a decision? AI should compress this through better drafts, faster synthesis, and less rework. If it isn't, AI is making PMs faster at producing stuff that doesn't move the work forward.
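
A minimal sketch of that average, assuming a risk log that records when each risk was flagged and when (if ever) a decision was made. The log layout, risk IDs, and function name are hypothetical. Risks still awaiting a decision are counted separately, since leaving them out silently would flatter the average.

```python
from datetime import date
from statistics import mean

# Hypothetical risk log: (risk_id, flagged_on, decided_on or None if still open).
risk_log = [
    ("R-101", date(2026, 1, 5), date(2026, 1, 19)),
    ("R-102", date(2026, 1, 12), date(2026, 1, 18)),
    ("R-103", date(2026, 2, 2), None),
]

def decision_velocity(log):
    """Mean days from flag to decision, plus how many risks are still undecided."""
    days = [(decided - flagged).days
            for _, flagged, decided in log if decided is not None]
    still_open = sum(1 for _, _, decided in log if decided is None)
    return {
        "mean_days_to_decision": mean(days) if days else None,
        "still_open": still_open,
    }

print(decision_velocity(risk_log))
```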

4. Stakeholder trust signals

Harder to instrument, but possible. Track:

  • Sponsor questions on packs (more = trust dropping).
  • Time from pack distribution to first reply (longer = less engaged).
  • Pack-driven decisions vs decisions deferred to follow-up conversations (more deferrals = packs aren't sufficient).

These move slowly, but they move. If your AI-augmented packs are producing more sponsor questions and more deferred decisions, that's the actual cost showing up.

5. AI failure incidents

Count, per quarter, the number of times an AI-produced or AI-influenced artifact required correction, retraction, or caused an escalation. This is the metric every PMO avoids tracking because it feels punitive. Track it anyway. Trending to zero is the goal; flatlining at non-zero is a culture problem; rising is a process problem.
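
One way to turn the quarterly counts into the three readings above is sketched below. Comparing only the latest quarter with the one before it is an assumption made to keep the example short, and the function name is hypothetical; a real PMO might prefer a longer window.

```python
def read_incident_trend(quarterly_counts):
    """Interpret AI failure incident counts, oldest quarter first."""
    if len(quarterly_counts) < 2:
        raise ValueError("need at least two quarters to read a trend")
    previous, latest = quarterly_counts[-2], quarterly_counts[-1]
    if latest > previous:
        return "rising: process problem"
    if latest < previous or latest == 0:
        # Either falling, or already holding at zero.
        return "trending to zero: on track"
    return "flat at non-zero: culture problem"

print(read_incident_trend([5, 3, 2]))
print(read_incident_trend([3, 3]))
print(read_incident_trend([2, 4]))
```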

6. Adoption depth, not breadth

Instead of "% of PMs using AI" (a number that hits 95% within weeks and then tells you nothing), measure how many PMs use AI for at least three distinct workflow steps per week. The depth number is much smaller, decays differently, and correlates much better with actual change.
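
The difference between the two numbers is easy to see in code. The sketch below assumes usage events can be exported as (PM, ISO week, workflow step) rows; that layout, the names, and the step labels are all hypothetical. Repeated use of the same step counts once, because depth means distinct steps, not volume.

```python
from collections import defaultdict

# Hypothetical usage events: (pm, iso_week, workflow_step).
usage_events = [
    ("alice", "2026-W10", "status pack draft"),
    ("alice", "2026-W10", "risk synthesis"),
    ("alice", "2026-W10", "steerco brief"),
    ("alice", "2026-W10", "status pack draft"),  # repeat, same step
    ("bob", "2026-W10", "status pack draft"),
    ("bob", "2026-W10", "status pack draft"),
]

def breadth_and_depth(events, week, min_steps=3):
    """Return (PMs using AI at all, PMs using it in min_steps+ distinct steps)."""
    steps_by_pm = defaultdict(set)
    for pm, event_week, step in events:
        if event_week == week:
            steps_by_pm[pm].add(step)
    breadth = len(steps_by_pm)
    depth = sum(1 for steps in steps_by_pm.values() if len(steps) >= min_steps)
    return breadth, depth

print(breadth_and_depth(usage_events, "2026-W10"))
```

In the sample week both PMs count towards breadth, but only one reaches three distinct steps.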

A minimum viable AI dashboard

  1. Cycle time on three target artifacts, vs baseline.
  2. Blind quality score on the same artifacts, quarterly sampling.
  3. Decision velocity on flagged risks.
  4. AI failure incident count, with trend line.
  5. Depth-of-use count (PMs using AI in 3+ workflow steps weekly).

Five numbers. Hard to gather. Hard to game. Together they tell you whether the AI investment is producing delivery outcomes or just producing activity.

The principle

Good metrics for AI in delivery are the same as good metrics for anything in delivery: they measure change in outcome, not change in activity. The metrics that are easy to capture — prompts, hours, NPS — capture activity. The ones that take effort capture outcomes. PMOs that pick the easy ones will keep wondering why their AI dashboard looks healthy while nothing visibly changes.