Engineering Productivity Metrics in the AI Era: What Actually Matters
DORA, SPACE, and the new metrics you need after Copilot — what to measure, what to ignore, and how to avoid Goodharting your team.
- Published: April 12, 2026
- Read time: 11 min
- Author: One Frequency
Engineering productivity measurement was already hard. AI coding tools made it harder. The metrics that worked in 2022 — output volume, story points, PRs per week — are now actively misleading. Anyone can produce a lot of code with Cursor. The question is whether the code that ships is the right code, ships safely, and creates the outcomes the business cares about.
This article walks through the four DORA metrics and why they are necessary but no longer sufficient. Then SPACE. Then the AI-specific gotchas. Then a sample dashboard you can actually implement, with a warning about Goodhart's law at the end.
DORA: still the foundation
The four DORA metrics, refined over a decade of research, remain the right starting point:
| Metric | What it measures | 2026 elite benchmark |
|--------|------------------|----------------------|
| Deployment frequency | How often you ship to production | Multiple times per day |
| Lead time for changes | Time from commit to production | Under one hour |
| Change failure rate | Percent of deploys causing degradation | Under 15 percent |
| Mean time to restore | Time to recover from incident | Under one hour |
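For concreteness, here is a minimal sketch of how the first two rows could be computed, assuming you can export deploy events that carry a deploy timestamp and the timestamp of the earliest commit included in each deploy. The record shape and field names are illustrative, not any particular tool's schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median

@dataclass
class Deploy:
    # Illustrative shape; adapt to whatever your CI/CD exporter emits.
    deployed_at: datetime
    earliest_commit_at: datetime  # oldest commit included in this deploy

def lead_time_hours(deploys: list[Deploy]) -> float:
    """Median hours from commit to production across a set of deploys."""
    deltas = [(d.deployed_at - d.earliest_commit_at).total_seconds() / 3600
              for d in deploys]
    return median(deltas)

def deploy_frequency_per_day(deploys: list[Deploy], window_days: int = 30) -> float:
    """Average deploys per day over a trailing window."""
    cutoff = max(d.deployed_at for d in deploys) - timedelta(days=window_days)
    recent = [d for d in deploys if d.deployed_at >= cutoff]
    return len(recent) / window_days
```

Change failure rate and MTTR come from your incident tracker rather than git, but the same pattern applies: pull the raw events, aggregate weekly, trend the result.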
These are still the right metrics. With AI assistance, you should see:
- Lead time drop (less time on coding, less time on review with PR agents)
- Deploy frequency rise (less friction per change)
- Change failure rate hold steady or improve (PR agents catching obvious bugs)
- MTTR drop (AIOps tools assisting incident response)
If you have AI everywhere and your DORA numbers are flat, the AI tooling is not creating leverage. That is a finding, not a failure of measurement. Our deployment frequency improvement playbook digs into the upstream blockers when this happens.
Why DORA alone is not enough
DORA tells you the system is performing. It does not tell you:
- Whether developers are productive at the individual level
- Whether the work being shipped is the right work
- Whether your engineers are burning out from the speed-up
- Whether AI tools are creating new categories of defects you have not yet detected
- Whether developers trust and want to keep using the AI tools
This is where SPACE comes in.
SPACE: the five dimensions
The SPACE framework (Forsgren et al., 2021) gives you five complementary dimensions:
- Satisfaction and well-being — Are developers satisfied with their work?
- Performance — Quality and impact of outcomes
- Activity — Volume of work (use sparingly, never alone)
- Communication and collaboration — Team-level interaction quality
- Efficiency and flow — Ability to make progress without interruption
The point of SPACE is not that you measure all five — it is that you avoid measuring only one. A single-dimension dashboard always lies. A balanced dashboard at least has internal contradictions you can investigate.
AI-specific gotchas
The new metrics conversation has specific traps:
Trap 1: Suggestion acceptance rate is a vanity metric
GitHub publishes acceptance rate. So does Cursor. Both show 30 to 40 percent across users. This number tells you almost nothing about whether the AI is creating leverage. A developer can accept a suggestion that is wrong, then spend 15 minutes fixing it. Acceptance happened. Leverage did not.
What to measure instead: time-to-merge of AI-assisted PRs vs non-AI-assisted PRs, escape defect rate in the same segments.
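A minimal sketch of that segment comparison, assuming each PR record carries an open timestamp, a merge timestamp, and one of the {none, light, heavy} labels described in Trap 3 below. The field names are illustrative.

```python
from collections import defaultdict
from statistics import median

def time_to_merge_by_segment(prs: list[dict]) -> dict[str, float]:
    """Median hours from PR open to merge, grouped by AI-usage label.

    Each pr is expected to look like:
      {"opened_at": datetime, "merged_at": datetime | None, "ai_label": "none"|"light"|"heavy"}
    """
    hours = defaultdict(list)
    for pr in prs:
        if pr.get("merged_at") is None:
            continue  # skip unmerged PRs
        delta = (pr["merged_at"] - pr["opened_at"]).total_seconds() / 3600
        hours[pr["ai_label"]].append(delta)
    return {label: median(vals) for label, vals in hours.items()}
```

Run the same grouping over escape defects (sketched under metric 1 below) and you have a two-sided view: did the AI segment merge faster, and did it break more?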
Trap 2: Raw output volume is misleading
Lines of code per developer per week. Number of PRs. These metrics were already weak before AI; now they are worthless. A developer can generate 5,000 lines of Cursor output in a day, and none of it might solve the actual problem.
What to measure instead: outcome metrics — features shipped, customer-reported bugs fixed, business KPIs moved.
Trap 3: "AI-generated PR" tracking matters but is hard
You want to know which PRs were AI-assisted, to compare against non-AI PRs. But:
- Most developers use AI for some of the diff, not all of it
- No accurate way to mark "this PR was 60 percent AI" exists
- Self-reporting is unreliable
- IDE telemetry exists but does not flow to your PR tracker
Pragmatic answer: ask developers to label PRs with one of {none, light, heavy} AI usage. Accept that this is imprecise. Use it as a directional signal only.
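One way to make the convention stick is a small bot that nags when the label is missing. Here is a minimal sketch against the GitHub REST API; the `ai:none` / `ai:light` / `ai:heavy` label names, the `GITHUB_TOKEN` environment variable, and the use of the `requests` library are all assumptions, not a prescribed setup.

```python
import os
import requests

# Illustrative label convention; adjust to taste.
AI_LABELS = {"ai:none", "ai:light", "ai:heavy"}

def ensure_ai_label(owner: str, repo: str, pr_number: int) -> None:
    """If a PR carries none of the AI-usage labels, leave a comment asking for one."""
    headers = {
        "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
        "Accept": "application/vnd.github+json",
    }
    pr = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}/pulls/{pr_number}",
        headers=headers, timeout=10).json()
    labels = {label["name"] for label in pr.get("labels", [])}
    if labels & AI_LABELS:
        return  # already labeled
    requests.post(
        f"https://api.github.com/repos/{owner}/{repo}/issues/{pr_number}/comments",
        headers=headers, timeout=10,
        json={"body": "Please add one of `ai:none`, `ai:light`, `ai:heavy` "
                      "so this PR can be segmented on the metrics dashboard."})
```

Wire it to run on PR open and the labeling habit forms within a sprint or two. Remember it stays a directional signal either way.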
Trap 4: Developer sentiment lags by months
When you roll out a new AI tool, the first month of survey data is negative — change fatigue. The third month is positive — adaptation. The sixth month is the real baseline. If you measure at month one and pull the tool, you wasted the rollout.
What to do: commit to a six-month measurement window before reaching conclusions on developer-impact metrics.
The metrics you should add for AI
Beyond DORA and SPACE, four AI-specific additions:
1. Escape defect rate, segmented
Of all bugs reported by customers in 30 days, what percent came from PRs in each segment: human-only, AI-assisted, AI-heavy. If the AI segment is materially worse, you have a quality problem hidden inside the velocity gain.
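A minimal sketch of the segmentation, assuming each customer-reported bug can be traced back to the PR that introduced it (via bisects, postmortems, or blame) and that PRs carry the {none, light, heavy} label. The record shapes are illustrative.

```python
from collections import Counter

def escape_defect_rate_by_segment(prs: list[dict], bugs: list[dict]) -> dict[str, float]:
    """Customer-reported bugs per merged PR, grouped by AI-usage label.

    prs:  [{"number": 101, "ai_label": "heavy"}, ...]
    bugs: [{"introduced_by_pr": 101}, ...]
    """
    label_by_pr = {pr["number"]: pr["ai_label"] for pr in prs}
    prs_per_label = Counter(pr["ai_label"] for pr in prs)
    bugs_per_label = Counter(
        label_by_pr[b["introduced_by_pr"]]
        for b in bugs if b["introduced_by_pr"] in label_by_pr)
    return {label: bugs_per_label[label] / count
            for label, count in prs_per_label.items()}
```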
2. Time on cognition-heavy work
Survey-based, quarterly. "What percent of your week was spent on work that required deep focus and original thought?" The hypothesis is that AI should increase this — by absorbing the rote work. If it is decreasing, your developers are getting captured by AI babysitting instead of liberated by it.
3. Flow time
Instrumented via IDE telemetry where possible. Duration of focused coding sessions without interruption (no Slack, no email, no meetings). Aggregate per developer per week. AI-augmented workflows should increase this. Meeting load and Slack noise still kill it.
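A minimal sketch of flow-time sessionization, assuming your IDE telemetry can be exported as per-developer activity timestamps. The 15-minute gap threshold and 25-minute minimum session are assumptions to tune, not standards.

```python
from datetime import datetime, timedelta

GAP = timedelta(minutes=15)          # a pause longer than this ends a session
MIN_SESSION = timedelta(minutes=25)  # shorter bursts do not count as flow

def weekly_flow_hours(events: list[datetime]) -> float:
    """Sum the duration of uninterrupted coding sessions from raw activity timestamps."""
    if not events:
        return 0.0
    events = sorted(events)
    total = timedelta()
    start = prev = events[0]
    for ts in events[1:]:
        if ts - prev > GAP:
            if prev - start >= MIN_SESSION:
                total += prev - start
            start = ts
        prev = ts
    if prev - start >= MIN_SESSION:
        total += prev - start
    return total.total_seconds() / 3600
```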
4. Developer trust score
Single quarterly question, Likert 1-5: "The AI tools I use make me a more effective engineer." Trend it. Watch for divergence by team — if frontend says 4 and backend says 2, you have a tool-fit issue.
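A minimal sketch of the divergence check, assuming survey responses arrive as (team, score) pairs. The one-point divergence threshold is an assumption.

```python
from collections import defaultdict
from statistics import mean

def trust_by_team(responses: list[tuple[str, int]], threshold: float = 1.0) -> dict:
    """Mean Likert score per team, plus a flag when teams diverge by more than `threshold`."""
    scores = defaultdict(list)
    for team, score in responses:
        scores[team].append(score)
    means = {team: round(mean(vals), 2) for team, vals in scores.items()}
    divergence = max(means.values()) - min(means.values()) if means else 0.0
    return {"means": means, "divergent": divergence > threshold}

# Example: trust_by_team([("frontend", 4), ("frontend", 4), ("backend", 2), ("backend", 3)])
```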
A sample 8-metric dashboard
Here is an eight-metric dashboard for an AI-augmented engineering org. Eight is enough. Twelve is too many. Four hides too much.
| # | Metric | Frequency | Source |
|---|--------|-----------|--------|
| 1 | Deployment frequency | Daily | CI/CD |
| 2 | Lead time for changes | Daily | Git + CI/CD |
| 3 | Change failure rate | Weekly | Incident tracker |
| 4 | Mean time to restore | Weekly | Incident tracker |
| 5 | Escape defect rate by PR segment | Monthly | Bug tracker + PR labels |
| 6 | Developer satisfaction with AI tools | Quarterly | Survey |
| 7 | Flow time per engineer per week | Weekly | IDE telemetry |
| 8 | Percent of week on cognition-heavy work | Quarterly | Survey |
The mix is intentional: four lagging system metrics (DORA), one composite quality metric (escape rate), and three developer-experience metrics. The DORA four tell you the system is working. The other four tell you whether the humans inside the system are actually thriving.
Where to source the data
- Git — Direct query against GitHub or your VCS API
- CI/CD — Your runner exports duration and outcome
- Bug tracker — Jira, Linear, or wherever bugs land
- PR labels — Convention or a small bot that prompts the author on PR open
- IDE telemetry — GitHub Copilot Metrics API, Cursor Analytics, Wakatime, or self-built
- Survey — CultureAmp, Lattice, or a simple Google Form quarterly
You can wire all of this into PostHog, Mixpanel, your data warehouse, or a Metabase dashboard. The tool matters less than the discipline of looking at the dashboard weekly.
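As one concrete wiring option, here is a minimal sketch that lands a weekly snapshot of the dashboard metrics in a local SQLite table as a stand-in for your warehouse. The table and column names are illustrative; swap the connection for BigQuery, Snowflake, or whatever backs your Metabase instance.

```python
import sqlite3
from datetime import date

def write_snapshot(metrics: dict[str, float], db_path: str = "metrics.db") -> None:
    """Append one row per metric for this week's dashboard snapshot."""
    con = sqlite3.connect(db_path)
    con.execute("""CREATE TABLE IF NOT EXISTS weekly_metrics (
                     week TEXT, metric TEXT, value REAL)""")
    week = date.today().isoformat()
    con.executemany("INSERT INTO weekly_metrics VALUES (?, ?, ?)",
                    [(week, name, value) for name, value in metrics.items()])
    con.commit()
    con.close()

# Example: write_snapshot({"deploy_frequency_per_day": 8.2, "lead_time_hours": 14.5})
```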
Goodhart's law: the warning
Every metric you publish becomes a target. Every target gets optimized. And many of those optimizations are gaming rather than genuine improvement.
- Publish lead time, and PRs get artificially small to drop the number
- Publish deploy frequency, and trivial config changes get split into separate deploys
- Publish AI acceptance rate, and developers click Accept and then immediately rewrite
- Publish flow time, and developers learn to not close their laptop during meetings
The mitigations:
- Use balanced metrics: any single metric can be gamed, but gaming three at once usually creates visible contradictions
- Pair leading and lagging indicators: if lead time drops but escape rate rises, the improvement is probably being gamed
- Watch the variance, not just the mean: gaming often shows up as a suspiciously tight cluster around the target (see the sketch after this list)
- Talk to humans: ask engineers "how are you doing" in 1:1s; the qualitative signal catches gaming before the quantitative one does
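Here is a minimal sketch of that variance check: flag a metric whose values cluster unusually tightly right around a published target. The 10 percent and 5 percent thresholds are assumptions to tune, and the heuristic degenerates for targets near zero.

```python
from statistics import mean, stdev

def looks_gamed(values: list[float], target: float) -> bool:
    """Heuristic: low spread with the mean hugging the target suggests gaming."""
    if len(values) < 5:
        return False  # not enough data to judge
    near_target = abs(mean(values) - target) < 0.10 * target  # within 10% of target
    tight = stdev(values) < 0.05 * target                     # unusually low variance
    return near_target and tight
```

Treat a positive flag as a prompt for a conversation, not a verdict.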
What not to measure
A short list of metrics that look reasonable and are not:
- Lines of code per developer
- Commits per day
- AI suggestion acceptance rate (use only for tool tuning, never for performance)
- Story points completed (especially across teams)
- Hours logged
- PR count without size and outcome adjustment
Putting any of these on a public dashboard guarantees the wrong behavior.
A measurement cadence
A practical rhythm:
- Daily: System metrics auto-refresh, no human review needed
- Weekly: 15-minute team review of the four DORA metrics and the segmented escape rate trend
- Monthly: Cross-team review, look for divergence, identify investigations
- Quarterly: Developer survey, full dashboard review, set targets for next quarter
The quarterly review is the most important. That is where you decide what to measure differently next quarter, what to retire, what new question matters.
How to handle the executive ask: "show me developer productivity went up"
The CFO wants a number. The CEO wants a graph. The board wants a story. None of these audiences will sit through a 30-minute explanation of why SPACE is more nuanced than a single bar chart.
The framing that works:
- Lead with outcome metrics: "We are deploying 2.5x more often with the same failure rate." Outcomes the business cares about.
- Use a small basket of metrics, never a single number: Single numbers get questioned. A basket of three tells a story.
- Show variance, not just averages: "Our top quartile is shipping 4x more, our bottom quartile is shipping 1.5x more." Honest.
- Acknowledge the limits: "These metrics measure system performance, not individual effort. We supplement with developer surveys for the human side."
The executive who gets a clean, defensible, balanced metric story trusts engineering. The one who gets a single inflated number and questions it later does not.
Team-level vs individual-level measurement
A clear rule: never publish individual engineer metrics. Ever.
- DORA metrics aggregate at the team or service level
- SPACE metrics aggregate at the team level (with the satisfaction dimension collected individually and reported in aggregate)
- Quality metrics aggregate at the service or repo level
- Productivity comparisons happen at the team-to-team level, with caveats
Individual performance assessment happens through 1:1 conversations, peer review, and outcome attribution — not through dashboards. The moment you publish individual engineer metrics, you have built an environment where everyone games the metric. The metric stops measuring anything real.
Special cases
A few situations that need adapted measurement:
Platform and infrastructure teams
Platform teams ship to other teams, not to customers. DORA still applies but the "customer" is an internal team. Add:
- Developer time saved per quarter (survey-based)
- Adoption rate of platform services
- Self-service success rate (issues resolved without platform team involvement)
Research and exploratory teams
R&D teams should not be measured on DORA at all. The work is fundamentally different. Use:
- Insights documented per quarter
- Experiments concluded per quarter
- Production features influenced by research output
On-call and incident response teams
DORA's MTTR is necessary. Add:
- Pages per engineer per week (workload)
- Pages resolved without escalation
- Toil identified and eliminated per quarter
- On-call satisfaction (specifically that dimension of SPACE)
Connecting metrics back to ROI
The hard conversation is always: "what did the AI tools actually get us?" The answer should not be "acceptance rate is 35 percent." It should be a clear chain:
- Lead time dropped from 3.2 days to 1.8 days
- Change failure rate held at 12 percent
- Deploy frequency rose from 4 to 11 deploys per day
- Escape defects per release held steady
- Developer satisfaction with AI tools held at 4.0 out of 5
- Therefore the tools are creating leverage, not just activity (a rough check of this chain is sketched below)
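A minimal sketch of turning that chain into a repeatable check. The metric names are illustrative, and the values in the usage comment simply restate the numbers above with placeholder escape-defect counts.

```python
def leverage_verdict(before: dict, after: dict) -> list[str]:
    """Compare before/after metrics and report where AI tooling did, or did not, create leverage."""
    findings = [
        f"Lead time: {before['lead_time_days']}d -> {after['lead_time_days']}d "
        f"({before['lead_time_days'] / after['lead_time_days']:.1f}x faster)",
        f"Deploys/day: {before['deploys_per_day']} -> {after['deploys_per_day']}",
    ]
    if after["change_failure_rate"] > before["change_failure_rate"]:
        findings.append("WARNING: failure rate rose; velocity gain may be gamed")
    if after["escape_defects_per_release"] > before["escape_defects_per_release"]:
        findings.append("WARNING: escape defects rose; quality problem hidden in the gain")
    return findings

# Illustrative values (escape defect counts are placeholders):
# leverage_verdict(
#     {"lead_time_days": 3.2, "deploys_per_day": 4, "change_failure_rate": 0.12,
#      "escape_defects_per_release": 2},
#     {"lead_time_days": 1.8, "deploys_per_day": 11, "change_failure_rate": 0.12,
#      "escape_defects_per_release": 2})
```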
If you are sharpening this calculation specifically for Copilot, the Copilot ROI measurement guide walks the financial model in more depth.
Next steps
Pick four metrics from the eight-metric dashboard above. Implement those well, with weekly review and clear ownership. Add the next four over the following two quarters. Resist the urge to measure everything from day one — partial measurement well-done beats comprehensive measurement done poorly. If you want help wiring the dashboard or facilitating the quarterly review, reach out and we can help shape the cadence.
Ready to ship the next outcome?
One Frequency Consulting brings 25+ years of technology leadership and military discipline to every engagement. First call is operator-grade scoping — sixty minutes, no charge.