Engineering Productivity Metrics in the AI Era: What Actually Matters
DORA, SPACE, and the new metrics you need after Copilot — what to measure, what to ignore, and how to avoid Goodharting your team.
- Published: April 12, 2026
- Read time: 11 min
- Author: One Frequency
Engineering productivity measurement was already hard. AI coding tools made it harder. The metrics that worked in 2022 — output volume, story points, PRs per week — are now actively misleading. Anyone can produce a lot of code with Cursor. The question is whether the code that ships is the right code, ships safely, and creates the outcomes the business cares about.
This article walks through the four DORA metrics and why they are necessary but no longer sufficient. Then SPACE. Then the AI-specific gotchas. Then a sample dashboard you can actually implement, with a warning about Goodhart's law at the end.
DORA: still the foundation
The four DORA metrics, refined over a decade of research, remain the right starting point:
| Metric | What it measures | 2026 elite benchmark |
|--------|------------------|----------------------|
| Deployment frequency | How often you ship to production | Multiple times per day |
| Lead time for changes | Time from commit to production | Under one hour |
| Change failure rate | Percent of deploys causing degradation | Under 15 percent |
| Mean time to restore | Time to recover from incident | Under one hour |
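For concreteness, here is a minimal sketch of how the first two rows could be computed, assuming you can export deploy events that carry a deploy timestamp and the timestamp of the earliest commit included in each deploy. The record shape and field names are illustrative, not any particular tool's schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median

@dataclass
class Deploy:
    # Illustrative shape; adapt to whatever your CI/CD exporter emits.
    deployed_at: datetime
    earliest_commit_at: datetime  # oldest commit included in this deploy

def lead_time_hours(deploys: list[Deploy]) -> float:
    """Median hours from commit to production across a set of deploys."""
    deltas = [(d.deployed_at - d.earliest_commit_at).total_seconds() / 3600
              for d in deploys]
    return median(deltas)

def deploy_frequency_per_day(deploys: list[Deploy], window_days: int = 30) -> float:
    """Average deploys per day over a trailing window."""
    cutoff = max(d.deployed_at for d in deploys) - timedelta(days=window_days)
    recent = [d for d in deploys if d.deployed_at >= cutoff]
    return len(recent) / window_days
```

Change failure rate and MTTR come from your incident tracker rather than git, but the same pattern applies: pull the raw events, aggregate weekly, trend the result.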
These are still the right metrics. With AI assistance, you should see:
- Lead time drop (less time on coding, less time on review with PR agents)
- Deploy frequency rise (less friction per change)
- Change failure rate hold steady or improve (PR agents catching obvious bugs)
- MTTR drop (AIOps tools assisting incident response)
If you have AI everywhere and your DORA numbers are flat, the AI tooling is not creating leverage. That is a finding, not a failure of measurement. Our deployment frequency improvement playbook digs into the upstream blockers when this happens.
Why DORA alone is not enough
DORA tells you the system is performing. It does not tell you:
- Whether developers are productive at the individual level
- Whether the work being shipped is the right work
- Whether your engineers are burning out from the speed-up
- Whether AI tools are creating new categories of defects you have not yet detected
- Whether developers trust and want to keep using the AI tools
This is where SPACE comes in.
SPACE: the five dimensions
The SPACE framework (Forsgren et al., 2021) gives you five complementary dimensions:
- Satisfaction and well-being — Are developers satisfied with their work?
- Performance — Quality and impact of outcomes
- Activity — Volume of work (use sparingly, never alone)
- Communication and collaboration — Team-level interaction quality
- Efficiency and flow — Ability to make progress without interruption
The point of SPACE is not that you measure all five — it is that you avoid measuring only one. A single-dimension dashboard always lies. A balanced dashboard at least has internal contradictions you can investigate.
AI-specific gotchas
The new metrics conversation has specific traps:
Trap 1: Suggestion acceptance rate is a vanity metric
GitHub publishes acceptance rate. So does Cursor. Both show 30 to 40 percent across users. This number tells you almost nothing about whether the AI is creating leverage. A developer can accept a suggestion that is wrong, then spend 15 minutes fixing it. Acceptance happened. Leverage did not.
What to measure instead: time-to-merge of AI-assisted PRs vs non-AI-assisted PRs, escape defect rate in the same segments.
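A minimal sketch of that segment comparison, assuming each PR record carries an open timestamp, a merge timestamp, and one of the {none, light, heavy} labels described in Trap 3 below. The field names are illustrative.

```python
from collections import defaultdict
from statistics import median

def time_to_merge_by_segment(prs: list[dict]) -> dict[str, float]:
    """Median hours from PR open to merge, grouped by AI-usage label.

    Each pr is expected to look like:
      {"opened_at": datetime, "merged_at": datetime | None, "ai_label": "none"|"light"|"heavy"}
    """
    hours = defaultdict(list)
    for pr in prs:
        if pr.get("merged_at") is None:
            continue  # skip unmerged PRs
        delta = (pr["merged_at"] - pr["opened_at"]).total_seconds() / 3600
        hours[pr["ai_label"]].append(delta)
    return {label: median(vals) for label, vals in hours.items()}
```

Run the same grouping over escape defects (sketched under metric 1 below) and you have a two-sided view: did the AI segment merge faster, and did it break more?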
Trap 2: Raw output volume is misleading
Lines of code per developer per week. Number of PRs. These metrics were already weak before AI; now they are worthless. A developer can generate 5,000 lines of Cursor output in a day, and none of it might solve the actual problem.
What to measure instead: outcome metrics — features shipped, customer-reported bugs fixed, business KPIs moved.
Trap 3: "AI-generated PR" tracking matters but is hard
You want to know which PRs were AI-assisted, to compare against non-AI PRs. But:
- Most developers use AI for some of the diff, not all of it
- No accurate way to mark "this PR was 60 percent AI" exists
- Self-reporting is unreliable
- IDE telemetry exists but does not flow to your PR tracker
Pragmatic answer: ask developers to label PRs with one of {none, light, heavy} AI usage. Accept that this is imprecise. Use it as a directional signal only.
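One way to make the convention stick is a small bot that nags when the label is missing. Here is a minimal sketch against the GitHub REST API; the `ai:none` / `ai:light` / `ai:heavy` label names, the `GITHUB_TOKEN` environment variable, and the use of the `requests` library are all assumptions, not a prescribed setup.

```python
import os
import requests

# Illustrative label convention; adjust to taste.
AI_LABELS = {"ai:none", "ai:light", "ai:heavy"}

def ensure_ai_label(owner: str, repo: str, pr_number: int) -> None:
    """If a PR carries none of the AI-usage labels, leave a comment asking for one."""
    headers = {
        "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
        "Accept": "application/vnd.github+json",
    }
    pr = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}/pulls/{pr_number}",
        headers=headers, timeout=10).json()
    labels = {label["name"] for label in pr.get("labels", [])}
    if labels & AI_LABELS:
        return  # already labeled
    requests.post(
        f"https://api.github.com/repos/{owner}/{repo}/issues/{pr_number}/comments",
        headers=headers, timeout=10,
        json={"body": "Please add one of `ai:none`, `ai:light`, `ai:heavy` "
                      "so this PR can be segmented on the metrics dashboard."})
```

Wire it to run on PR open and the labeling habit forms within a sprint or two. Remember it stays a directional signal either way.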
Trap 4: Developer sentiment lags by months
When you roll out a new AI tool, the first month of survey data is negative — change fatigue. The third month is positive — adaptation. The sixth month is the real baseline. If you measure at month one and pull the tool, you wasted the rollout.
What to do: commit to a six-month measurement window before reaching conclusions on developer-impact metrics.
The metrics you should add for AI
Beyond DORA and SPACE, four AI-specific additions:
1. Escape defect rate, segmented
Of all bugs reported by customers in 30 days, what percent came from PRs in each segment: human-only, AI-assisted, AI-heavy. If the AI segment is materially worse, you have a quality problem hidden inside the velocity gain.
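A minimal sketch of the segmentation, assuming each customer-reported bug can be traced back to the PR that introduced it (via bisects, postmortems, or blame) and that PRs carry the {none, light, heavy} label. The record shapes are illustrative.

```python
from collections import Counter

def escape_defect_rate_by_segment(prs: list[dict], bugs: list[dict]) -> dict[str, float]:
    """Customer-reported bugs per merged PR, grouped by AI-usage label.

    prs:  [{"number": 101, "ai_label": "heavy"}, ...]
    bugs: [{"introduced_by_pr": 101}, ...]
    """
    label_by_pr = {pr["number"]: pr["ai_label"] for pr in prs}
    prs_per_label = Counter(pr["ai_label"] for pr in prs)
    bugs_per_label = Counter(
        label_by_pr[b["introduced_by_pr"]]
        for b in bugs if b["introduced_by_pr"] in label_by_pr)
    return {label: bugs_per_label[label] / count
            for label, count in prs_per_label.items()}
```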
2. Time on cognition-heavy work
Survey-based, quarterly. "What percent of your week was spent on work that required deep focus and original thought?" The hypothesis is that AI should increase this — by absorbing the rote work. If it is decreasing, your developers are getting captured by AI babysitting instead of liberated by it.
3. Flow time
Instrumented via IDE telemetry where possible. Duration of focused coding sessions without interruption (no Slack, no email, no meetings). Aggregate per developer per week. AI-augmented workflows should increase this. Meeting load and Slack noise still kill it.
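A minimal sketch of flow-time sessionization, assuming your IDE telemetry can be exported as per-developer activity timestamps. The 15-minute gap threshold and 25-minute minimum session are assumptions to tune, not standards.

```python
from datetime import datetime, timedelta

GAP = timedelta(minutes=15)          # a pause longer than this ends a session
MIN_SESSION = timedelta(minutes=25)  # shorter bursts do not count as flow

def weekly_flow_hours(events: list[datetime]) -> float:
    """Sum the duration of uninterrupted coding sessions from raw activity timestamps."""
    if not events:
        return 0.0
    events = sorted(events)
    total = timedelta()
    start = prev = events[0]
    for ts in events[1:]:
        if ts - prev > GAP:
            if prev - start >= MIN_SESSION:
                total += prev - start
            start = ts
        prev = ts
    if prev - start >= MIN_SESSION:
        total += prev - start
    return total.total_seconds() / 3600
```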
4. Developer trust score
Single quarterly question, Likert 1-5: "The AI tools I use make me a more effective engineer." Trend it. Watch for divergence by team — if frontend says 4 and backend says 2, you have a tool-fit issue.
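A minimal sketch of the divergence check, assuming survey responses arrive as (team, score) pairs. The one-point divergence threshold is an assumption.

```python
from collections import defaultdict
from statistics import mean

def trust_by_team(responses: list[tuple[str, int]], threshold: float = 1.0) -> dict:
    """Mean Likert score per team, plus a flag when teams diverge by more than `threshold`."""
    scores = defaultdict(list)
    for team, score in responses:
        scores[team].append(score)
    means = {team: round(mean(vals), 2) for team, vals in scores.items()}
    divergence = max(means.values()) - min(means.values()) if means else 0.0
    return {"means": means, "divergent": divergence > threshold}

# Example: trust_by_team([("frontend", 4), ("frontend", 4), ("backend", 2), ("backend", 3)])
```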
A sample 8-metric dashboard
Here is an eight-metric dashboard for an AI-augmented engineering org. Eight is enough. Twelve is too many. Four hides too much.
| # | Metric | Frequency | Source |
|---|--------|-----------|--------|
| 1 | Deployment frequency | Daily | CI/CD |
| 2 | Lead time for changes | Daily | Git + CI/CD |
| 3 | Change failure rate | Weekly | Incident tracker |
| 4 | Mean time to restore | Weekly | Incident tracker |
| 5 | Escape defect rate by PR segment | Monthly | Bug tracker + PR labels |
| 6 | Developer satisfaction with AI tools | Quarterly | Survey |
| 7 | Flow time per engineer per week | Weekly | IDE telemetry |
| 8 | Percent of week on cognition-heavy work | Quarterly | Survey |
The mix is intentional: four lagging system metrics (DORA), one composite quality metric (escape rate), and three developer-experience metrics. The DORA four tell you the system is working. The other four tell you whether the humans inside the system are actually thriving.
Where to source the data
- Git — Direct query against GitHub or your VCS API
- CI/CD — Your runner exports duration and outcome
- Bug tracker — Jira, Linear, or wherever bugs land
- PR labels — Convention or a small bot that prompts the author on PR open
- IDE telemetry — GitHub Copilot Metrics API, Cursor Analytics, Wakatime, or self-built
- Survey — CultureAmp, Lattice, or a simple Google Form quarterly
You can wire all of this into PostHog, Mixpanel, your data warehouse, or a Metabase dashboard. The tool matters less than the discipline of looking at the dashboard weekly.
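As one concrete wiring option, here is a minimal sketch that lands a weekly snapshot of the dashboard metrics in a local SQLite table as a stand-in for your warehouse. The table and column names are illustrative; swap the connection for BigQuery, Snowflake, or whatever backs your Metabase instance.

```python
import sqlite3
from datetime import date

def write_snapshot(metrics: dict[str, float], db_path: str = "metrics.db") -> None:
    """Append one row per metric for this week's dashboard snapshot."""
    con = sqlite3.connect(db_path)
    con.execute("""CREATE TABLE IF NOT EXISTS weekly_metrics (
                     week TEXT, metric TEXT, value REAL)""")
    week = date.today().isoformat()
    con.executemany("INSERT INTO weekly_metrics VALUES (?, ?, ?)",
                    [(week, name, value) for name, value in metrics.items()])
    con.commit()
    con.close()

# Example: write_snapshot({"deploy_frequency_per_day": 8.2, "lead_time_hours": 14.5})
```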
Goodhart's law: the warning
Every metric you publish becomes a target. Every target gets optimized. And many of those optimizations are gaming rather than genuine improvement.
- Publish lead time, and PRs get artificially small to drop the number
- Publish deploy frequency, and trivial config changes get split into separate deploys
- Publish AI acceptance rate, and developers click Accept and then immediately rewrite
- Publish flow time, and developers learn to not close their laptop during meetings
The mitigations:
- Use balanced metrics: any single metric can be gamed, but gaming three at once usually creates visible contradictions
- Pair leading and lagging indicators: if lead time drops but escape rate rises, the improvement is probably being gamed
- Watch the variance, not just the mean: gaming often shows up as a suspiciously tight cluster around the target (see the sketch after this list)
- Talk to humans: ask engineers "how are you doing" in 1:1s; the qualitative signal catches gaming before the quantitative one does
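Here is a minimal sketch of that variance check: flag a metric whose values cluster unusually tightly right around a published target. The 10 percent and 5 percent thresholds are assumptions to tune, and the heuristic degenerates for targets near zero.

```python
from statistics import mean, stdev

def looks_gamed(values: list[float], target: float) -> bool:
    """Heuristic: low spread with the mean hugging the target suggests gaming."""
    if len(values) < 5:
        return False  # not enough data to judge
    near_target = abs(mean(values) - target) < 0.10 * target  # within 10% of target
    tight = stdev(values) < 0.05 * target                     # unusually low variance
    return near_target and tight
```

Treat a positive flag as a prompt for a conversation, not a verdict.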
What not to measure
A short list of metrics that look reasonable and are not:
- Lines of code per developer
- Commits per day
- AI suggestion acceptance rate (use only for tool tuning, never for performance)
- Story points completed (especially across teams)
- Hours logged
- PR count without size and outcome adjustment
Putting any of these on a public dashboard guarantees the wrong behavior.
A measurement cadence
A practical rhythm:
- Daily: System metrics auto-refresh, no human review needed
- Weekly: 15-minute team review of the four DORA metrics and the segmented escape rate trend
- Monthly: Cross-team review, look for divergence, identify investigations
- Quarterly: Developer survey, full dashboard review, set targets for next quarter
The quarterly review is the most important. That is where you decide what to measure differently next quarter, what to retire, what new question matters.
How to handle the executive ask: "show me developer productivity went up"
The CFO wants a number. The CEO wants a graph. The board wants a story. None of these audiences will sit through a 30-minute explanation of why SPACE is more nuanced than a single bar chart.
The framing that works:
- Lead with outcome metrics: "We are deploying 2.5x more often with the same failure rate." Outcomes the business cares about.
- Use a small basket of metrics, never a single number: Single numbers get questioned. A basket of three tells a story.
- Show variance, not just averages: "Our top quartile is shipping 4x more, our bottom quartile is shipping 1.5x more." Honest.
- Acknowledge the limits: "These metrics measure system performance, not individual effort. We supplement with developer surveys for the human side."
The executive who gets a clean, defensible, balanced metric story trusts engineering. The one who gets a single inflated number and questions it later does not.
Team-level vs individual-level measurement
A clear rule: never publish individual engineer metrics. Ever.
- DORA metrics aggregate at the team or service level
- SPACE metrics aggregate at the team level (with the satisfaction dimension collected individually and reported in aggregate)
- Quality metrics aggregate at the service or repo level
- Productivity comparisons happen at the team-to-team level, with caveats
Individual performance assessment happens through 1:1 conversations, peer review, and outcome attribution — not through dashboards. The moment you publish individual engineer metrics, you have built an environment where everyone games the metric. The metric stops measuring anything real.
Special cases
A few situations that need adapted measurement:
Platform and infrastructure teams
Platform teams ship to other teams, not to customers. DORA still applies but the "customer" is an internal team. Add:
- Developer time saved per quarter (survey-based)
- Adoption rate of platform services
- Self-service success rate (issues resolved without platform team involvement)
Research and exploratory teams
R&D teams should not be measured on DORA at all. The work is fundamentally different. Use:
- Insights documented per quarter
- Experiments concluded per quarter
- Production features influenced by research output
On-call and incident response teams
DORA's MTTR is necessary. Add:
- Pages per engineer per week (workload)
- Pages resolved without escalation
- Toil identified and eliminated per quarter
- On-call satisfaction (specifically that dimension of SPACE)
Connecting metrics back to ROI
The hard conversation is always: "what did the AI tools actually get us?" The answer should not be "acceptance rate is 35 percent." It should be a clear chain:
- Lead time dropped from 3.2 days to 1.8 days
- Change failure rate held at 12 percent
- Deploy frequency rose from 4 to 11 deploys per day
- Escape defects per release held steady
- Developer satisfaction with AI tools held at 4.0 out of 5
- Therefore the tools are creating leverage, not just activity (a rough check of this chain is sketched below)
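A minimal sketch of turning that chain into a repeatable check. The metric names are illustrative, and the values in the usage comment simply restate the numbers above with placeholder escape-defect counts.

```python
def leverage_verdict(before: dict, after: dict) -> list[str]:
    """Compare before/after metrics and report where AI tooling did, or did not, create leverage."""
    findings = [
        f"Lead time: {before['lead_time_days']}d -> {after['lead_time_days']}d "
        f"({before['lead_time_days'] / after['lead_time_days']:.1f}x faster)",
        f"Deploys/day: {before['deploys_per_day']} -> {after['deploys_per_day']}",
    ]
    if after["change_failure_rate"] > before["change_failure_rate"]:
        findings.append("WARNING: failure rate rose; velocity gain may be gamed")
    if after["escape_defects_per_release"] > before["escape_defects_per_release"]:
        findings.append("WARNING: escape defects rose; quality problem hidden in the gain")
    return findings

# Illustrative values (escape defect counts are placeholders):
# leverage_verdict(
#     {"lead_time_days": 3.2, "deploys_per_day": 4, "change_failure_rate": 0.12,
#      "escape_defects_per_release": 2},
#     {"lead_time_days": 1.8, "deploys_per_day": 11, "change_failure_rate": 0.12,
#      "escape_defects_per_release": 2})
```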
If you are sharpening this calculation specifically for Copilot, the Copilot ROI measurement guide walks the financial model in more depth.
Next steps
Pick four metrics from the eight-metric dashboard above. Implement those well, with weekly review and clear ownership. Add the next four over the following two quarters. Resist the urge to measure everything from day one — partial measurement well-done beats comprehensive measurement done poorly. If you want help wiring the dashboard or facilitating the quarterly review, reach out and we can help shape the cadence.
Ready to ship the next outcome?
One Frequency Consulting brings 25+ years of technology leadership and military discipline to every engagement. First call is operator-grade scoping — sixty minutes, no charge.