Kartik Patel

Measuring AI Engineering Productivity Without Fooling Yourself

I am running an internal AI developer platform for roughly 25 engineers right now. It pulls from Confluence, Azure DevOps, GitHub, and internal databases to let engineers ask grounded questions about our codebase and runbooks. I am in the early months of trying to measure whether it actually changes how the team works, and I am being more careful about the numbers than I used to be.

What I am not measuring

Tool adoption. Most AI productivity reports I see, internally and from peers, measure exactly this: prompts per engineer per week, percentage of code with AI involvement, Copilot acceptance rates. Adoption tells you whether engineers opened the tool. It does not tell you whether their work got better, faster, or more correct.

Self-reported productivity. Engineers reliably report that AI tools save them time. That is worth knowing, but it is not a measurement. It is a sentiment. The correlation between self-reported time savings and actual cycle time is weaker than you would hope.

What I am trying to measure

Cycle time, by work type. Bug fix, small feature, larger feature, infrastructure, on-call. AI tooling helps with some of these much more than others. Aggregating across them washes out the signal.
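
To make the segmentation concrete, here is roughly the computation, as a minimal sketch. It assumes each work item carries a type label and start/close timestamps; the WorkItem shape and type names are illustrative, not our actual schema.

    # Median cycle time per work type. WorkItem fields and type labels
    # are illustrative; substitute whatever your tracker exports.
    from collections import defaultdict
    from dataclasses import dataclass
    from datetime import datetime
    from statistics import median

    @dataclass
    class WorkItem:
        work_type: str      # e.g. "bug_fix", "small_feature", "infra", "on_call"
        started_at: datetime
        closed_at: datetime

    def cycle_time_by_type(items: list[WorkItem]) -> dict[str, float]:
        """Median cycle time in hours, keyed by work type."""
        buckets: dict[str, list[float]] = defaultdict(list)
        for item in items:
            hours = (item.closed_at - item.started_at).total_seconds() / 3600
            buckets[item.work_type].append(hours)
        # Median rather than mean: a handful of long-running items would
        # otherwise dominate the number and hide the per-type signal.
        return {t: median(hs) for t, hs in buckets.items()}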

Quality at the boundary. Defects found in code review and post-deploy issues, normalized for code change volume. If the tool is helping people ship faster, that should not come with a quality cost. If it does, the productivity gain is fake.
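
The normalization itself is one line; the point is to compare periods on defect density rather than raw counts. A minimal sketch, with made-up numbers purely for illustration:

    # Review findings plus post-deploy issues, per 1,000 changed lines.
    def defect_density(defects: int, lines_changed: int) -> float:
        return 0.0 if lines_changed == 0 else 1000 * defects / lines_changed

    # Made-up numbers: more code and more raw defects, but lower density,
    # which reads as a quality improvement rather than a regression.
    before = defect_density(defects=24, lines_changed=18_000)  # ~1.33
    after = defect_density(defects=30, lines_changed=31_000)   # ~0.97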

Support ticket triage time. This is one of the workflows I am most interested in. Engineering teams spend real time on support overflow. AI tooling that can do first-pass triage - pull relevant context, suggest a likely cause, route to the right owner - has the highest ROI in my experience, because the baseline is bad.
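
To make that first pass concrete, here is a toy version of the shape. The keyword lookup stands in for the real retrieval and model step, and the runbook table, team names, and wiki paths are all invented:

    # Toy first-pass triage: pull context, suggest a cause, route an owner.
    # Everything in RUNBOOKS is invented; the keyword match is a stand-in
    # for the actual retrieval + model step.
    from dataclasses import dataclass

    RUNBOOKS = {
        "timeout": ("Likely downstream dependency timeout", "platform-team",
                    "wiki/runbooks/timeouts"),
        "login": ("Likely auth token expiry", "identity-team",
                  "wiki/runbooks/auth"),
    }

    @dataclass
    class TriageResult:
        likely_cause: str
        owner: str
        source: str  # a citation the on-call engineer can verify

    def triage(ticket_text: str) -> TriageResult | None:
        text = ticket_text.lower()
        for keyword, (cause, owner, source) in RUNBOOKS.items():
            if keyword in text:
                return TriageResult(cause, owner, source)
        return None  # no confident match: fall back to human triage

    print(triage("Users report login failures since 09:00"))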

Engineer time spent on context-gathering. This is harder to measure directly, but proxies exist: how often engineers search wikis, how often they ping each other for runbook information, how often they end up reading the wrong document. The platform is meant to reduce this. I want to know whether it does.
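
A rough sketch of the proxy aggregation, assuming you can export event logs with a week and an event kind; the event names here are illustrative:

    # Count context-gathering proxy events per (week, kind).
    # Event kinds are illustrative; use whatever your logs actually emit.
    from collections import Counter

    PROXY_KINDS = {"wiki_search", "runbook_ping", "doc_view"}

    def weekly_proxy_counts(events: list[dict]) -> Counter:
        return Counter(
            (e["week"], e["kind"]) for e in events if e["kind"] in PROXY_KINDS
        )

If the platform is doing its job, these counts should trend down per engineer over time.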

What I have learned so far

The platform has to earn the second use. It is easy to get engineers to try a new tool once. Getting them to use it on their actual third task of the day is the bar. If the answers are wrong, partial, or unclear about their sources, engineers stop trusting the tool.

Citations matter more than answer quality. A roughly correct answer with sources is more useful than a more correct answer without them, because engineers can verify and adjust. This shaped how I built the retrieval and response layer.
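
Concretely, this pushed the response layer toward a shape where every answer carries the passages it was grounded in, so verifying is one click rather than another search. A minimal sketch, with illustrative field names:

    # Every answer ships with its sources. Field names are illustrative.
    from dataclasses import dataclass

    @dataclass
    class Citation:
        title: str
        url: str
        snippet: str  # the retrieved passage the answer leaned on

    @dataclass
    class GroundedAnswer:
        text: str
        citations: list[Citation]

        def render(self) -> str:
            refs = "\n".join(
                f"[{i + 1}] {c.title}: {c.url}" for i, c in enumerate(self.citations)
            )
            return f"{self.text}\n\nSources:\n{refs}"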

The hard part is the data, not the model. The biggest single factor in whether the platform is useful is whether the underlying documentation, ADRs, runbooks, and code comments are good. AI tooling makes good documentation more valuable and bad documentation more visible.

What I am not ready to claim yet

I would want at least two quarters of data before I put a multiplier number on this work. Most published numbers in this space were measured too early, in conditions that do not generalize. I would rather under-claim now and have a defensible number later than overclaim and have to walk it back.