There always comes a time when engineering leadership is expected to speak to the performance of their team. Advisors, board members, and executive leadership will also expect some sort of quantitative measures to support the story. At this point, engineering leaders often reach for the DORA metrics framework. The definitions of the metrics are simple and fairly easy to understand, even for non-technical stakeholders. So it's a great place to start.
I really enjoy having this conversation because it forces introspection and makes for interesting discussions. However, in my experience, it starts to get complicated once reality sets in. It almost always starts with the same assumption:
We have the data already, don't we just have to aggregate it?
It's a reasonable assumption to make. There’s git commit history. There are pull requests. CI workload logs. Deployments are automated. Jira or Linear has various information. And last but not least, there are incident management records somewhere, too.
It feels like the hard part should be over. This is just a reporting problem, right? Someone with enough data-fu skills should be able to get all that data into one place, stitch together a few timestamps, write some queries, and put it all in a nice dashboard. There are even products (Swarmia, LinearB, Waydev) out there that attempt to do this with little to no effort on your part. Set up a few integrations and metrics will be inferred from pre-existing data.
But as it turns out, collecting DORA metrics is not really a data extraction problem. It’s a modeling problem. More than that, it’s a systems problem that looks like a reporting problem. And the difference between those things shows up almost immediately if you care at all about having an accurate report.
Most implementations start in the same place: git commit history. And it is a good place to begin. Git can tell you:
What it cannot tell you, at least not reliably, is what actually happened in production.
Code can sit for an indeterminate amount of time. It can:
Git knows about code but it has no record of outcomes. And if you build your metrics entirely from git, you end up measuring activity in a repository and calling it delivery.
Lead time is usually the first place this starts to feel off.
If you’re working from git, it’s very easy to define lead time as the time between first commit and merge. Or maybe from pull request open to pull request merge. Those are real intervals, they are easy to compute, and they can get you a (hopefully) nice-looking chart.
They are also incomplete. The part of lead time that actually matters is what happens after the code is done. The time spent waiting in CI. The time lost to failed builds. The time sitting in pre-production stages. The time consumed by release coordination, approvals, and whatever informal process exists to move something into production.
That space between merge and deploy is where a lot of reality lives. If you don’t measure it, you’re not really measuring how long it takes to deliver software. You’re measuring how long it takes to finish writing it.
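To make that gap visible, here is a minimal sketch that splits lead time into the part git can see and the part it cannot. The `Change` record and its field names are hypothetical; it assumes you have already traced a change from its first commit to a production deploy, which is the hard part and is not shown here.

```python
from datetime import datetime, timedelta
from typing import NamedTuple, Optional


class Change(NamedTuple):
    """One change traced from git through to production (hypothetical shape)."""
    first_commit_at: datetime
    merged_at: datetime
    deployed_at: Optional[datetime]  # None if it has not reached production yet


def lead_time_segments(change: Change) -> Optional[dict[str, timedelta]]:
    """Split lead time into the git-visible part and the post-merge part."""
    if change.deployed_at is None:
        return None  # not delivered yet, so there is no lead time to report
    return {
        "commit_to_merge": change.merged_at - change.first_commit_at,
        "merge_to_deploy": change.deployed_at - change.merged_at,
        "commit_to_deploy": change.deployed_at - change.first_commit_at,
    }


# Illustrative timestamps only, not real data.
change = Change(
    first_commit_at=datetime(2024, 3, 4, 9, 0),
    merged_at=datetime(2024, 3, 5, 15, 0),
    deployed_at=datetime(2024, 3, 8, 11, 0),
)
segments = lead_time_segments(change)
for name, duration in segments.items():
    print(f"{name}: {duration}")
```

In this made-up example the change took 30 hours to merge and another 68 hours to reach production. A git-only report would show the first number and miss more than two thirds of the total.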
Deployment frequency has a similar problem, just in reverse. Naive reports often count merges to a main branch and treat that as a proxy for deployments. It feels close enough, especially in smaller systems, but it can break down quickly.
Sometimes twenty merges go out in a single deploy. Sometimes one merge triggers deployments across multiple services. Sometimes infrastructure changes follow a completely different path than application code. Once feature flags enter the picture, deploy and release stop meaning the same thing entirely. And don't get me started about having to account for all possible git branching strategies.
You can have a repository that looks very active and a production environment that changes relatively slowly. Or the opposite!
Counting merges gives you a number. It just doesn’t necessarily give you the one you’re looking for.
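A small sketch of the difference, using made-up records. It assumes merges come from git, deploys come from the deployment platform, and each deploy record lists the merge commits it shipped. That last field is an assumption; many platforms will not hand it to you without extra work.

```python
from collections import Counter
from datetime import date

# Hypothetical records, for illustration only.
merges = [
    {"sha": "a1", "merged_on": date(2024, 3, 4)},
    {"sha": "b2", "merged_on": date(2024, 3, 4)},
    {"sha": "c3", "merged_on": date(2024, 3, 5)},
    {"sha": "d4", "merged_on": date(2024, 3, 6)},
]
deploys = [
    # Three merges batched into one deploy of one service...
    {"service": "api", "deployed_on": date(2024, 3, 7), "shas": ["a1", "b2", "c3"]},
    # ...and one of those merges also triggering a deploy of another service.
    {"service": "web", "deployed_on": date(2024, 3, 7), "shas": ["c3"]},
]

# The proxy: merges to main per day.
merges_per_day = Counter(m["merged_on"] for m in merges)

# The actual production events per day.
deploys_per_day = Counter(d["deployed_on"] for d in deploys)

# Merges that have not shipped at all yet.
shipped = {sha for d in deploys for sha in d["shas"]}
unshipped = [m["sha"] for m in merges if m["sha"] not in shipped]

print(dict(merges_per_day))
print(dict(deploys_per_day))
print(unshipped)
```

The merge count shows activity on three separate days. Production changed on exactly one of them, and one merge has not shipped at all.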
Change failure rate is where most teams realize this isn’t just a query problem. On paper, it’s simple: what percentage of deployments cause a failure? In practice, it forces you to answer a harder question: what is a failure?
Every organization has slightly different answers. Sometimes those answers aren’t even consistent within the same engineering team.
And then there’s the question of causality.
Sometimes the answer is obvious. Often it isn’t.
At that point a query won't help you. You’re making decisions about how your system behaves and how much uncertainty you’re willing to tolerate in your measurements.
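One way to make those decisions explicit is to write the attribution rule down as code. The sketch below uses a deliberately simple heuristic that I am assuming for illustration: an incident is blamed on the most recent deploy of the same service that finished within a fixed window before the incident started. The record shapes, the rule, and the window are all assumptions, not a standard definition.

```python
from datetime import datetime, timedelta
from typing import NamedTuple, Optional


class Deploy(NamedTuple):
    id: str
    service: str
    finished_at: datetime


class Incident(NamedTuple):
    id: str
    service: str
    started_at: datetime


def attribute(incident: Incident, deploys: list[Deploy], window: timedelta) -> Optional[Deploy]:
    """Heuristic: blame the latest deploy of the same service inside the window."""
    candidates = [
        d for d in deploys
        if d.service == incident.service
        and timedelta(0) <= incident.started_at - d.finished_at <= window
    ]
    return max(candidates, key=lambda d: d.finished_at, default=None)


def change_failure_rate(deploys: list[Deploy], incidents: list[Incident], window: timedelta):
    """Return (rate, unattributed incident ids) so the uncertainty stays visible."""
    failed_ids = set()
    unattributed = []
    for incident in incidents:
        deploy = attribute(incident, deploys, window)
        if deploy is None:
            unattributed.append(incident.id)
        else:
            failed_ids.add(deploy.id)
    rate = len(failed_ids) / len(deploys) if deploys else None
    return rate, unattributed


# Illustrative data only.
deploys = [
    Deploy("d1", "api", datetime(2024, 3, 4, 10, 0)),
    Deploy("d2", "api", datetime(2024, 3, 5, 10, 0)),
    Deploy("d3", "web", datetime(2024, 3, 5, 12, 0)),
    Deploy("d4", "api", datetime(2024, 3, 6, 10, 0)),
]
incidents = [
    Incident("i1", "api", datetime(2024, 3, 5, 10, 45)),  # 45 minutes after d2
    Incident("i2", "web", datetime(2024, 3, 6, 9, 0)),    # 21 hours after d3
]
rate, unattributed = change_failure_rate(deploys, incidents, window=timedelta(hours=2))
print(rate, unattributed)
```

With a two-hour window this data gives a 25% failure rate and one incident nobody can explain. Widen the window to 24 hours and the same data gives 50%. Neither number is wrong; they just answer different questions, which is why the window belongs in the report next to the metric.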
Mean time to restore looks simpler until you try to define the boundaries. There are a lot of questions to ask. Mostly around when:
These questions don’t have universally correct answers. But the answers you choose shape the metric in ways that are easy to overlook.
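To show how much the boundaries matter, here is a rough sketch that computes mean time to restore over the same incidents with different start and end points. The four timestamp fields are my assumption about what an incident record might contain; your tooling may track fewer, more, or differently named moments.

```python
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical incident timelines, for illustration only.
incidents = [
    {
        "impact_started_at": datetime(2024, 3, 5, 10, 0),
        "detected_at": datetime(2024, 3, 5, 10, 20),
        "mitigated_at": datetime(2024, 3, 5, 11, 0),
        "resolved_at": datetime(2024, 3, 5, 13, 0),
    },
    {
        "impact_started_at": datetime(2024, 3, 9, 14, 0),
        "detected_at": datetime(2024, 3, 9, 14, 10),
        "mitigated_at": datetime(2024, 3, 9, 14, 40),
        "resolved_at": datetime(2024, 3, 9, 15, 0),
    },
]


def mean_time_to_restore(incidents: list[dict], start_field: str, end_field: str) -> timedelta:
    """Average the interval between two explicitly chosen boundaries."""
    seconds = [(i[end_field] - i[start_field]).total_seconds() for i in incidents]
    return timedelta(seconds=mean(seconds))


narrow = mean_time_to_restore(incidents, "detected_at", "mitigated_at")
wide = mean_time_to_restore(incidents, "impact_started_at", "resolved_at")
print(f"detected -> mitigated: {narrow}")
print(f"impact -> resolved:    {wide}")
```

Same incidents, same data, and the metric is either 35 minutes or two hours depending on which boundaries you picked. Passing the field names in explicitly at least forces that choice to be written down somewhere.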
At some point, the pattern becomes hard to ignore. The problem isn’t that the data doesn’t exist. It’s that it exists in different systems, each one describing a different part of the overall system that you want to measure.
Git tells you what changed. CI tells you what was built. Your deployment system tells you what actually ran in production. Your incident tooling tells you when things broke and when they were fixed.
If you try to derive everything from one of those systems, you end up with a partial view and a lot of guesswork. What works better, at least in my experience, is to define a small set of canonical events and build from there.
Those are facts. They come from different places, but they describe the same high-level system. Once you can connect them, even if imperfectly (but honestly), your report becomes more accurate.
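As a rough sketch of what a shared event model could look like: the event kinds, the source names, and the use of a commit SHA as the join key below are all my assumptions for illustration. Your own set of events may be smaller or shaped differently.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional


class Kind(Enum):
    """Hypothetical event kinds, one per fact a source system can vouch for."""
    CHANGE_MERGED = "change_merged"          # from git
    BUILD_FINISHED = "build_finished"        # from CI
    DEPLOY_FINISHED = "deploy_finished"      # from the deployment platform
    INCIDENT_OPENED = "incident_opened"      # from incident tooling
    INCIDENT_RESOLVED = "incident_resolved"  # from incident tooling


@dataclass(frozen=True)
class Event:
    kind: Kind
    source: str                        # which system reported this fact
    at: datetime
    commit_sha: Optional[str] = None   # join key, when the source knows it


def timeline_for(commit_sha: str, events: list[Event]) -> list[Event]:
    """All events that carry this commit SHA, in chronological order."""
    return sorted((e for e in events if e.commit_sha == commit_sha), key=lambda e: e.at)


# Illustrative events only, deliberately out of order.
events = [
    Event(Kind.DEPLOY_FINISHED, "deploy-platform", datetime(2024, 3, 8, 11, 0), "a1"),
    Event(Kind.CHANGE_MERGED, "git", datetime(2024, 3, 5, 15, 0), "a1"),
    Event(Kind.INCIDENT_OPENED, "incident-tool", datetime(2024, 3, 8, 11, 30)),
    Event(Kind.BUILD_FINISHED, "ci", datetime(2024, 3, 5, 15, 20), "a1"),
]
timeline = timeline_for("a1", events)
for event in timeline:
    print(event.at, event.source, event.kind.value)
```

Note that the incident does not show up in the timeline. Incident tools rarely know a commit SHA, so that link has to come from somewhere else, and the model should not pretend otherwise.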
And there’s a temptation to wait until everything is cleaned up before starting. Perfect metadata. Perfect service ownership. Perfect traceability from commit to deploy to incident.
That day doesn’t tend to arrive on its own. A rough model that is explicit about its assumptions is usually more useful than a polished dashboard that hides them.
If you’re linking incidents to deployments heuristically, say so. If you’re measuring lead time from merge to production because commit-to-production is too ambiguous in your system, say so. If feature flags mean that deploy doesn’t always equal release, say so.
Metrics can handle caveats. They don’t handle false precision very well. Be honest about the data whenever you share it up or down the chain of leadership.
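One lightweight way to keep caveats attached to the numbers is to make them part of the metric itself rather than a footnote in a slide deck. This is only a sketch; the `Metric` class and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    """A reported number that cannot be rendered without its assumptions."""
    name: str
    value: float
    unit: str
    assumptions: tuple[str, ...] = ()

    def render(self) -> str:
        lines = [f"{self.name}: {self.value:g} {self.unit}"]
        lines += [f"  assumes: {a}" for a in self.assumptions]
        return "\n".join(lines)


# Illustrative value, not a real measurement.
lead_time = Metric(
    name="Lead time",
    value=2.5,
    unit="days",
    assumptions=(
        "measured from merge to production deploy",
        "deploy does not always equal release (feature flags)",
    ),
)
report = lead_time.render()
print(report)
```

Whoever reads the report then sees the caveats in the same place as the number, whether they sit above or below you in the org chart.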
The failure mode I worry about most isn’t that teams get the numbers slightly wrong. It’s that they build something that looks right.
A dashboard that updates regularly. Numbers that trend in plausible directions. Enough complexity to feel credible. Not enough transparency to be questioned (my biggest pet peeve).
If the entire system is built on Git logs, that’s usually what you get. It’s not useless. But it’s not what people think it is.
If there’s a simpler way to say all of this, it’s probably this: Let each system tell the part of the story it actually knows.
Use Git for changes. Use CI for builds. Use your deployment platform for production events. Use your incident tooling for failures and recovery.
Then connect those pieces carefully, document the assumptions, and accept that the edges are going to be a little messy. It’s not as clean as a single query. But it’s a lot closer to the truth.