Beck’s Measurement Model, or Why It’s So Damn Hard to Measure Software Development - Commoncog

A pretty useful lens for when a business activity is easy to measure, and when it is not. Part of the data driven series.


This is a companion discussion topic for the original entry at https://commoncog.com/becks-measurement-model-or-why-its-so-damn-hard-to-measure-software-development

Oh, I like this model, as it also illustrates why leadership and management are so hard to measure: the Effort and Output buckets are generally not easily tied to Outcome and Impact.
– A manager who runs around busily holding status meetings and managing OKRs can point to a lot of Effort and Output, but if they’re not focused on meaningful business outcomes, they will have zero (and possibly negative) Impact.
– Another leader might produce very little Output, but have a huge Impact because they align their team on meaningful customer outcomes.

But most performance management systems will reward the first manager because their work is more easily measurable.

And for the first time, I can see a connection between the work I did as a revenue forecaster, my time as Chief of Staff, and my current work as an executive coach - in each job, I discerned patterns in complex domains to connect inputs to impact. I couldn’t always do it in a quantifiable or explainable way, but the fact that I could consistently do it meant I had an edge on others who were more focused on effort and output.

At one point after the Chief of Staff job had been formalized into a job ladder, I complained that they were measuring me on the wrong things. They were optimizing for the easily measurable Effort and Output, so Chiefs of Staff who generated a lot of process and overhead were actually rewarded and promoted, whereas I was seen as lesser because I refused to impose unnecessary work on my tech teams. The connection between the rise of such Effort and Output operations teams at Google since 2018 and Google’s business prospects is merely a correlation, I’m sure :wink:

2 Likes

Oh wow, that’s a good point. I think in High Output Management Andy Grove said something like ‘the output of a manager is decisions, and the sole job of a manager is to increase the output of the team’, which explains why you cannot tell if a manager is good or not purely based on their activities.

That … makes a ton of sense. I’m also now reflecting on the startup people I know who chafe at big-company bureaucracies. I think most of them are oriented toward Outcomes and Impact, and refuse to play promotion games that reward Effort and Output (which, in their minds, are clearly not correlated with the former).

The connection between the rise of such Effort and Output operations teams at Google since 2018 and Google’s business prospects is merely a correlation, I’m sure :wink:

Ha!

1 Like

This seems tied to how Warren Buffett judges management based on whether or not they act like owners.

The owner of a business would only take action or impose process if it would meaningfully grow the enterprise value of the business. If a manager, or even a software engineer, can clearly explain how their actions affect enterprise value (i.e. they are thinking like an owner), then chances are they are taking the correct actions.

The problem is that each business is different, so the levers to affect growth in enterprise value could be anything. You need to understand what is limiting the business in the frame of an owner or investor, then make sure everyone can identify how their work is affecting those factors.

Metrics come in either beforehand, to identify those factors (i.e. to build a model of the business), or afterwards, once the model is clear, to track success.

3 Likes

This one hit close to home and prompted a lot of thought, since my work entails many of the same challenges that the McKinsey folks are wrestling with here. If you think measuring software developer productivity is hard, try measuring the productivity of IT operations staff and infrastructure engineers, whose work is even further removed from the direct creation of value! I had not seen this before, and I too like Beck’s model as a way to explain the challenges inherent in measurement, so thanks for sharing this and your thoughts on it.

I’ll echo the key point above: the less directly and less immediately work is connected to business value, the higher the risk that activity and effort become the focus instead of impact. The reason is simple - the evaluator’s model of performance is likely to be too simplistic. Using Beck’s model, we can describe 4 models of performance evaluation:

  • Effort = value. Top performance is assigned to whoever works the most hours and looks the most miserable doing it.
  • Output = value. Top performance is assigned to whoever turns the crank the most.
  • Outcome = value. Top performance is assigned to whoever generates the biggest shift in the behaviors targeted.
  • Impact = value. Top performance is assigned to whoever generates the biggest shift in achievement of the overall mission.

For all the obvious shortcomings of “effort = value”, it has 2 big points in its favor: it’s always present, and it requires no domain-specific knowledge on the part of the evaluator. Hence the classic example of the boss watching when people come into and leave the factory as a proxy for who is a good worker and who is not. For the boss to move beyond this, they would need both a much more elaborate model of how each person’s work contributes to the whole, and much more attention to whether each person is actually doing that work.

Now think about this from McKinsey’s viewpoint. Their target customer is one who is not good at managing software development and is worried their career will suffer for it. McKinsey focuses on Effort and Output measures because doing so lets them show the quickest progress (validating that hiring McKinsey was smart), get up to speed faster (since data on effort and outputs is much easier to collect and review), and show how the customer compares to others (the quickest way to alleviate the worry of being replaced is to show that you are better than the alternatives).

Thus, the model published is ideal for their purposes to get a foot in the door and establish credibility with their target customer. It is simple to explain, appears to address the problem the customer thinks they have, and has just enough proprietary knowledge embedded so customers can’t quite do it themselves. Once they have their engagement cycle going, then they can always build out outcome and impact measures with their customer as needed later. By the time their customer realizes they need a better measurement model, McKinsey will have the specific domain knowledge needed to build it and can do so at a profit. Ka-ching!

To the credit of the linked responses, I think there is a recognition that McKinsey is not actually saying “these are the metrics you should base your decisions on regarding software development.” After all, McKinsey’s title of “Yes, You Can Measure Software Developer Productivity” is clearly intended to answer the question “Can I Measure Software Development Productivity?” Many screeds on this article miss that many people think these are equivalent questions, even if more knowledgeable folks know they are not. People will always prefer “yes, here is an answer to your question” to “you are asking the wrong question” or “I can’t give you a good answer without spending time and effort understanding your situation, which I won’t do for free.”

4 Likes

I did not expect a cynical (and likely highly accurate!) deconstruction of McKinsey’s strategy, and a defence of their business practices in one reply, @Roger! :wink: Bravo — I did not realise this was what they were doing!

1 Like

One key distinction that is easily lost, but quite important, is that the original article is titled “Can You Measure Software Developer Productivity?” They are specifically referring to an individual’s productivity, not a team’s. I often see a team’s productivity given the term “velocity” (which is often connected with agile sprints and story points, another gameable system to be sure). I want to dig into this a bit more, because the scope is as important as the measure.

I’ve been managing engineering teams of 2-5 people for some time now, and have often struggled with this. To connect it to the WBR idea that one division’s output metric could be another’s input metric: one individual’s output metric is a team’s input metric. So I think accounting for the shift in the mixture between individuals and teams as we move through Beck’s model is important. Here’s my stab at it.

Effort — simply an individual’s time spent working on, figuring out, and solving/coding a particular engineering problem or task.

Output — number of tasks completed (overly simplified), which is strongly correlated with number of PRs. This can roughly be equated to effort * individual talent/skill/intelligence. (We’ve all seen incompetent developers take a week to do what a good dev can do in a few hours.)

Outcome — now we get into team dynamics. This can be thought of as the sum of all developer output * some product accuracy factor: the somewhat unknown ability of a given set of output to produce a desired behavior. Note that this value can be negative if the product team is wrong about the feature’s outcome!

Impact — this is even trickier, as it’s now the combination of the entire org’s efforts towards the impact goal, times (again) some unknown customer-psychology return factor (does engagement with a certain feature really reduce churn?).

To get really mathy and belabor the point, ignoring potentially crushing external market factors (i.e. “friction”):

Impact = (A’ + O) * R
where A’ is the outcomes achieved by other departments in the org, O is the outcomes from the Eng team, and R is an uncertain “return” on those outcomes in terms of business value.

O = V * P
where V is the development team’s velocity and P is the product accuracy factor, which can be negative.

V = SUM(Ci)
Velocity is the sum of all developers’ contributions/output.

Ci = Ei * Ti
where Ci is an individual’s contribution, measured as their individual effort times individual talent.

So factoring it all out:

Impact = ((SUM(Ei * Ti) * P) + A’) * R

For a 3-person team, that might be:

Impact = E1 * T1 * P * R + E2 * T2 * P * R + E3 * T3 * P * R + A’ * R

Assuming T is relatively fixed, P and R will not be known until after the fact (and are often based on luck, “product taste”, and the kind of “business knowledge” a WBR provides), and you have little control over A’. So tell me: which numbers are you inclined to measure to get at individual performance? Probably E * T, because, since your devs aren’t clocking in and T is fixed, output is at least observable. Now we see why this is so tempting.

So we see how the unknowns compound, and adding more teams muddies the picture even more.
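To make the compounding concrete, here’s a rough sketch of the model above in Python. All the numbers are made up purely for illustration, and the variable names just mirror the formulas:

```python
# A minimal sketch of the model above (numbers made up for illustration).
# E = individual effort, T = individual talent, P = product accuracy factor,
# R = return on outcomes, A_other = outcomes from other departments (A').

efforts = [40, 35, 45]      # E1..E3: hours each of three devs put in
talents = [1.0, 1.5, 0.8]   # T1..T3: relative skill multipliers
P = 0.6                     # product accuracy factor (could be negative!)
R = 0.3                     # uncertain "return" on outcomes in business value
A_other = 20.0              # outcomes contributed by the rest of the org

# The only thing observable at review time: each dev's output, E_i * T_i
contributions = [e * t for e, t in zip(efforts, talents)]

velocity = sum(contributions)        # V = SUM(Ci)
outcome = velocity * P               # O = V * P
impact = (outcome + A_other) * R     # Impact = (O + A') * R

print(contributions)     # [40.0, 52.5, 36.0] -- visible up front
print(round(impact, 2))  # 29.13 -- only knowable long after the fact
```

The only values you can actually observe at review time are the contributions on the first print line; P, R, and A’ only become clear well after the work is done, which is exactly the temptation described above.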

But my contrarian take here is that measuring individual output might not actually be a bad thing. Consider that I can tell a developer he needs to work on Project X. It’s a high-risk, experimental project. His contributions are very high, the quality of the output is very high, but our customers hate it and we shelve it. Is that developer’s productivity negative? Hard to argue that one.

I’d also argue that output is a leading indicator for the other parts of the model. All things being equal, prolific quality output almost always leads to outcome and impact over the long range. @cedric, you yourself have a high rate of prolific, quality output. But what separates you from an AI content farm? The key here is quality. You can only assess quality by observing the work.

This brings me to my next point: good engineering leads should encourage high-quality output, and often. It’s then up to leadership to take care of the Ps and Rs, and to collaborate with other functions to encourage the A’. (Note: one indication of a senior engineering level is how much autonomy and skill an engineer shows in shifting their output to meet impact, heading off incorrect assumptions early, etc.)

It’s the blocking and tackling. It’s the military drill. It might be the wrong hill to take, but dammit are we going to take it swiftly and mercilessly.

In fact I think the military analogy here is apt (I’ve never been in the military, but I’ve read a lot about battlefield leadership and war memoirs). The general is judged on the battle strategy (outcome) that leads to winning the war (impact). Every individual soldier knows their mission, but they might be tasked with manning artillery or holding a machine gunner position. A good soldier will hold that position, shoot with skill, and show a high degree of courage and teamwork. But a great soldier will see when the plan has hit a snag and what he can best do to still accomplish the mission.

Anyway, any productive takeaways from this ramble? I think execs should specify a desired impact, along with a time frame and risk appetite; product teams should then know what levers of outcome they have to drive that impact, and be measured that way. Inside the product team, the tech lead who understands and demands prolific quality will measure output quantitatively and quality qualitatively. (NB: how do you “measure” quality? Not sure. A good low-debt design should be rewarded/measured somehow.) Individuals can be trusted to take care of their own effort (the idea of a “Results-Oriented Work Environment”). The tech lead also needs to put the output metric in the context of the product phase: discovery/prototype is different from development, which is different from maintenance. Finally, they also need to understand the level of difficulty or effort in each PR.

So, point being, I think output should be measured, but we need some way for a strong tech lead to contextualize that output in terms of phase, complexity, and quality. How would this be communicated to non-technical execs? Or put under SPC? I need to noodle on this one, but does anyone have ideas here?
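On the SPC question, here’s one rough idea, assuming you track something like a weekly count of merged PRs or completed tasks for the whole team: put it on an XmR chart and only treat points outside the natural process limits as signals worth digging into.

```python
# A rough sketch of putting weekly output under SPC with an XmR chart.
# All numbers are made up; in practice this might be merged PRs or
# completed tasks per week for the whole team, not per individual.

weekly_output = [12, 9, 14, 11, 10, 13, 8, 15, 12, 11]

mean = sum(weekly_output) / len(weekly_output)

# Moving ranges: absolute difference between consecutive weeks
moving_ranges = [abs(b - a) for a, b in zip(weekly_output, weekly_output[1:])]
avg_mr = sum(moving_ranges) / len(moving_ranges)

# Standard XmR constant: natural process limits are mean +/- 2.66 * avg moving range
upper = mean + 2.66 * avg_mr
lower = max(0.0, mean - 2.66 * avg_mr)  # a count can't go below zero

print(f"mean={mean:.1f}, routine variation between {lower:.1f} and {upper:.1f}")

# Only weeks outside the limits are signals worth investigating (a big refactor,
# a production fire, someone on vacation); everything inside is routine variation.
signals = [(week, x) for week, x in enumerate(weekly_output, start=1)
           if x > upper or x < lower]
print(signals)
```

That wouldn’t solve the quality or complexity context problem, but it would at least stop execs from reacting to routine week-to-week variation as if it were performance news.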

3 Likes