Goodhart's Law Isn't as Useful as You Might Think - Commoncog

Goodhart’s Law is a famous adage that goes “when a measure becomes a target, it ceases to be a good measure.” If you’re not familiar with the adage, you can go read all about its history on Wikipedia, and perhaps also read the related entry on the ‘cobra effect’ (which also includes a litany of entertaining perverse incentive stories, of which the eponymous cobra anecdote is merely one):


This is a companion discussion topic for the original entry at https://commoncog.com/goodharts-law-not-useful/

I really appreciated this piece, as designing good metrics is a problem I think about in my day job a lot. My approach to thinking about this is similar in a lot of ways, but my thought process for getting there is different enough that I wanted to throw it out there as food for thought.

One school of thought I was trained in holds that metrics are useful to people in four ways:

  • Direct activities to achieve goals
  • Intervene in trends that are having negative impacts
  • Justify that a particular course of action is warranted
  • Validate that a decision that was made was warranted

My interpretation of Goodhart’s Law has always centered more on the lifespan of metrics for these purposes. The chief warning is that regardless of the metric used, sooner or later it will become useless as a decision aid. I often work with people who treat metric design as “do it right the first time, so you won’t ever have to worry about it again”. This is the wrong mentality, and Goodhart’s Law is a useful way to reach many folks with this mindset.

The implication is that the goal is not to find the “right” metrics, but to find the most useful metrics to support the decisions that are most critical at the moment. After all, once you pick a metric, one of three things will happen:

  • The metric will improve until it reaches a point where you are not improving it anymore, at which point it provides no more new information.
  • The metric doesn’t improve at all, which means you’ve picked something you aren’t capable of influencing and is therefore useless.
  • The metric gets worse, which means there is feedback that swamps whatever you are doing to improve it.

Thus, if we are using metrics to improve decision making, we’re always going to need to replace metrics with new ones relevant to our goals. If we are going to have to do that anyway, we might as well be regularly assessing our metrics for ones that serve our purposes more effectively. Thus, a regular cadence of reviewing the metrics used, deprecating ones that are no longer useful, and introducing new metrics that are relevant to the decisions now at hand, is crucial for ongoing success.

One other important point: for many people, the purpose of metrics is not to make things better. It is instead to show that they are doing a good job and to persuade others to do what they want. Metrics that show this are useful, and those that don’t are not. In this case, of course, a metric may indeed be useful “forever” if it serves these ends. The implication is that some level of psychological safety is needed for metric use to be aligned with supporting the mission rather than with making people look good.


Good god, this is a rather deep, rather profound point. I think one of the things I’m beginning to have some appreciation for (but clearly not enough, for what you just said is new to me) is that there are … tricky psychological side effects around the use of metrics, especially if connected to incentives. Or, if I can be more precise: I think I realise the effects around metrics connected to economic incentives; I do not fully appreciate the effect of metrics when connected to social ones.


To preface: I’m new to the WBR and the Statistical Process Control principles behind it, so forgive my naïveté here. Based on your description of the WBR, I find it interesting that the relationships between input and output metrics appear to be determined by “feel”. This seems at risk of often mistaking correlation for causation.

Is there a more rigorous and empirical process for determining input-output metric causality? For instance, by running A/B tests to try to measure the effects of known changes to input metrics on the output metric.
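To make the question concrete, here’s a hypothetical sketch of the kind of experiment I have in mind: a standard two-proportion z-test on an A/B split where the treatment group gets extra ad spend. All numbers are made up for illustration; nothing here is specific to Amazon or the WBR.

```python
import math

def two_proportion_z(conversions_a, n_a, conversions_b, n_b):
    """z-statistic for the difference between two conversion rates."""
    p_a = conversions_a / n_a
    p_b = conversions_b / n_b
    # Pooled rate under the null hypothesis that both groups convert equally
    p_pool = (conversions_a + conversions_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# control: 480 of 10,000 converted; treatment (extra ad spend): 560 of 10,000
z = two_proportion_z(480, 10_000, 560, 10_000)
# |z| > 1.96 would reject "ad spend has no effect" at the usual 5% level
```

If |z| comes out above ~1.96, the lift is unlikely to be noise, which is a more rigorous statement than attributing it by feel.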


Hi @petro — this is an excellent question, and I say this partially because I had the exact same reaction when I started working on the WBR project.

I think the exact question I asked was a more typical marketing attribution question, something like: “say one input metric is the number of newsletter subscribers to Amazon deal emails, and another is ad spend; how do you attribute a lift in sales in some e-commerce category to ad spend vs deals?” And the answer I got from Colin was along the lines of: you want a rough sense of causation, but you don’t need precision, which in this case is impossible anyway.

I’ll give a concise answer to your question first, before following it up with a few points.

The short answer is: yeah, for certain types of questions, they do it ‘by feel’, as you put it. :grimacing: Except that they would object strongly to that characterisation! What they would say is that the causal model in their heads is the result of a deep qualitative understanding of the customer. On top of that (and this is my interpretation): you can get verification of causality by taking action to generate information. In many of these cases, you could just say “ok, let’s drive it, or let it go, for a few weeks and see if the output metric improves or declines”.

But, yes:

This philosophy and the need to practice it (a relentless focus on free cash flow) successfully drove the creation of other capabilities, such as Amazon’s robust, extremely accurate unit economic model. This tool allows folks like the merchants, finance analysts, and optimization modelers (known at Amazon as quant-heads) to understand how different buying decisions, process flows, fulfillment paths, and demand scenarios would affect a product’s contribution profit. This, in turn, gives Amazon the ability to understand how changes in these variables would impact FCF. Very few retailers have this in-depth financial view of their products; thus, they have a difficult job making decisions and building processes that optimize the economics. Amazon uses this knowledge to do things like determine the number of warehouses they need and where they should be placed, quickly assess and respond to vendor offers, accurately measure inventory margin health, calculate to the penny the cost of holding a unit of inventory over a specified period of time, and much more.
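To make the excerpt’s “unit economic model” concrete, here is a deliberately simplified sketch of a per-unit contribution-profit calculation. Every field name and number below is an illustrative assumption on my part, not Amazon’s actual model:

```python
from dataclasses import dataclass

@dataclass
class UnitEconomics:
    """Hypothetical per-unit economics for a single product."""
    selling_price: float         # revenue per unit sold
    product_cost: float          # what we pay the vendor per unit
    fulfillment_cost: float      # pick/pack/ship cost per unit
    holding_cost_per_day: float  # cost of keeping one unit in a warehouse per day
    days_in_inventory: float     # expected days a unit sits before selling

    def contribution_profit(self) -> float:
        """Revenue minus all variable costs attributable to this unit."""
        holding = self.holding_cost_per_day * self.days_in_inventory
        return (self.selling_price - self.product_cost
                - self.fulfillment_cost - holding)

unit = UnitEconomics(selling_price=25.0, product_cost=15.0,
                     fulfillment_cost=4.0, holding_cost_per_day=0.02,
                     days_in_inventory=30)
print(round(unit.contribution_profit(), 2))  # → 5.4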

The long and short of it, I think, is that for certain business things, you really can’t do it any other way, and attempting to get at a precise understanding of correlation vs causation is counter productive. But for other things, where precision is possible and good, they do go after it.

There are two other pieces that I think might be useful to think about:

First, a huge chunk of Amazon’s approach is to use proper process control tools. One big one is that metric owners must be able to recognise routine vs exceptional variation. I was working on a piece that explains how SPC does this, but couldn’t get something I was pleased with before the Chinese New Year holidays hit. Expect that members-only essay in two weeks’ time.
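As a preview of the routine-vs-exceptional distinction, here is a minimal sketch of the XmR (individuals and moving range) chart from SPC, which flags points outside the natural process limits as exceptional variation. The weekly values are made up:

```python
def xmr_limits(values):
    """Return (lower, centre, upper) natural process limits for an XmR chart."""
    centre = sum(values) / len(values)
    # Moving ranges: absolute differences between successive points
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    avg_mr = sum(moving_ranges) / len(moving_ranges)
    # 2.66 is the standard XmR constant (3 / d2, with d2 = 1.128 for n = 2)
    return centre - 2.66 * avg_mr, centre, centre + 2.66 * avg_mr

weekly_metric = [102, 98, 105, 99, 101, 97, 103, 100, 131]  # made-up data
lower, centre, upper = xmr_limits(weekly_metric)
# Points inside the limits are routine variation ("noise"); points outside
# are exceptional variation worth investigating.
exceptional = [v for v in weekly_metric if v < lower or v > upper]
print(exceptional)  # → [131]
```

The point of the chart is discipline: routine wiggles inside the limits should not trigger a hunt for special causes; only exceptional points do.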

Second, Colin told me he believed very strongly in ‘understanding your customer first, qualitatively, before forming a hypothesis and verifying with data’. He did not like the approach of developing an understanding of the customer from data or surveys. My understanding of this is that this is an adaptation of the scientific method — you don’t want to come up with a conclusion from some pattern in pre-existing data; you want to come up with a hypothesis and then test it — or at least test it on some other data set.

I apologise for being a bit all over the place in my response. As is the case with many things in business, Amazon’s approach to data is a combination of judgment, a deep understanding of the customer, and numerical rigour, and that mix was what I wanted to learn from Colin in the months I worked for him; I’m not entirely sure I understand it fully or can articulate it quite yet.

This was a great read. Goodhart’s Law always triggered a “so what” response from me. It definitely describes a phenomenon but isn’t very actionable.

Separating input metrics from output metrics makes a lot of sense. Glad to read about the nuances of how it was applied at Amazon (I’m also reading Working Backwards, so this thread could be another instantiation of concepts that I can refer to).

Having read and attempted to apply the North Star Framework from Amplitude, I think it is a good attempt to encapsulate the relationships between input and output metrics, but not nuanced enough in some aspects.

First: in Amplitude’s framework, an NSM is a leading indicator of business success, so I don’t find the part of this critique where it’s framed as an output metric to be particularly interesting. Are there more nuances between leading/lagging and input/output?

But a valid critique is that the framework does make people think that there’s only one or a few metrics to work on. Reading the framework in context of product management, I understand that a NSM (or a set of NSMs) is a result of a clearly formulated product strategy. It seems reasonable that a product strategy should focus on a few critical aspects of the business, hence its resulting small number of metrics also reflects this characteristic. Perhaps 50 output metrics were all aligned with Amazon’s product strategy? 50 does seem a lot, but still perfectly reasonable for a complex business.

A possibility is that product strategy is just what I’m focusing on as a PM, but it exists as one layer among many other strategies going on at the same time within a company. But at the same time, “product” is a fuzzy concept and almost anything can be a product. Just throwing thoughts out there.

Another thing I wonder is whether there are useful patterns that describe the dimensions to consider when crafting input metrics. For example, in Amplitude’s NSM framework, a heuristic used to break an NSM into multiple input metrics is breadth, depth, frequency, and efficiency.

A high-volume e-commerce business like Instacart could have the NSM “total monthly items received on time by customers”. Using the breadth, depth, frequency, and efficiency heuristic, the inputs for this north star metric could be:

  • Breadth. Number of customers placing orders each month.
  • Depth. Number of items within an order.
  • Frequency. Number of orders completed per customer each month.
  • Efficiency. Percentage of orders delivered on time.
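One nice property of this decomposition is that the four inputs compose multiplicatively into the NSM, which gives a quick sanity check that the breakdown is complete. A sketch, with made-up numbers:

```python
def north_star(breadth, frequency, depth, efficiency):
    """Total monthly items received on time by customers.

    breadth:    customers placing orders each month
    frequency:  orders completed per customer each month
    depth:      items per order
    efficiency: fraction of orders delivered on time
    """
    return breadth * frequency * depth * efficiency

# 10,000 customers × 4 orders/month × 12 items/order × 95% delivered on time
print(north_star(10_000, 4, 12, 0.95))  # → 456000.0
```

If multiplying the inputs does not reproduce the NSM, some dimension of the metric is missing from the breakdown.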

I personally have tried this heuristic and found it useful in providing dimensions to think about when crafting metrics. But I haven’t critiqued it enough, so I’d appreciate any thoughts on this. @petro @cedric


Following up on this topic in a new private topic, because I want to tell a few stories that I don’t want to be public yet.