This is a story that long-time readers of Commonplace would know.
In 2010, IARPA (the Intelligence Advanced Research Projects Activity; basically ARPA, but for the intelligence community) started a research initiative called the Aggregative Contingent Estimation Program. ACE was a competition to determine the best ways to ‘elicit, weight, and combine the judgment of many intelligence analysts’, and it was motivated by IARPA’s desire to know how much money, exactly, they were wasting on geopolitical analysis. In the beginning, it consisted of five competing research programs. Within two years, only one survived: this was Philip Tetlock, Barbara Mellers and Don Moore’s Good Judgment Project.
One of the many analogies between humans and reinforcement learning is in this line. Noisy data prevents many RL/ML models from making useful predictions, no matter how advanced the algorithm. Some would say that there’s no real way to decouple algorithms from the data you feed them. But talking about them separately for a second, it’s interesting to note that one secret to AIs making useful predictions (denoising) is also the secret to people doing the same.
I was just about to say something to that effect! One of the more interesting side effects of writing about human decision making research is that I now have an excuse to talk with my ML friends. Except that, of course, they say things like “oh wow the math here is so crude” and “wow this is like a stupid version of what we do in ML”.
A couple of caveats about the BIN model, which I threw up as a Twitter thread yesterday.
First, I briefly considered not covering the math of the model in this piece, but then decided it was important enough for mathematically inclined readers. I wanted to highlight that what the authors call ‘bias, information, and noise’ are really statistical artefacts, not ‘actual’ bias, information or noise. There’s a bit of a risk of ‘the map is not the territory’ here; the model may well not reflect reality.
Second, the BIN model wasn’t tested against a new dataset; it was run against the original GJP’s dataset — which is a few years old at this point. This is always a bit of a risk — it might mean that the authors found exactly what they were looking for, because they were looking for patterns in old data. That said, in the ‘future work’ section of the paper, it seems that they’re aware of this, and they do seem eager to try out new experiments. The GJP2 also happens to be ongoing.
I quite like what @shawn said to me, after reading this post: “This is not the same as Physics, where you assume a model, test it, and then you find that reality doesn’t match the model. With this paper, the authors are assuming their model is correct, and then using it to deduce a latent variable that they can’t measure directly. So the certainty with which everything is written around this is quite uncomfortable for me.” I think this is a damn good point. It would make me feel more comfortable if the researchers could predict some implication of this model, and then verify it through experimentation.
So, all of this is to say: what the BIN model argues is interesting, but whether it captures what’s really going on under the hood for the GJP interventions, well, I think that remains to be seen. To be fair, there’s a bit of an Occam’s Razor thing going on (the paper’s explanation is simple, and plausible), but we’ll probably have to wait a couple more years to know if they got it right. If you asked me to make a prediction as to whether the result is legit, I’d place it at 0.75: fairly confident, but not at, say, 0.9.
But that’s how science works. So let’s wait and see.
Ditto. The authors have framed this in such a way that GJP conditions = reality, so a proper test/replication of this result is going to be hard.
To sum up pages 6 and 7 at a high level: The authors assume a normal distribution of the latent variables, which gives them recourse to the literature on the topic, so they can use Ravishanker and Dey’s results (near the bottom of page 6) to generate equation (1). I don’t fault them here; without an assumption about the underlying distribution, there’s no equation to be had: the problem would be intractable. (And maybe it is, in reality. We’ll see, eventually.)
They then go on to say that since they have constructed a tournament using Brier scores, the participants will be motivated to reduce conscious bias to a minimum, which gives them the reduced form of the equation on page 7.
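For readers who haven’t run into Brier scores before, here is a minimal sketch of the idea, assuming the simple binary form (mean squared difference between the forecast probability and the 0/1 outcome). The tournament’s own scoring may differ in detail, for example by summing over all answer categories, and the function name `brier_score` is mine, not the paper’s.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes.

    Assumption: binary questions only. 0.0 is a perfect score and lower
    is better; always answering 0.5 scores 0.25.
    """
    if not forecasts or len(forecasts) != len(outcomes):
        raise ValueError("need equal-length, non-empty forecasts and outcomes")
    squared_errors = [(f - o) ** 2 for f, o in zip(forecasts, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# A single 0.75 forecast, scored both ways the question could resolve.
score_if_right = brier_score([0.75], [1])  # (0.75 - 1)^2 = 0.0625
score_if_wrong = brier_score([0.75], [0])  # (0.75 - 0)^2 = 0.5625
```

Because the penalty is quadratic, shading a forecast away from what you actually believe costs you on average, which is roughly why a tournament scored this way gives participants a reason to report honest probabilities.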
The individual assumptions in the paper are plausible given the conditions under which the data was collected (otherwise the paper wouldn’t have been published), but as far as application outside that context goes, the results themselves may be noise.
From my instrumentalist perspective, the best thing to come from this paper is the skepticism about earlier work on the subject:
The results place qualifications on past portrayals of top performers (“superforecasters”) in both the scientific literature (Mellers et al. 2015) and in popular books (Tetlock and Gardner 2016)
This moved “Superforecasters” up my to-read list, because now it’s clear I won’t have to work as hard as I would have earlier to tease out the distinction between cognitive bias reduction and narrative convenience, the effects of portraiture.
“Superforecasting,” duh. I wrote the above comment about as fast as I could type it, so if it’s confusing, let me know.
As a practical de-noising alternative to the BIN model, I second the ‘How to Build a Reasoned Rule’ approach put forth by Kahneman et al. as a sidebar in the HBR article Cedric linked in the OP. A set of reasoned rules can be built in a spreadsheet, if one has a clear sense of what factors are important to a process. No model training required.
A reasoned rule set is effectively a quantified checklist that spits out a number for each instance of inputs that can be rank ordered against other instances.
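To make the ‘quantified checklist’ idea concrete, here is a rough sketch of how I understand the recipe: standardise each factor across the cases, flip the sign where lower is better, average with equal weights, then rank. The factor names and numbers below are invented purely for illustration, and this is my reading of the approach rather than a faithful copy of the sidebar.

```python
from statistics import mean, pstdev


def reasoned_rule_scores(cases, higher_is_better):
    """Equal-weight average of standardised factor scores for each case.

    cases: {case_name: {factor: value}}
    higher_is_better: {factor: bool}, False if a lower value is more favourable
    """
    factors = list(higher_is_better)
    # Mean and (population) standard deviation of each factor across cases.
    stats = {}
    for factor in factors:
        values = [case[factor] for case in cases.values()]
        stats[factor] = (mean(values), pstdev(values))

    scores = {}
    for name, case in cases.items():
        z_scores = []
        for factor in factors:
            mu, sigma = stats[factor]
            # A factor with no spread cannot distinguish cases; score it 0.
            z = 0.0 if sigma == 0 else (case[factor] - mu) / sigma
            z_scores.append(z if higher_is_better[factor] else -z)
        scores[name] = mean(z_scores)
    return scores


def rank_cases(scores):
    """Case names ordered from most to least favourable."""
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical inputs: three cases, two factors picked by someone who
# knows the domain.
cases = {
    "A": {"years_in_business": 10, "debt_ratio": 0.2},
    "B": {"years_in_business": 5, "debt_ratio": 0.5},
    "C": {"years_in_business": 1, "debt_ratio": 0.8},
}
higher_is_better = {"years_in_business": True, "debt_ratio": False}

scores = reasoned_rule_scores(cases, higher_is_better)
ranking = rank_cases(scores)
```

Nothing here is fitted to outcomes. All of the judgment lives in which factors go on the list and which direction counts as good.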
The kicker, for Commoncog purposes: Tacit knowledge is extremely valuable when constructing reasoned rules. It’s what makes them useful. For instance, I’d take a set of reasoned rules built by one of Cedric’s notional ‘Chinese businessmen’ who knows the ins and outs of brokerage plumbing over whatever it was Robinhood was running last week when they effectively halted trading of ‘meme stocks’ to cope with collateral issues.