Can you comment on mixing the two techniques? It looked like from your forum chart (by the way, the newsletter links definitely worked on me!), you’re using the SPC limit lines to determine if an experiment is statistically significant. In the same way, can you use hypothesis testing (assume there has been no change) to determine if an observational data point is out of band? Perhaps mathematically the limit lines are actually the exact same value as a p value of 5%?!
That said, is that the same concept as “risk” in the quoted sections? I wasn’t sure how to interpret that.
I tried to find a meme expressing horror at statistics, but I found this instead.
The two approaches are completely different.
The Experimental method I think you’re quite familiar with: ‘statistical significance’ means that the observed result from the sample drawn is unlikely to be the result of sampling error alone, conventionally p < .05.
The Observational approach is different. It asks: is this data I’m looking at drawn from one probability distribution or from multiple distributions? The intuition here is that for the vast majority of real world data, routine variation will fall within three sigma units of the average. But if the data you’re looking at is drawn from multiple distributions (e.g. an outside factor has impacted your metric, or the process itself has changed its distribution), then you’re going to get data that falls outside three sigma, and that’s a signal to investigate, or a sign that your intervention has successfully changed the process behaviour.
I quite like his illustration of this question — which is basically asking, you have a collection of 50 randomly drawn beads, did the beads come from randomly picking out of one bowl (aka one probability distribution) or randomly picking out of many bowls?
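To make the three-sigma intuition above concrete, here is a minimal sketch of how XmR limits are commonly computed: the average of the baseline, plus or minus 2.66 times the average moving range. The function names and the weekly numbers are made up for illustration, and the check covers only the simplest detection rule (a point outside the limits).

```python
from statistics import mean

def xmr_limits(values):
    """Return (centre, lower, upper) natural process limits for an XmR chart."""
    if len(values) < 2:
        raise ValueError("need at least two points to form a moving range")
    centre = mean(values)
    # Moving ranges: absolute differences between successive points.
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    mr_bar = mean(moving_ranges)
    # 2.66 is the usual XmR scaling constant (3 / 1.128); it converts the
    # average moving range into three-sigma limits.
    return centre, centre - 2.66 * mr_bar, centre + 2.66 * mr_bar

def points_outside(values, lower, upper):
    """Indices of points that fall outside the limits."""
    return [i for i, v in enumerate(values) if v < lower or v > upper]

# Hypothetical weekly metric: nine routine weeks, then one unusual week.
baseline = [52, 48, 50, 55, 47, 51, 49, 53, 50]
series = baseline + [80]

centre, lower, upper = xmr_limits(baseline)
signals = points_outside(series, lower, upper)
print(f"limits: {lower:.2f} to {upper:.2f}, signals at indices {signals}")
```

In the bead analogy, the nine baseline weeks behave like draws from one bowl, while the last point is far enough out that it probably came from a different one.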
At the end of the day, though, both methods, while different, aim to figure out the same thing, which is “I have done X, does it cause Y?” With the process behaviour chart approach, I’m looking for a change in the process as a result of some action I’ve taken. With the experimental approach, I’m looking for an effect to show up in a population given a change in one of the groups. Both methods can be used in a business, of course, but the statistical concept of ‘a change has occurred’ is different in each.
Oh, the ‘risk’ in the quoted sections refers to the risk of a false alarm. Wheeler is trying to justify the extreme conservatism of the process behaviour chart approach, which actually rejects a ton of signal in favour of not giving a false alarm. (i.e. if you see something that breaks one of the three rules, it’s highly likely to be exceptional variation, but just because you don’t see any special points doesn’t mean a change hasn’t occurred.)
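As a rough way to see how conservative three-sigma limits are next to a 5% threshold, the sketch below compares the two-sided tail probabilities of a normal distribution at 3 and at 1.96 standard deviations. The normal distribution is an assumption made here purely for illustration; the three-sigma argument in this thread is about real-world data generally, so treat the numbers as a guide rather than an exact false alarm rate.

```python
import math

def two_sided_tail(k):
    """P(|Z| > k) for a standard normal Z (normality is an assumption here)."""
    return math.erfc(k / math.sqrt(2))

p_three_sigma = two_sided_tail(3.0)    # a point beyond three-sigma limits
p_conventional = two_sided_tail(1.96)  # the usual two-sided 5% cutoff

print(f"beyond 3 sigma:    {p_three_sigma:.4f}")
print(f"beyond 1.96 sigma: {p_conventional:.4f}")
```

Under that assumption the three-sigma limits correspond to a false alarm chance of roughly 0.27% per point, not 5%, so the limit lines are much stricter than a p < .05 cutoff.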
Thanks for the clarification. Should have been less lazy and reviewed the previous articles in this series. I understand the statistical difference underpinning the two approaches, now.
I think I’m still getting caught up on the difference between an experiment and a change in process. I first assumed that SPC was more to monitor a process for changes (essentially to keep it in the lane, and be able to respond quickly if out of band). So the method appears reactive instead of proactive.
So my confusion for this chart was that it looks like we’re trying to tell if changing a process makes a statistically significant difference:
But still not sure when to use which when trying to improve a given metric. Am I running an experiment or am I changing a process? Is the major difference the existence of a control group?
The point about multiple distributions is a great one. Immediately I start to think about isolating each distribution, but this is likely impossible unless we change one potential source of distribution at a time.
Hmm, what’s the nature of your confusion? This is going to be useful for me, as I’ve been explaining XmR charts to various friends and startup folk recently. (I’m starting to see them as a powerful but useful crutch to get people used to the idea of ‘there’s such a thing as routine and special variation’!)
Some more notes, in the hopes of perhaps resolving more questions:
In manufacturing, yes, process behaviour charts are intended to help keep a process in lane, and to respond quickly if out of band. But it’s actually more accurate to say that XmR charts do two things: a) they allow you to separate signal from noise when observing a metric, and b) they allow you to characterise a process’s behaviour. This more generalised take allows you to use XmR charts in slightly different ways.
In this case I’m not trying to detect exceptional variation. (There is one point of exceptional variation, on Oct 22, so I happen to know that my change has resulted in a real outcome. But even if there wasn’t, it can still be useful to just do this ‘place divider and plot the new XmR chart’ step.)
Rather, my goal is to characterise this new process, like in Step 2. I’m trying to see how the new process behaviour is different, given that I’m now regularly pasting links to forum topics in the newsletter. Concretely I want to see where the new limit lines end up, where the new average is, and so on. At about 6 new points the limit lines will begin to gel, and at about 10 points they will start to harden. In practice I’ll probably just make a new change at the 6 point mark (assuming I have the time to do so, since that’s near my wedding). This is one of the limitations I discussed in the article: the observational studies approach that XmR charts belong to requires you to wait before you make your next change.
This is, incidentally, the PDSA loop in action. You can hopefully see how repeated trial and error cycles like this, with feedback, help me build an intuition for what works and what doesn’t in moving the needle for various metrics I care about. (Also, keep in mind that we’re running multiple such changes in parallel; this is just one of the levers we’ve found that we can pull).
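The ‘place divider and plot the new XmR chart’ step can be sketched roughly as follows: split the series at the point where the change was made, then compute a separate average and set of limits for each side. The function names, the `min_points` parameter, and the numbers are hypothetical; the 6-point threshold is taken from the rule of thumb above.

```python
from statistics import mean

def limits(segment):
    """Return (lower, centre, upper) XmR limits for one segment."""
    centre = mean(segment)
    # Average moving range between successive points in this segment only.
    mr_bar = mean([abs(b - a) for a, b in zip(segment, segment[1:])])
    return centre - 2.66 * mr_bar, centre, centre + 2.66 * mr_bar

def characterise(values, divider, min_points=6):
    """Compute separate limits before and after the divider index."""
    before, after = values[:divider], values[divider:]
    return {
        "before": limits(before),
        "after": limits(after),
        # Flag limits built on too few points to have begun to gel.
        "after_is_provisional": len(after) < min_points,
    }

# Hypothetical metric: 8 points before the change, 6 points after it.
metric = [20, 24, 22, 19, 23, 21, 22, 20, 30, 34, 31, 35, 32, 33]
result = characterise(metric, divider=8)

for name in ("before", "after"):
    lower, centre, upper = result[name]
    print(f"{name}: average {centre:.2f}, limits {lower:.2f} to {upper:.2f}")
```

Comparing the two sets of numbers shows where the new average and limit lines have ended up, even when no single point breaks the old limits.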
Yeap! Most of the time in business we’re running ‘trial and error’ cycles, instead of rigorous ‘two group’ experiments. The benefit of ‘two group’ experiments is that they can detect subtle changes relatively quickly, given a large enough sample size. The tradeoff is that setting up an experiment takes a fair bit of work.
On the other hand, trial and error cycles are something we intuitively do: change a thing, wait to see if the change is good or bad. XmR charts just make this process easier, since it can sometimes be hard to eyeball the data in the weeks after you’ve made the change to figure out if a change is real.
It’s not mentioned anywhere, but this article seems to be about the difference between causal and statistical inference.
Or rather: it’s between the group that wants to make causal conclusions from statistical observations but doesn’t know the general theory, and the group that wants to make causal conclusions from experiments but doesn’t know how their methods relate to the purely observational ones.
I’m trying to think of a reading which is specific to explaining this distinction, but can’t think of one. If you can get a copy, I might recommend just reading Chapter 1 of Causality by Judea Pearl, which explains causal networks and gives the basic tool for unifying experimental and observational inference in a common framework.
Your response was great, and really clarified it. Thanks for putting that all together.
My confusion was just in nomenclature: people in business often use the term “experiment” very loosely. Like, starting a business itself could even be considered an “experiment”: let’s just launch this and see if it sticks.
I understand now we’re talking more about a scientific control-group style experiment.
A follow-up question: how can we track a non-linear growth rate in these charts? For example, you have new subscribers as a chart, which assumes linear growth. For compounding growth, perhaps chart the percentage change instead?