Not so long since, the UK tested a nuclear missile. Obviously, as this is the UK, it didn’t work.
No, it popped out of the submarine, whereupon the rocket motor failed to ignite and the damn thing splashed harmlessly into the sea a few hundred metres away, presumably to audible hoots of derision from the Kremlin, and probably some choice language from the defence minister who had been flown in to observe the test. It later transpired that, before the test began, the crew had forgotten to remove some equipment that disabled the rocket motor.
This vintage England episode (to borrow a phrase from my other brother) built upon the other world-beating embarrassment of our last test in 2016, where the missile, thanks to a data fault, veered so far off course it had to be destroyed.
The missile in question, the submarine-based US-built Trident II D-5, is the sole delivery system for the UK nuclear deterrent. The US and UK both maintain a stock of these missiles drawn from a shared pool, but operate, maintain, and test them independently.
Over Trident’s lifetime, the two nations have conducted a total of 196 sea launches, 191 of which have been successful, for a mere 2.6% failure rate. Of these 196 test launches, the UK has conducted only 12: the first 10 were successes, and the last two were failures.
Naturally this incident produced a raft of punditry, including this piece, from William Alberque at the International Institute for Strategic Studies, which argues that these failures, while embarrassing, are ultimately inconsequential, thanks to the long and successful overall test history; either this is just a string of bad luck, or it exposes deficiencies in the UK’s testing processes that should have no effect on combat operations.
I’m interested in whether or not this argument is correct, and I’m especially interested in how one arrives at these conclusions. Specifically, I wonder why, outside of scientific journals, nobody ever seems to think much about using the (dead easy) statistical tools and tests carefully developed for this kind of task. My goal is not to exhaustively break down every single variation of analysing this simple statistical problem, but simply to highlight that we have ways of doing this. We don’t have to guess.
Testing
In this example, you’d probably start by splitting the launches up into a simple US/UK success/failure grid. Doing so tells you the US has performed 181 successful and 3 failed launches, while the UK has managed 10 successful and 2 failed launches.
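Laid out as a grid (the same numbers, arranged as what statisticians call a contingency table):

                Success   Failure   Total
    US             181         3     184
    UK              10         2      12
    Total          191         5     196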
If I were a pundit, my analysis would typically stop here, and I would draw a conclusion. I might say that the overall failure rate is low, and the missiles all come from the same pool, so the missiles probably work fine. I might say that the UK has failed a lot more than the US, and therefore we’re much worse at firing missiles. I might even say that the US has had three failed launches, and the UK has had two, so the UK’s still ahead of the US (and I rather suspect the number of people who would not notice that horrific fallacy, especially if it were dressed up less obviously, is frankly worrying).
(Actually, if I were a real pundit, I’d have decided the conclusion before I’d even written the piece, but I digress).
In this example, the difficulty is the fact that the UK has launched so few missiles. If this were scaled up, and the UK had launched 200 with 30 failures, you’d know something was wrong; but what about with such a small number of launches? How do you compare it to the much larger number of launches by the US?
In this case, you’d use the wonderful little bits of magic called statistical tests. There are endless statistical tests, for endless different situations, but happily, knowing only half a dozen of the most common ones would cover almost all of the situations you’d need them for. Which test to use generally depends on whether your data are discrete or continuous, binary or ordinal, whether they follow the normal distribution or not (if there are no obvious biases, they generally will), and whether the sample size is small or not (a “large” sample usually meaning a group of 30 or more).
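To make that slightly more concrete, here is a rough and deliberately oversimplified cheat-sheet for the common “compare two groups” cases. The groupings and labels are mine, not an official taxonomy, and real decisions come with more caveats:

```python
# A rough cheat-sheet for comparing two groups -- a simplification, not gospel.
TWO_GROUP_TESTS = {
    ("binary outcome (success/failure)", "small counts"): "Fisher's exact test",
    ("binary outcome (success/failure)", "large counts"): "chi-squared test",
    ("continuous, roughly normal", "any sample size"): "two-sample t-test",
    ("continuous but skewed, or ordinal", "any sample size"): "Mann-Whitney U test",
}
```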
Here, you’d use a test called Fisher’s exact test. The mathematical details, while not terribly complex (apart from maybe the next couple of sentences), are not important here; what is important is that Fisher’s test checks for an association between two categorical variables when the sample size is small. The test assumes the UK and US are equally good at firing missiles, and then calculates, under this assumption, the probability of ending up with data that looks like ours.
In simple terms: imagine that you wrote “success” on 191 bits of paper and “failure” on 5, put them all in a hat, and then pulled out, at random, just 12, to represent the UK’s share of these launches. Frankly, it seems like you’d be quite unlikely to pull out even one failure, and pulling out two would definitely be unlucky. All Fisher’s exact test does is the long, arduous sum you’d need to do to work out the actual chance of drawing that many failures (or worse) in your 12 attempts.
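For the curious, here’s roughly what that sum looks like when you make a computer do it; a quick sketch using scipy’s hypergeometric distribution (my choice of tool, plugging in the launch totals above), which is exactly the “drawing from a hat” distribution:

```python
from scipy.stats import hypergeom

# The hat: 196 slips of paper, 5 marked "failure", and 12 drawn at random
# to stand in for the UK's share of the launches.
hat = hypergeom(M=196, n=5, N=12)

p_exactly_two = hat.pmf(2)  # chance of drawing exactly 2 failures (~0.029)
p_two_or_more = hat.sf(1)   # chance of drawing 2 or more failures (~0.031)
print(p_exactly_two, p_two_or_more)
```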
I didn’t know which test to use off the top of my head; I got ChatGPT to tell me (and manually confirmed it on Wikipedia), then ran the numbers through an internet calculator, which, after less than 30 seconds of work, gave me this:
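(If you’d rather not trust a random website, the same number falls out of a couple of lines of Python; this is a sketch of the equivalent scipy call, not the calculator I actually used:)

```python
from scipy.stats import fisher_exact

#          successes  failures
table = [[181, 3],   # US
         [10,  2]]   # UK

_, p_value = fisher_exact(table, alternative="two-sided")
print(p_value)  # ~0.0311, matching the calculator
```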
The test value (or p-value), 0.0311 (3.11%)*, indicates that if the US and UK were equally proficient at launching these missiles, there would be only about a 3% chance of the UK racking up this many failures in so few launches; that is conventionally taken as reasonably strong evidence that the two are not, in fact, equally good at launching them**.
(Note that this is distinctly different to “there is a 97% chance that the UK is worse at launching its missiles” and is definitely different to “the UK is 97% worse at launching missiles than the US”, two common misinterpretations.)
Conclusions
Our first conclusion, then, is probably the opposite of Alberque’s: it’s likely that we have a higher failure rate launching our missiles. Especially when combined with the seemingly ever more serious problems with the UK’s submarine force, it looks very much as though the UK’s competence in operating its nuclear deterrent is dropping, and fast. I happen to think that’s an important problem, and something that, as a nation, we shouldn’t sweep under the rug.
There are other considerations; it seems that at least some of the failures have been caused by incorrectly following test procedures that wouldn’t pose a problem in combat (though, again, that doesn’t speak well of the crews’ competence). It’s hard to know one way or the other, due to the obvious sensitivity of the subject. Likewise, it should come as no surprise, given the doctrine of MAD that governs our deterrent, that the government’s line is that these failures are inconsequential and that the deterrent works; to admit failure of the deterrent rather defeats the object.
Defence-wise, then, I think another test sometime in the near future is strategically warranted. These failures might also signal that we’re running Trident too hard, and that more investment is required. If another launch were to fail, meanwhile, then I don’t think you could declare the deterrent operational.
That said, my point with this piece is not really to argue about the UK’s nuclear deterrent. Ultimately, the crux of this issue is whether the UK is worse at launching its missiles than the US, and trying to compare the two in this case without using the test is, frankly, a bit nonsensical.
It’s like someone trying to work out 485 × 593 without knowing how to multiply, so (given it would be far too tedious to add 485 to itself 593 times by hand) they just give up and take a wild (and probably totally inaccurate) guess, based on how they feel about the numbers (as you yell from the sidelines that there’s a magic operation that will just do it for them).
To be clear, this test (or, indeed, any test) is not the last word in accuracy, nor does it tell you everything you need to know about the subject. It represents nothing more than a best guess based on one way of looking at the data. In theory, you could go further; you could split the data by time, or include missile tests on the ground, or try and generate some confidence intervals (to tell you how much uncertainty really surrounds failure rates estimated from so few launches).
I’m not doing this, partially because the sample size would get too small, but mostly because I don’t think I need to. I don’t need a massively accurate number; what I wanted was a ballpark figure, something that wasn’t a total, wild guess. In this case, it’s very valuable, in that it leads to a pretty clear-cut conclusion that’s totally different to the original author’s.
I spent a lot of my time in my biomedicine degree studying statistics, and for good reason. Numbers, and the patterns within those numbers, are surprisingly unintuitive, and simply guessing based on the simplest of summary statistics is, outside of the most clear-cut circumstances, a recipe for erroneous conclusions.
So much of life – not just journalism, or biomedicine, but life in general – consists of making guesses about probabilities. Sophisticated techniques to stop guesswork being guesswork have been around for a very long time, are dead simple to use… and yet hardly anyone ever uses them.
These basic tests have an extraordinary number of potential uses, and, once you’re used to figuring out which one to pick, are no more difficult than putting the right numbers into a calculator. I may write more in a separate post (for brevity’s sake if nothing else) about how to choose which test to use, and when, in more real-world terms; but for now, remember that if you’re trying to compare two groups, you’re almost certainly a single Google search away from a much better answer than you might have thought possible.
Oh, and we should probably test Trident again.
*The result being significant at p<0.05 (or 5%) just means that, if there were really no difference between the groups, you’d expect to see data this extreme less than 5% of the time; p<0.05 is the standard cutoff for “statistical significance” in medicine.
**Note I didn’t say “the UK is worse”; this “two-tailed” test checks in both directions (i.e. whether the UK is better or worse), which does affect the values somewhat. You can correct for this, but one-tailed tests are less common, I didn’t bother loading a calculator for one, and I’ve just finished writing this, so I’m not doing it. Sue me.