There are two kinds of people in life: those who like crude dualities and those who do not. And now that I’ve unmasked myself as the first, please allow me to introduce a statistical distinction I find helpful: windows vs. scoreboards.
A “window” is a number that offers a glimpse of reality. It does not feed into any incentive scheme. It cannot earn plaudits or incur punishments. It is a rough, partial, imperfect thing—yet still useful to the curious observer. Think of a psychologist asking a subject to rate his happiness on a scale from 1 to 10. This figure is just a crude simplification; only the most hopeless “1” would believe that the number is happiness.
Or imagine you’re a global health researcher. It’s not possible to quantify the physical and mental well-being of every human in a country. Instead, you look to summary statistics: life expectancy, childhood poverty, Pop-Tarts per capita. They’re not the whole reality, but they’re valuable windows into it.
The second kind of metric is a “scoreboard.” It reports a definite, final outcome. It is not a detached observation, but a summary judgment, an incentive scheme, carrying consequences.
Think of the score in a basketball game. Sure, bad teams sometimes beat good ones. But call the score a “flawed metric for team quality,” and people will shoot you the side-eye. You don’t score points to prove your team’s quality; you improve your team’s quality to score more points. The scoreboard isn’t a rough measure, but the desired result itself.
Or consider a salesperson’s total revenue. The higher this number, the better you’ve done your job. End of story.
A single statistic may serve as window or scoreboard, depending on who’s looking. As a teacher, I consider test scores to be windows. They gesture at the truth but can never capture the full scope of mathematical skills (flexibility, ingenuity, affection for “sine” puns, etc.). For students, however, tests are scoreboards. They’re not a noisy indicator of a nebulous long-term outcome. They are the outcome.
Many statistics make valuable windows but dysfunctional scoreboards. For example, heed the tale of British ambulances. In the late 1990s, the UK government instituted a clear metric: the percentage of “immediately life-threatening” calls that paramedics reached within eight minutes. The target: 75%.
Nice window. Awful scoreboard.
First, there was data fudging. Records showed loads of calls answered in seven minutes and 59 seconds; almost none in eight minutes and one second. Worse, it incentivized bizarre behavior. Some crews abandoned their ambulances altogether, riding bicycles through city traffic to meet the eight-minute target. I’d argue that a purpose-built patient-transporting truck in nine minutes is more useful than a bicycle in eight, but the scoreboard disagreed.