Measurement Changes Behavior
In the early 2000s, the British government set out to reduce wait times at National Health Service emergency departments. The target was specific: virtually every patient should be seen, and admitted or discharged, within four hours of arrival. Hospitals that met the target received funding and recognition. Hospitals that missed it faced scrutiny. So hospitals adapted. Ambulances were held in parking lots before officially admitting patients, so the clock would not start ticking. Patients were moved from emergency to other wards just before the four-hour mark — not because they had been treated, but because the transfer reset the metric. Some patients were placed in corridors instead of emergency rooms, technically not “waiting in A&E.” The metric improved dramatically. The patient experience, by many accounts, did not.
This is not a story about dishonest hospital administrators. It is a story about what happens — inevitably, predictably, almost mechanically — when a measurement is turned into a target. The people inside the system did what the system rewarded. The system just happened to reward the wrong thing.
Charles Goodhart, a British economist advising the Bank of England in the 1970s, crystallized this phenomenon into what we now call Goodhart’s Law, most often quoted in the paraphrase: “When a measure becomes a target, it ceases to be a good measure.” It is one of the most important ideas in systems thinking, and understanding it is essential for anyone who manages, evaluates, governs, or tries to improve anything at all.
The Mechanism: Six Steps to Metric Corruption
The pattern unfolds in a remarkably consistent sequence, regardless of the domain.
First, you care about something real but hard to observe directly. You care about learning, not test scores. About health, not lab values. About software quality, not lines of code. About scientific knowledge, not publication counts. The thing you actually value is complex, multidimensional, and resistant to simple measurement.
Second, you find a proxy — a measurable quantity that correlates with the thing you care about. Test scores correlate with learning. Patient outcomes correlate with health. Publication counts correlate with research productivity. The proxy is not the thing itself, but it tracks the thing reasonably well. So far, so good.
Third, you attach incentives to the proxy. You reward high test scores with funding. You reward publication counts with tenure. You reward engagement metrics with algorithmic promotion. The proxy is now a target.
Fourth, people begin optimizing directly for the proxy. Teachers teach to the test. Researchers salami-slice papers (dividing one study into the maximum number of publishable units). Social media creators optimize for outrage because outrage drives engagement. The optimization is rational. It is what the system rewards.
Fifth, the optimization breaks the correlation between proxy and underlying reality. Test scores go up, but learning does not. Publication counts go up, but knowledge per paper goes down. Engagement goes up, but user satisfaction and wellbeing decline. The metric decouples from the thing it was supposed to measure.
Sixth, you are now rewarding behavior that does not produce what you wanted. The system appears to be succeeding — the numbers look great — while the underlying reality stagnates or degrades. And because the metric is the thing everyone is watching, the degradation can persist unnoticed for a long time.
In other words, the act of measuring does not just observe the system. It changes the system. And the change tends to make the measurement less trustworthy over time — precisely because people are paying attention to it.
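To make the decoupling concrete, here is a minimal simulation sketch. Everything in it is an illustrative assumption rather than data from any real system: an agent splits a fixed budget of effort between genuine work and gaming, the proxy metric rewards both, and only genuine work produces underlying value. Once incentives attach to the proxy, effort drifts toward gaming, so the proxy rises while real value falls.

```python
import random

def simulate(periods=20, incentivize_proxy=True, seed=0):
    """Toy Goodhart dynamics: each period an agent splits one unit of effort
    between genuine work and gaming. The proxy metric counts both; the
    underlying value counts only genuine work."""
    rng = random.Random(seed)
    gaming_share = 0.0  # fraction of effort spent gaming the metric
    history = []
    for t in range(periods):
        genuine = 1.0 - gaming_share
        gaming = gaming_share
        true_value = genuine                                  # only real work creates value
        proxy = genuine + 2.0 * gaming + rng.gauss(0, 0.05)   # gaming inflates the proxy cheaply
        history.append((t, proxy, true_value))
        if incentivize_proxy:
            # Rational adaptation: shift effort toward whatever the proxy rewards most.
            gaming_share = min(0.9, gaming_share + 0.1)
    return history

for t, proxy, value in simulate():
    print(f"period {t:2d}  proxy={proxy:5.2f}  true value={value:5.2f}")
# The proxy climbs toward ~1.9 while true value sinks toward ~0.1:
# the measurement improves precisely as the thing it measured degrades.
```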
The Observer Effect in Human Systems
In physics, the observer effect refers to the way that measuring a quantum system inevitably disturbs it. You cannot measure the position of an electron without hitting it with a photon, which changes its momentum. The measurement changes the thing being measured.
Something analogous happens in human systems, but through a different mechanism: not physical disturbance, but behavioral adaptation. Humans are strategic. When we know what is being measured, we adjust our behavior — consciously or unconsciously — to perform well on the measurement. This is not dishonesty (though it can shade into it). It is a basic feature of intelligent agency operating under incentives.
The economist Robert Lucas made this point about economic policy in what became known as the Lucas Critique: statistical relationships observed in the past will not hold once policymakers try to exploit them, because people change their behavior in response to the policy. The act of intervention, based on the measurement, invalidates the measurement.
This creates a deep epistemological problem (a problem about what we can actually know). The more consequential a metric is, the less reliable it becomes. Low-stakes measurements — where nobody cares about the result — tend to be accurate. High-stakes measurements — where careers, funding, and status depend on the outcome — tend to be corrupted. The importance of getting the measurement right is inversely related to the likelihood that the measurement will be right.
Examples Across Domains
Education may be the most widely felt example. We care about learning — genuine understanding, curiosity, the ability to think critically and apply knowledge to new situations. We measure test scores. Once test scores become high-stakes (tied to school funding, teacher evaluations, student advancement), the entire system orients around the test. Curricula narrow to tested subjects. Teaching shifts from deep understanding to pattern recognition and memorization. Students learn to pass the test, which is not the same thing as learning the material. Test scores can rise while actual learning stays flat or even declines.
Academia offers a parallel case. We care about the creation of knowledge. We measure publications, citations, and journal impact factors. Researchers respond rationally: they split findings into the smallest publishable units to maximize publication counts. They form citation rings (informal agreements to cite each other’s work). They pursue trendy topics that generate citations rather than important questions that might not. The file-drawer problem (the tendency to publish positive results and bury negative ones) distorts the literature. The metrics go up. The pace of genuine discovery does not necessarily follow.
Healthcare presents a case with life-or-death stakes. We care about patient health outcomes. We measure procedures performed, patients seen per day, readmission rates. Doctors who are evaluated on patient volume rush through appointments. Surgeons evaluated on surgical mortality rates avoid high-risk patients who need surgery most. Hospitals evaluated on throughput discharge patients early; hospitals penalized for readmissions keep returning patients under observation status so the return never counts as a readmission. The metrics look good. Patients do not necessarily get healthier.
Software development demonstrates the same pattern in a different domain. We care about value delivered to users. We measure lines of code, story points, sprint velocity, bug counts. Developers respond: verbose code inflates line counts, estimates are sandbagged to make velocity look stable, bugs are reclassified to keep counts down. Some of the best engineering work — simplifying a system, removing unnecessary code, preventing future problems — makes every measured metric look worse. The developer who deletes 500 lines of unnecessary complexity looks less productive than the one who added them.
Social media is perhaps the starkest contemporary example. We care about (or claim to care about) user value and satisfaction. We measure engagement: clicks, time on site, shares, comments. Algorithms optimize for engagement, which turns out to mean optimizing for outrage, conflict, and sensationalism — content that provokes strong reactions and keeps people scrolling. Engagement skyrockets. User wellbeing, informed public discourse, and social trust decline. The metric is maximized. The thing it was supposed to represent is destroyed.
Why This Is Unavoidable
There is a tempting response to all of this: just pick better metrics. And better metrics do help. But the fundamental problem is not that we chose bad proxies. It is that any proxy, once it becomes a target, will be optimized at some expense to the underlying reality.
The reason is structural. If you could measure the underlying thing directly, you would not need a proxy. The fact that you need a proxy means there is a gap between what you measure and what you want. That gap is exploitable. And when incentives are attached, the gap will be exploited — not necessarily through malice, but through the ordinary process of rational actors responding to their environment.
The only variable is severity. Some proxies are more robust (harder to game without also improving the underlying reality). Some incentive structures create weaker gaming pressure. But the dynamic is always present. There is no measurement system that is fully immune to Goodhart effects. Acknowledging this is not cynicism. It is engineering realism.
Mitigation Strategies
The problem cannot be eliminated, but it can be managed. Several strategies reduce the severity of metric corruption, and the best systems use them in combination.
Use multiple metrics simultaneously. It is harder to game several measurements at once than to game one. If you evaluate a school on test scores and student portfolios and graduation rates and student surveys, optimizing all four requires coming much closer to actually educating students. Multi-dimensional evaluation is more robust than single-metric management. The tradeoff is complexity: more metrics means more administrative burden and more ambiguity about overall performance.
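As a rough illustration of why combined metrics are harder to game, here is a small sketch. The metric names, the scores, and the choice of a geometric mean are all hypothetical: the point is only that when several normalized scores are aggregated this way, inflating one dimension while neglecting the others barely moves the overall number.

```python
from math import prod

def aggregate(scores: dict[str, float]) -> float:
    """Combine several normalized metrics (each in 0..1) with a geometric mean.
    Unlike a plain average, the geometric mean punishes neglecting any one dimension."""
    values = list(scores.values())
    return prod(values) ** (1 / len(values))

# Hypothetical school evaluation: balanced improvement vs. gaming one metric.
honest = {"test_scores": 0.75, "portfolios": 0.70, "graduation": 0.72, "student_survey": 0.68}
gamed  = {"test_scores": 0.95, "portfolios": 0.40, "graduation": 0.72, "student_survey": 0.45}

print(round(aggregate(honest), 3))  # ~0.712 -- balanced performance scores well
print(round(aggregate(gamed), 3))   # ~0.592 -- inflating one metric cannot rescue the rest
```

A plain average would let a spectacular test-score number compensate for collapsed portfolios and surveys; a geometric mean (or a simple minimum) will not, which is exactly the property that raises the cost of gaming.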
Rotate metrics periodically. If what is measured changes over time, gaming strategies have less time to entrench. A sales team measured on new customer acquisition this quarter and on customer retention next quarter cannot fully optimize for either at the other’s expense. The cost is lost comparability — you cannot track a single metric over decades if it keeps changing. But comparability is worth less than you think if the metric is being gamed.
Measure outcomes rather than proxies. Move the measurement as close as possible to what you actually care about. Instead of measuring procedures performed, measure whether patients get healthier. Instead of measuring publications, measure whether the research leads to practical applications or replicable findings. Outcome measurement is harder, more expensive, and often lagging (you have to wait for results). But it is substantially harder to game.
Incorporate human judgment. Some evaluation simply cannot be reduced to metrics without losing the thing that matters. Peer review, qualitative assessment, expert evaluation — these are expensive and inconsistent, but they are harder to game than numbers. The best evaluation systems combine metrics with judgment: the numbers flag where to look; the humans decide what they see.
Lower the stakes. When metrics are low-stakes — informational rather than consequential — gaming pressure is lower. A metric used for learning and diagnosis provokes less distortion than one used for rewards and punishment. Information without high-powered incentives can still guide decisions. It just cannot compel behavior, which is sometimes a feature rather than a bug.
The Meta-Problem
Here is the genuinely uncomfortable part. Attempts to fix Goodhart effects often create new ones.
You add additional metrics to prevent gaming — and people game the combination. You add monitoring to catch the gaming — and people game the monitoring. You add human judgment to supplement the metrics — and people game the judges. Each layer of correction introduces a new surface for optimization.
This does not mean all attempts are futile. Some measurement systems genuinely work better than others. A well-designed multi-metric system with periodic rotation and human oversight will produce less corruption than a single high-stakes number. The improvement is real, even if it is not a complete solution.
But it does mean there is no clean fix. Every measurement system will be gamed. The question is not whether, but how severely. The realistic goal is not perfect measurement but robust-enough measurement — a system that captures enough signal to be useful while degrading slowly enough to remain informative.
What This Means in Practice
Four practical principles follow from understanding Goodhart’s Law.
First, do not worship metrics. They are tools, not truth. Every number on a dashboard is a proxy for something more complex, and the proxy is always imperfect. Treat metrics as one input among many, never as the final word.
Second, watch for decoupling. When the metrics are going up but the felt reality is not improving, you are witnessing Goodhart in action. The gap between what the numbers say and what people on the ground experience is the most reliable diagnostic signal available. A rough version of such a check is sketched after this list.
Third, design for robustness, not perfection. Combine multiple metrics, use outcome measures where possible, integrate human judgment, rotate what is measured, and monitor for gaming. No single tactic is sufficient. Layered defense is the only realistic approach.
Fourth, expect gaming. Gaming is not deviance. It is the predictable response of rational agents to incentive structures. If you design a system assuming people will not game it, you have designed a system that will be gamed. Build assuming it happens. Be surprised when it does not.
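Picking up the forward reference from the second principle: here is a minimal sketch of a decoupling check, assuming you can occasionally audit the underlying outcome directly. Both data series, the window size, and the threshold are invented for illustration. The idea is simply to track the correlation between the dashboard metric and the audited ground truth over a rolling window, and flag when it collapses.

```python
from statistics import correlation  # Python 3.10+

def decoupling_alert(metric: list[float], audited: list[float],
                     window: int = 6, threshold: float = 0.3) -> list[int]:
    """Return the indices at which the rolling correlation between the dashboard
    metric and an independently audited outcome falls below the threshold."""
    alerts = []
    for end in range(window, len(metric) + 1):
        m, a = metric[end - window:end], audited[end - window:end]
        if correlation(m, a) < threshold:
            alerts.append(end - 1)
    return alerts

# Invented example: the metric keeps climbing after incentives attach,
# while occasional audits of the real outcome flatten out and then dip.
metric  = [1.0, 1.2, 1.3, 1.5, 1.6, 1.8, 2.1, 2.5, 2.9, 3.4, 3.8, 4.3]
audited = [1.0, 1.1, 1.3, 1.4, 1.6, 1.7, 1.7, 1.6, 1.6, 1.5, 1.4, 1.4]

print(decoupling_alert(metric, audited))
# Alerts fire only in the later windows, where the metric keeps rising but the audits do not.
```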
In other words, the measure is not the thing. The map is not the territory. And the moment you reward people for improving the map, the map will start looking better than the territory warrants. Knowing this does not make it go away. But it keeps you from confusing the scorecard with the game.
How This Was Decoded
Synthesized from economics (Goodhart’s original work on monetary policy targets, the Lucas Critique), organizational behavior (metric gaming in healthcare, education, and corporate management), science studies (publication bias, the replication crisis), and sociology (Campbell’s Law, which independently articulated the same insight). Cross-verified by confirming that the identical measurement-corruption pattern appears in every domain where metrics drive incentives — from emergency room wait times to social media algorithms to academic publishing. The mechanism is universal: measurement plus incentives produces optimization of the measure at the expense of the underlying reality.