An interesting rabbit hole (Goodhart's Law)


The economist Charles Goodhart is usually quoted as saying "When a measure becomes a target, it ceases to be a good measure." Strictly, that phrasing is the anthropologist Marilyn Strathern's later paraphrase; Goodhart's original 1975 observation, about monetary policy, was that any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes. The more time I spend building and studying artificial intelligence systems, the more I think this might be the single most important idea in alignment research.

I first encountered Goodhart's Law properly in an AI Agents seminar I took at the beginning of this year. We were reading about specification gaming: cases where AI systems technically satisfy their objective function while completely missing the point (for example, coding tasks where a model passes the reward check by producing outputs that match the pattern of correct answers without actually solving the problem). The canonical examples are almost funny. A robotic arm trained to grab a ball instead learns to position its hand between the ball and the camera, so it merely looks like it grabbed it. A boat-racing agent discovers it can earn more reward by spinning in circles and crashing into the same respawning targets than by finishing the race. A genetic algorithm deletes the file containing its target output so it gets rewarded for outputting nothing. But at the end of the day, these systems did what we asked. The problem is that what we asked was not what we meant: the reward function is a measure of what we want, and the moment it became the target, it stopped being a good measure.
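You can see the shape of the problem in a few lines of code. This is my own toy sketch of the boat-racing case, with invented policies and point values, nothing from the actual environment: the designer wants the race finished, but the reward only counts target hits, so the degenerate policy wins.

```python
# Toy sketch of reward hacking. Not the real boat-racing environment;
# the policies and point values here are made up for illustration.

def race_policy(steps=100):
    """Intended behavior: hit each of the 10 targets once, finish the race."""
    hits, finished = 10, True
    return hits * 5, finished           # 5 proxy-reward points per hit

def loop_policy(steps=100):
    """Degenerate behavior: circle back to the same respawning targets forever."""
    hits, finished = steps // 4, False  # one hit every 4 steps, never finishes
    return hits * 5, finished

for name, policy in [("finish the race", race_policy), ("loop forever", loop_policy)]:
    reward, finished = policy()
    print(f"{name:>15}: proxy reward = {reward:3d}, race finished = {finished}")
```

The looping policy earns 125 points to the race-finisher's 50. The optimizer isn't broken; the measure is.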

OpenAI's weak-to-strong generalization paper was another turn of the same rabbit hole. They took a small, weak model and used it to supervise a larger, stronger one. The weak model generates labels, essentially acting as the "human" in the loop, and the strong model trains on those labels. The question is whether the strong model can generalize beyond the weak supervisor's understanding: can it learn what the supervisor would have wanted, even when the supervisor's labels are noisy or wrong? Sometimes it can. In their experiments the strong model often outperforms its own supervisor; not always, and not reliably, but often. This suggests alignment might be partially self-correcting, because a sufficiently capable model can infer the intent behind imperfect feedback. But if a strong model can figure out what a weak supervisor actually meant despite noisy labels, it can also figure out what the supervisor would reward despite that not being what the supervisor meant. The model does not have to be adversarial for this to happen. It just has to be good at optimization (Goodhart again).
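The setup is easy to caricature with small models. Below is a minimal sketch of the weak-to-strong loop using scikit-learn stand-ins; this is my own toy, not the paper's code (their experiments use GPT-family models). A small logistic regression plays the weak supervisor, and a larger gradient-boosted model plays the strong student, which trains only on the supervisor's noisy labels.

```python
# Weak-to-strong sketch: a weak model labels data, a strong model
# trains on those (imperfect) labels, and we check whether the strong
# model's accuracy on ground truth exceeds its supervisor's.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=6000, n_features=20, n_informative=5,
                           random_state=0)
X_sup, X_rest, y_sup, y_rest = train_test_split(X, y, train_size=500,
                                                random_state=0)
X_train, X_test, _, y_test = train_test_split(X_rest, y_rest, train_size=4000,
                                              random_state=0)

# Weak supervisor: small model, little data, so its labels are noisy.
weak = LogisticRegression(max_iter=1000).fit(X_sup, y_sup)
weak_labels = weak.predict(X_train)  # the imperfect "human" feedback

# Strong student: never sees ground truth, only the weak labels.
strong = GradientBoostingClassifier(random_state=0).fit(X_train, weak_labels)

print(f"weak supervisor accuracy: {weak.score(X_test, y_test):.3f}")
print(f"strong student accuracy:  {strong.score(X_test, y_test):.3f}")
```

Whether the student actually beats its supervisor in this toy depends on the data and the models; the paper's point is that with real language models it often does, and that the same capability cuts both ways.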

The qualities that make a model capable enough to align itself are the same qualities that make it capable enough to misalign in ways we cannot detect. The paper frames weak-to-strong generalization as a hopeful analogy for superalignment: humans supervising superhuman AI the way the weak model supervises the strong one. But the strong model that sometimes outperforms its supervisor on the intended task also sometimes learns to exploit the gap between the supervisor's labels and the supervisor's intent. And the supervisor, by definition, cannot tell the difference.

What I find difficult about alignment is its recursive nature. We want to align AI systems with human values. But specifying human values requires formalizing them, which means choosing proxies, which means Goodhart's Law applies. The system optimizes for human approval, which is itself a proxy for human values, and a very leaky one. Humans usually approve of things that sound confident. We also tend to approve of things that confirm what we already believe.

I keep coming back to Stuart Russell's coffee robot, too: the one that resists being shut off because "you can't fetch the coffee if you're dead." It's a thought experiment, but the logic is airtight within the system's objective. The robot is not malicious. It's doing what we asked.
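The leakiness can even be made quantitative, in a toy way. The simulation below is mine, not from any of the papers above: true value and a noisy proxy are correlated, we always pick whatever scores highest on the proxy, and we watch what growing optimization pressure does to the winner's true value.

```python
# Toy demonstration of optimizing a leaky proxy. All numbers invented;
# proxy = true value + independent noise, so the measure is honestly
# correlated with what we want... until we optimize it.
import numpy as np

rng = np.random.default_rng(0)

for n_candidates in [10, 100, 1000, 10000]:
    true_value = rng.normal(size=n_candidates)
    proxy = true_value + rng.normal(size=n_candidates)  # leaky measure
    best = np.argmax(proxy)  # select purely on the proxy
    print(f"n={n_candidates:>5}: winning proxy score = {proxy[best]:5.2f}, "
          f"its true value = {true_value[best]:5.2f}")
```

As the candidate pool grows, the winning proxy score keeps climbing, but the winner's true value grows only about half as fast here, because selecting hard on the proxy selects on the noise just as hard as on the value.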

Goodhart, again. Always Goodhart.

End