Generality
Once upon a time, a group of scientists built a machine learning model to predict whether a radiological scan contained a tumor. They trained it on a random subset of labeled data, tested it against data that had been held back, and determined that it performed well. When they went to test it with real patients, it was useless. Another group of researchers built a model to predict whether TCP packets were malicious. They followed a similar experimental setup and got similarly useless results. Why had these carefully trained models failed?
Well, as it turns out, what the first team of scientists had actually built was a model that detects rulers on radiological scans. And the second team had built a model that classifies TCP packets based on whether their MTU matched Windows’ or Linux’s defaults. Neither of those was what the authors had intended, or what they thought they were building. The models were utterly lacking in generality. In both cases, the lack of generality was a product of correlations in the training data that weren’t evident to the researchers.
I recently asked an LLM to produce a single-file HTML MP3 player with a particular set of controls (skip buttons, playback speed, etc.). It performed flawlessly. A few days later I asked the same LLM to make some changes to a Rust library I had written. It absolutely face-planted.
I am not here to offer any sort of framework for articulating why it was good at one task and poor at another. Rather, I want to observe that it lacked the generality I would expect from something that had been characterized as being very good at programming. LLMs are far more complex than the classifier models I described; they’re not so easily characterized as “actually this is a model that does X instead of Y,” nor is their lack of generality so straightforwardly a property of correlations in their training data. Nevertheless, the level of generality they do (or do not) possess is an essential question if we’re interested in understanding these models, or in thinking robustly about their uses and impact. However, it’s not a question we’re likely to be able to answer with that level of generality (pardon the pun).
When people discuss the capabilities of these models, there is a tendency to articulate them in terms of human development (e.g., “this model is like a high schooler, that one is like a PhD student”). This is a mistake. We have a bunch of intuitions about the level of generality with which humans can solve problems, and there’s no particular reason to believe those intuitions are relevant to linear algebra (also, no LLM I have interacted with in the last few years has been like any high school student I have ever met).
I believe it is far more useful to discuss these models in terms of what tasks they are useful for, with specificity. Perhaps the LLM I used was good at front-end JavaScript and bad at proc-macro Rust. But more likely the set of things it is good and bad at is not quite that easy to characterize (nor does it fit into buckets that are quite so legible to humans). And specificity is important: with humans we can often take for granted that a human who can do task X well can also do task Y, and that presumption does not hold for linear algebra.
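To give a rough sense of what I mean by specificity, here is a minimal sketch of a per-task scorecard rather than a single “good at programming” number. The task names, prompts, and pass/fail checks below are invented placeholders for illustration, not a real benchmark, and `complete` stands in for whatever function actually calls your model:

```python
from typing import Callable

# Each narrow task gets its own cases and its own pass/fail check; nothing is
# averaged into a single "coding ability" score. These tasks and checks are
# invented examples, not a real benchmark.
TASKS: dict[str, list[tuple[str, Callable[[str], bool]]]] = {
    "front-end JavaScript: <audio> playback controls": [
        ("Write JS that sets an <audio> element's playbackRate to 1.5.",
         lambda out: "playbackRate" in out),
    ],
    "Rust: proc-macro changes": [
        ("Add a helper attribute to this derive macro: ...",
         lambda out: "attributes(" in out),
    ],
}

def scorecard(complete: Callable[[str], str]) -> dict[str, float]:
    """Per-task pass rates, reported separately.

    `complete` is whatever function sends a prompt to the model you use
    and returns its response.
    """
    return {
        task: sum(check(complete(prompt)) for prompt, check in cases) / len(cases)
        for task, cases in TASKS.items()
    }
```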
What all of this calls for is dramatically more rigorous evaluation of how these models perform on the specific tasks we’re using them for, not just by LLM creators but also by consumers. This is frustrated somewhat by one of the most common ways that evaluations of machine learning systems miss the mark: data set contamination. A model recalling some input it was trained on is an entirely different proposition from a model that generalizes out of distribution. Unfortunately, effectively no LLM vendor publishes enough information about its training data for someone to answer the question, “has my benchmark leaked into the training set?”
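If you did somehow have access to (a sample of) the training corpus, even a crude n-gram overlap check would go some way toward answering that question. This is only a sketch of the idea; the 13-token window and whitespace tokenization are arbitrary choices of mine, not a standard:

```python
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    # Crude whitespace tokenization; a real check would use the model's tokenizer.
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(benchmark_items: list[str], corpus_docs: list[str]) -> float:
    """Fraction of benchmark items sharing at least one n-gram with the corpus."""
    corpus_grams: set[tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc)
    flagged = sum(1 for item in benchmark_items if ngrams(item) & corpus_grams)
    return flagged / len(benchmark_items) if benchmark_items else 0.0
```

The particular window size isn’t the point; the point is that without access to the training data, even a check this crude is impossible for consumers to run.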
Modern LLMs are fascinating and powerful tools, and there are plenty of tasks that they really are useful for (particularly when paired with user experiences that are designed for them). But far too often people talk about their capabilities as if they’re uniform across many different tasks, or as if skill at one task implies skill at another in the same way it does for humans. Neither is correct, and it occludes clear thinking about what these models are really useful for.