In this month’s issue:
Semi-supervised makes the case for toy problems in real life.
We question the information value of LLM output in the dunghill.
The white stuff finds out that clinical research is not entirely free from methodological madness.
Plus a new problem-solving workshop for data scientists, the welcome rigour of DSPy, and how Quarto has taken over my output.
Semi-supervised
Toy problems are not just for textbooks, although you’d be forgiven for thinking so, given how rarely they appear outside academia. This is a shame since they are perfect for breaking down the more complex and intimidating problems that occur in real life.
I will explain, but first we should clarify what is meant by a toy problem. Most real-life data science problems are hard for reasons we explored in last month’s semi-supervised: that is, they are typically set in partially observable, multi-agent, stochastic, sequential, dynamic, continuous environments. A toy problem strips away the complexity by assuming a world that is simplified to the point of being unrealistic, or constrained in such a way that the complexity is kept out. Every exercise at the end of every textbook chapter is a toy problem. If they were real-world problems they simply would not fit on the page.
AI is full of well-known toy problems from chess and tic-tac-toe to stacking blocks and navigating Wumpus World. As the AI examples show, toy problems are designed to highlight particular aspects of the real world - other aspects can be temporarily put to one side. Toy problems are, if you like, the problem-solving equivalent of controlled experiments.
So how, outside academia, might they be useful? I have three distinct uses for toy problems, but they all involve, as you might expect, getting a handle on complexity.
First, the real-world problem is too hard, but I think I can solve it for a much simpler scenario. So I start with a simple toy world, solve the problem there, and then, one by one, add in the real-world components. With each addition, I re-solve the problem.1 Sometimes I hit an iteration that cannot be solved, but then at least I understand what makes the problem intractable. Of course this process never makes it all the way to the real-world problem. Long before I get anywhere close, I will have either hit upon the right way to go or satisfied myself that the problem is insoluble.
Second, the real-world problem seems mostly solvable but there is some aspect of it that I just can’t get my head around. This calls for the targeted toy model - the one that brings out just a single feature of the problem and simplifies away the rest. To do this I like to imagine I am writing the textbook exercise version of the problem, designed to focus the student on one thing only.
Third, I have a new algorithm or statistical technique that I am struggling to understand, and it doesn’t help that the problem it is being applied to is itself horribly complex. So instead I give the algorithm significantly easier tasks to perform. Here the toy problem usually involves a simulated data set which I can adjust, slowly adding complexity, until I understand better how the algorithm does what it does and what its strengths and weaknesses are. Sometimes - if I have too much time on my hands - I make a toy version of the algorithm (a much simplified version, built from the ground up) to apply to the toy problem. My favourite example in this vein can be found here.
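To make that third use concrete, here is a sketch of the kind of thing I mean. Everything in it - the two-cluster task, the random forest, the separation parameter - is invented purely for illustration; the point is the pattern of dialling complexity up and down while watching what the algorithm does.

```python
# A toy problem for probing an algorithm: two Gaussian clusters whose overlap
# we control. As the separation shrinks, we watch how a random forest's
# cross-validated accuracy degrades. Swap in whatever algorithm you are
# actually trying to understand.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)

def toy_dataset(n=500, separation=3.0):
    """Two classes drawn from 2-D Gaussians a given distance apart."""
    X0 = rng.normal(loc=0.0, scale=1.0, size=(n, 2))
    X1 = rng.normal(loc=separation, scale=1.0, size=(n, 2))
    X = np.vstack([X0, X1])
    y = np.array([0] * n + [1] * n)
    return X, y

# Start easy, then slowly add complexity by shrinking the separation.
for separation in [3.0, 2.0, 1.0, 0.5]:
    X, y = toy_dataset(separation=separation)
    acc = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5).mean()
    print(f"separation={separation:.1f}  cv accuracy={acc:.2f}")
```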
The obvious next question is: how do we know which features to include in a toy problem and which aspects to simplify? If you’ve been following the last few issues of Glasseye then you might find this less daunting: constructing an ontology will help purge the problem of abstractions and provide a list of entities to select from; a concept map will clarify how these entities relate to one another and give you some ideas about where to apply simplifications; perhaps best of all, the dichotomies in Russell and Norvig’s AI task environment will suggest simplifications: assume that the environment is discrete not continuous, episodic not sequential, and so on.
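If it helps, here is how that checklist might look written down for a hypothetical problem. The entries are invented; the habit I am recommending is simply putting the real-world assumptions and the toy assumptions side by side.

```python
# The Russell & Norvig task-environment dichotomies used as a checklist.
# Left: how the real-world problem actually behaves. Right: the simplification
# chosen for the toy version. (Entries are illustrative placeholders.)
task_environment = {
    # dimension:             (real world,      toy problem)
    "observability":         ("partial",       "full"),
    "agents":                ("multi-agent",   "single agent"),
    "determinism":           ("stochastic",    "deterministic"),
    "episodic vs sequential": ("sequential",   "episodic"),
    "static vs dynamic":     ("dynamic",       "static"),
    "discrete vs continuous": ("continuous",   "discrete"),
}

for dimension, (real, toy) in task_environment.items():
    print(f"{dimension:24s} real: {real:12s} toy: {toy}")
```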
It perhaps goes against the grain to step back from a problem, and you might get a raised eyebrow when your boss looks over your shoulder. But there’s nothing embarrassing about playing with toys. I have got them out for such varied problems as optimising supermarket shelving, detecting biomarkers in saliva, and understanding the spread of COVID in hospitals.
Please do send me your questions and work dilemmas. You can DM me on Substack or email me at simon@coppelia.io.
The white stuff
Of all the forms of human error, the one I find most fascinating is mass delusion, possibly because groups are able, through a combination of peer pressure and feedback mechanisms, to talk themselves into far more extravagant beliefs than the average lonesome individual.2 I know that this group madness happens in my world; I’m not surprised that it happens in academia, but one place I thought must be safe, since the stakes are so high and since faddism is so obviously discouraged, is the world of clinical research. That was until I read The Curious Rise of Randomised Non-Comparative Trials by Pavlos Msaouel in this month’s Significance Magazine. (The magazine is behind a paywall, but content usually becomes free one year after publication. My own contributions can be found here.)
Curious is right. Usually the abuse of statistics in science is quite understandable. The subject is counterintuitive, often badly taught, and marred by methodological fudges. But this particular abuse is not due to confusion. It is simple and flagrant. So simple that it can be summarised in a few lines.
The standard approach in clinical research is the randomised controlled trial (RCT). Participants are randomly allocated to groups (or arms), each of which receives a different treatment. Randomisation is crucial since it breaks the link between treatment allocation and every other factor, so that any systematic difference in outcomes between the arms can be attributed to the treatment.
Sometimes, when RCTs are impossible or unethical, a single-arm trial is used instead. A single group receives the experimental treatment and the outcome is compared to historical data on comparable patients. This is an observational study, not a controlled experiment. Some attempt can be made to handle confounding factors through modelling or matching, but the results are generally less reliable than those obtained by an RCT.
Randomised non-comparative trials (RNCTs), which Msaouel says are on the rise, are a nonsensical hybrid. They involve multiple arms to which participants are randomly allocated, “but instead of comparing the outcomes in these randomised arms, RNCTs compare each arm individually to historical or external data.”3 Why then randomise? Msaouel can find no satisfying rational explanation. In fact, “all the safeguards provided by randomisation have been cast aside – and all that remains is the use of the talismanic word ‘randomisation’.” When he runs a workshop for oncologists, the participants are surprisingly candid about their motivations: “We want the aura of an RCT… but we won’t do a formal comparison. Instead, we’ll compare each arm’s results separately against historical data. This lets us call it a ‘randomised trial’ without the heavy burden of a large sample size or the risk that a formal comparison might yield a ‘negative’ result.”4
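To see what gets thrown away, here is a small simulation of my own (nothing to do with Msaouel’s article beyond the general point, and with invented numbers): the historical comparator comes from a slightly healthier population, the treatment does nothing, and only the properly randomised comparison says so.

```python
# Why randomise and then not compare? The historical control population did
# a little better for reasons unrelated to treatment (a drift of 0.3 on some
# outcome scale). The randomised comparison is immune to that drift; comparing
# an arm against historical data absorbs it as a spurious treatment effect.
import numpy as np

rng = np.random.default_rng(0)
n = 200
true_effect = 0.0          # the experimental treatment does nothing
historical_drift = 0.3     # historical patients did better for unrelated reasons

control_arm   = rng.normal(0.0, 1.0, n)                 # randomised control
treatment_arm = rng.normal(true_effect, 1.0, n)         # randomised treatment
historical    = rng.normal(historical_drift, 1.0, 2000) # external comparator

rct_estimate  = treatment_arm.mean() - control_arm.mean()
rnct_estimate = treatment_arm.mean() - historical.mean()

print(f"RCT estimate of effect:  {rct_estimate:+.2f}")   # hovers around zero
print(f"RNCT-style estimate:     {rnct_estimate:+.2f}")  # biased by the drift
```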
Conjuring up the aura of a scientific method when the reality is anything but - this is not uncommon in business, and particularly so in media, advertising and marketing. But here at least no one is getting sick or dying. Let’s hope that Msaouel’s article has shamed it out of existence in clinical research.
The dunghill
Karl Popper must be one of the most professionally useful of philosophers. He is famous for his criterion for distinguishing science from pseudoscience (to be scientific a hypothesis must be falsifiable) and for his account of how science works (by putting forward conjectures and then ruthlessly attacking them). Even if both views have been shown to be overly simplistic they nevertheless work as biblical commandments for scientists. And great is the temptation to sin. No one wants to knock down an idea they have spent long hours building up; no business wants to challenge the premise on which it was built. All I can say is that it helps to think of Popper (not a happy-looking man) standing over your shoulder, shaking his head.
Recently, though, a specific implication of these theories has seemed especially relevant. If we broadly agree with Popper on how science works, then it is obvious that not all findings are of equal scientific value.
First, refutations are worth more than confirmations. At any one time we theorise against a background of accepted, orthodox hypotheses. A confirmation of one of these hypotheses is worth very little. (So it turns out people buy more ice-cream in the summer. Hooray.) Refutations by contrast are intrinsically surprising. In fact the informational value of a refutation might be gauged by how many of the background hypotheses it knocks down.
Second, of the conjectures that withstand criticism, we should prefer those that Popper describes as “bold”, leading to “novel predictions”. The most valuable are again those that do the most damage to the orthodox body of scientific opinion. The hypothesis of descent through natural selection is an obvious example.
In both cases the background theories render the result (refutation, conjecture holding its ground) improbable, and as every good student of information theory knows, the confirmation of an improbable event provides the most information.
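For anyone who wants that spelled out: the information content, or surprisal, of an event with probability p is -log2(p) bits, so a near-certain confirmation is worth almost nothing and an improbable refutation is worth a great deal. The probabilities below are, of course, made up.

```python
# Surprisal: information content of an event = -log2(probability).
# A near-certain confirmation carries almost no information; an improbable
# refutation carries a lot. (Probabilities invented for illustration.)
from math import log2

p_confirmation = 0.95   # "people buy more ice-cream in summer"
p_refutation   = 0.01   # a result that knocks down an accepted hypothesis

print(f"confirmation: {-log2(p_confirmation):.2f} bits")  # ~0.07 bits
print(f"refutation:   {-log2(p_refutation):.2f} bits")    # ~6.64 bits
```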
So why am I bringing this up now? Because I think it is useful for sifting the good from the bad among the opportunities currently offered by LLMs. Those selling these opportunities tend to downplay two important facts about LLM output: (a) its usefulness is less apparent when measured by its information content (as Popper pointed out, bland confirmations are worthless alongside earth-shattering refutations); and (b) the more valuable (information-rich) the content, the more certainty matters, since surprising, counterintuitive results usually require changes, and changes cost time and money.
And so when the cheerleaders for LLM-powered synthetic respondents tell us that there is an 88% correlation with answers given by real respondents, I want to know what that correlation looks like on just those findings that were surprising, and therefore of high informational value. And even if synthetic respondents were shown to be capable of delivering surprising, counterintuitive findings, I would want to know how, if not by a real survey, the certainty called for by surprising results would be achieved.
Similarly when it comes to the average LLM prompt response, I can’t help noticing how much hedging is built in. The output seems very low-grade in terms of surprisal, bland by comparison to human output (or at least the best of it).5 But again, even if this could be fixed, we still run into the same old bind - the more interesting the information, the more risk associated with it, and therefore the more certainty we want. And if hallucinations are intrinsic to LLMs (and I’m pretty convinced they are) then that certainty has to come from elsewhere.
If you have some particularly noxious bullshit that you would like to share then I’d love to hear from you. DM me on Substack or email me at simon@coppelia.io.
From Coppelia
If you’ve enjoyed the recent Semi-supervised posts on solving problems that don’t fit the mould then you might be interested to know that Coppelia now runs a three-day workshop, Real-world problem solving for data scientists. It is particularly aimed at those who have an academic background in data science, or a related discipline, but are puzzled about how to apply this knowledge to their day job. Do mail/message me if you are interested.
I’ve been spending some time this month getting to know DSPy (thanks to
for putting me on to this). Very interesting. It’s an attempt to place LLMs inside the framework of supervised learning. The most magical part is the way it treats the prompt as a kind of tuneable hyperparameter. I particularly welcome the opportunity to pit LLMs against other algorithms and get a fair assessment of what they are good at.
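For anyone curious what that looks like, here is a minimal sketch of the pattern as I understand it, on a made-up sentiment task and assuming a recent release of DSPy (the model name, examples and metric are all placeholders):

```python
# A minimal DSPy sketch: declare what you want (a signature), wrap it in a
# module, then let an optimiser tune the prompt against labelled examples,
# much as you would tune a hyperparameter. Task and data are invented.
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # placeholder model name

class Sentiment(dspy.Signature):
    """Classify the sentiment of a short review."""
    review = dspy.InputField()
    sentiment = dspy.OutputField(desc="either 'positive' or 'negative'")

classify = dspy.Predict(Sentiment)

# A handful of labelled examples plays the role of a training set.
trainset = [
    dspy.Example(review="Loved it, would buy again", sentiment="positive").with_inputs("review"),
    dspy.Example(review="Broke after two days", sentiment="negative").with_inputs("review"),
]

def metric(example, prediction, trace=None):
    return example.sentiment == prediction.sentiment

# The optimiser searches over prompts and demonstrations: the "tuneable" part.
optimiser = dspy.BootstrapFewShot(metric=metric)
tuned = optimiser.compile(classify, trainset=trainset)

print(tuned(review="Arrived late but works perfectly").sentiment)
```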
I’ve noticed that, since it was recommended to me by
and Andrie de Vries, Quarto has been gradually taking over my document output. Almost everything I produce that goes to a client (other than code) now goes via this superb package. And as yet no one seems to mind that it’s not Word or PowerPoint, contains full prose, references, etc. Raper’s Rule #231: if you treat your client like an idiot they are sure to do the same for you.
If you’ve enjoyed this newsletter or have any other feedback, please leave a comment.
If you do a lot of coding, this will be familiar as a standard debugging procedure.
My all-time favourite book on the subject is the neglected classic When Prophecy Fails (the study which introduced the term cognitive dissonance).
Pavlos Msaouel, Bad stats: The curious rise of randomised non-comparative trials, Significance, Volume 22, Issue 3, May 2025, Pages 40–44.
Msaouel is paraphrasing the participants (see the original article for more details).
My rule of thumb for trusting LLM output is in fact blandness-based. I trust most those facts which I reckon to be fairly well-known and therefore all over the training data.