Glasseye
Issue 18: October 2025
In this month’s issue:
Semi-supervised explains why AI is not coming for your job (as long as you are doing it properly).
A sober assessment of AI agents in the white stuff.
The dunghill deplores the unreasonable demands on human classifiers.
Plus lovely tmux, the purity of todo.txt and another helping of survey slop.
Semi-supervised
I’m going to stick my neck out and say that you are not going to lose your job to an LLM-powered coding agent. As this has been the worry of more than one colleague over the last month, I will back this up with an argument. If I’m wrong, I’m very sorry.
It’s a two-part argument: (1) Coding agents are currently, and for the foreseeable future, awful at doing anything new. (2) If you are good at your job - data scientist or statistician - you are always doing something new.
For the first part of the argument, so that you’re not just taking my word for it, I’m enlisting the help of OpenAI co-founder and Silicon Valley insider Andrej Karpathy. He’s no AI doomster, so if anything he should be biased in the other direction.
But in a recent (extremely interesting) interview, he said the words that so many of us were repeating silently to ourselves: “This is a damn fine auto-complete tool, great for boilerplate code and prototyping, but I’m not sure it’s good for much else.” Ok, these aren’t his exact words, but if anything he’s even more blunt. (If you don’t believe me, listen to this segment.) He also expresses frustration at the claims being made on behalf of LLM-powered agents (this is from the person who coined the term ‘vibe-coding’). This time I will quote him verbatim: “They’re just cognitively lacking, and it’s just not working. And I just think that it will take about a decade to work through all those issues.”
He makes three further comments which I can’t help but love him for:
First, “I think it’s kind of annoying to type out what I want in English, because it’s just too much typing.” Oh yes.
Second, “A lot of times, the value that I brought to the company was telling them not to use AI. I was the AI expert, and they described the problem, and I said, don’t use AI.” Oh, but they listen to you, Andrej!
And third, have a listen to his take on the operating model for Waymo cars, in particular his strong suspicion that, sitting behind the apparent autonomy, there is a control room full of human telemetric operators.
But back to the argument. Karpathy’s observation is that the performance of LLM coding tools falls off when you are writing atypical code (“they are not very good at code that has not been written before”). This certainly chimes with my experience, and that of many of my colleagues. In fact, it seems the secret to efficient use of such tools is to develop a sense for when you are in well-mapped territory (LLM autocomplete on) and when you are straying onto the fringes (LLM autocomplete off). As I’ve said before, the single greatest efficiency hack is to set up a “shut up” shortcut key.
The second part of my argument rests on experience: when it comes to problem-solving in data science and statistics, in nearly three decades of graft, I’ve seen very few repeatable patterns. Problems can be similar, but never the same, and each difference requires a great deal of thought. In other words, each solution is a new solution. To those who refuse to believe me, I point to the shipwrecks of companies that tried to sell data science products, going right back to Autonomy in the early 2000s. I’m not talking about tools that will help a data scientist do their job, but rather off-the-shelf solutions that claim to automate away the data scientist. Whenever one appears, I do a little poking, and pretty soon I find the equivalent of Karpathy’s telemetric control centre - a roomful of STEM graduates, wondering why they are never mentioned in company presentations.
I don’t know exactly why reusable solutions are so rare in our line of work. If I had to guess, then it would be something to do with the fact that we typically solve problems that involve complex systems (businesses, supply chains, customer cycles), rather than components within systems, and these systems are always themselves unique.
So I have faith in the irreducible uniqueness of such problems, and from that I conclude that you and I will be fine. That said, if you spend your days regurgitating other people’s code, ignoring the specifics of your client’s problem, and shoehorning briefs into solutions that don’t fit, then you’d better look for another job. But I would have said that anyway.
Please do send me your questions and work dilemmas. You can DM me on Substack or email me at simon@coppelia.io.
The dunghill
The way to foil every act of terrorism is to incarcerate everyone. The way to prevent every case of domestic child abuse is to place every child in care. The way to catch every potentially fatal disease before it is too late is to monitor everyone, all of the time. Obviously this is stupid. But if we know this, then why, in each of these areas, is there so much moral indignation when anything less than 100 percent is achieved?
In terminology that will be familiar to most of you, the examples above achieve complete recall at the cost of abysmal precision. Recall is calculated as true positives / (true positives + false negatives). If we assume every citizen is a terrorist, then we will have no false negatives and recall will be 100 percent. Precision is calculated as true positives / (true positives + false positives). Since the vast majority of the population are not terrorists, the number of false positives will be enormous, and precision will be close to zero. I’m not telling you anything you don’t know already.
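(For concreteness, here it is as a toy Python sketch of the “assume every citizen is a terrorist” classifier; the population figures are invented purely for illustration.)

# A toy illustration of the "assume every citizen is a terrorist" classifier.
# The numbers are invented for illustration only.
population = 10_000_000       # hypothetical population
actual_positives = 50         # hypothetical number of genuine cases

# Flag everyone: every genuine case is caught...
true_positives = actual_positives
false_negatives = 0
# ...but everyone else becomes a false positive.
false_positives = population - actual_positives

recall = true_positives / (true_positives + false_negatives)
precision = true_positives / (true_positives + false_positives)

print(f"Recall:    {recall:.0%}")      # 100 percent: nobody slips through
print(f"Precision: {precision:.4%}")   # close to zero: almost everyone flagged is innocent

Perfect recall, laughable precision.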
But then you are lucky. You have had years of training machine classifiers. You know, in particular, that for any classification problem two kinds of improvement are available:
There’s the expensive kind: you improve the performance of the classifier itself - that is, you make it better at predicting the probability that any given individual is A or not A. This should improve both the recall and precision, or at the very least improve one without penalising the other.
And the cheap kind: doing nothing to improve the performance of the classifier itself, and accepting the predicted probabilities as they are, you raise or lower the threshold probability for a ‘Yes’. Lower it to zero, and you have the situation we opened with: everyone is flagged, so recall is 100 percent. Raise it towards one, and precision heads towards 100 percent, because you flag almost no one and so are almost never wrong when you do - but recall collapses.
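To see the cheap kind in action, here is a small Python sketch of a threshold sweep. The scores and labels are made up, and the classifier itself never changes - only the cut-off moves.

# A toy threshold sweep: the predicted probabilities stay fixed,
# only the cut-off for calling a 'Yes' moves. Data is invented for illustration.
labels = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]                          # 1 = genuine positive
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]     # classifier's predicted probabilities

def precision_recall(threshold):
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(predicted, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(predicted, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(predicted, labels))
    precision = tp / (tp + fp) if (tp + fp) else 1.0  # no 'Yes' calls at all: no false alarms
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall

for threshold in [0.0, 0.25, 0.5, 0.75, 1.0]:
    p, r = precision_recall(threshold)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")

At a threshold of zero, recall is perfect and precision is poor; as the threshold rises, precision improves and recall falls away, even though nothing about the underlying scores has changed.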
So what does this tell you about classification out in the real (much nastier than your notebook) world? First, the unscrupulous will try to sell us the cheap option, as though it were the expensive one. “The solutions are simple: lock everyone up, send everyone home; what’s all the fuss about?” Of course, here the chickens of abysmal precision will eventually come home to roost.
Second, there is in many cases a hard limit on how good the classifier - and now I mean a person or a process - can be. How can you know for sure whether someone will reoffend, or commit a lone wolf terror attack? The best practices, the greatest diligence, will take you up to this limit, but nothing short of supernatural powers will take you over it. Once that limit is reached, the only improvement available is the cheap kind: raise recall or precision, but always one at the expense of the other.
This means there are good, justifiable, thoughtful criticisms of classification failures, which ask why people or processes weren’t operating close to the limit of what is possible. But there are also petty and unreasonable criticisms, which blame people who could not possibly have done more. Once again, the cheap trick here is to pretend that it was always easy, and that perfect recall is achievable without cost to precision.
If you have some particularly noxious bullshit that you would like to share, then I’d love to hear from you. DM me on Substack or email me at simon@coppelia.io.
The white stuff
As you’ve probably guessed by now, I’m not at the hawkish end of the Great AI Debate. Last month I made a passing comment slagging off the buzzword agentic. To make amends, and to prove that I’m not a total Luddite, this month I buried my head in AI Agents vs. Agentic AI: A Conceptual Taxonomy, Applications and Challenges. It’s a well-written and level-headed paper, and it earned my respect almost immediately by doing some desperately needed room-tidying on the agentic concept. The authors do this by differentiating between AI agents, “defined as autonomous software entities engineered for goal-directed task execution within bounded digital environments”, and agentic AI, an “emerging class of systems [that] extends the capabilities of traditional AI Agents by enabling multiple intelligent entities to collaboratively pursue goals through structured communication, shared memory, and dynamic role assignment.” This is a very useful distinction. Unfortunately no-one else seems to be up for it.
All in all the paper is a pretty sober account of both AI agents and agentic AI, providing a sensible review of the opportunities and an exhaustive, unsparing list of the challenges. Perhaps, like me, you’ve been worn down by the relentless concept creep and need reminding that there might be something in the idea; or perhaps you are an AI evangelist, puzzled by all the frowning going on. Either way you’d do well to read this paper - and maybe shake hands in the middle.
From Coppelia
I’m now into month three of my retreat from GUI-land to the safety of the terminal and I’ve finally figured out why I’m doing it. It’s simple, probably obvious, but it didn’t hit me until now: in a world of pure text, those who wish to distract me with their buttons, images and sounds have been stripped of their powers. No wonder it’s such a peaceful place. My happy discoveries this month have been tmux (there’s something beautiful about the minimalist, bar-less panels into which it divides my screen) and todo.txt - the severest, most stripped-down todo list I’ve yet encountered.
I’m feeling partially vindicated this month in my crusade against synthetic respondents (AKA survey slop). Two papers have been published that voice similar concerns: The threat of analytic flexibility in using large language models to simulate human data: A call to attention and The Limits of Synthetic Samples in Survey Research. The second paper makes exactly the point I made here, about the likely failure of synthetic respondents when it comes to non-obvious insights: “While LLM-generated “synthetic samples” can approximate real-world population proportions on frequently asked and highly polarized poll questions, such as Donald Trump’s approval rating (LLM error was around 4 percentage points), LLMs badly predicted the public’s attitudes on less polarized and novel survey questions.” Unfortunately this point was missed in a further paper: LLMs Reproduce Human Purchase Intent via Semantic Similarity Elicitation of Likert Ratings. No effort is made to separate out surprising (and therefore information-rich) findings from the bleeding obvious, and if we can’t see that, then we don’t know how well the technique is doing in precisely the area in which it would be useful. Many thanks to those who forwarded me the papers!
If you’ve enjoyed this newsletter or have any other feedback, please leave a comment.
For the record, and since it has been pointed out, the mention of shipwrecks and Autonomy in the same sentence was just an unfortunate coincidence. I wouldn’t joke about that.