In this month’s issue:
Semi-supervised gets strict and demands that you tidy your room!
We worry about the Orwellian implications of generative AI in the white stuff.
The dunghill pursues a hunch that something is not quite right with AI-driven sample size boosting.
Plus it’s out with the old and in with the new as polars, cursor and mkdocs nudge my old tools from the nest!
Semi-supervised
Last month, I started a series of posts on solving the kind of problems in statistics and data science that do not fit a standard template, and the advice was to define your ontology. This month we look at a complementary approach that I call “tidying the room” after a quote from Wittgenstein (no apology for pretentiousness!).
In philosophy we are not, like the scientist, building a house nor are we even laying the foundations of a house. We are merely ‘tidying up a room’.
What Wittgenstein meant was that many of the most difficult problems in philosophy can be solved simply by paying close attention to our use of language. If we straighten out the meaning of certain key concepts then these puzzles will simply disappear. This was quite an extreme view (which has since fallen out of fashion). More recently Daniel Dennett made a similar but less controversial claim when he said: “We philosophers have a taste for working on the questions that need to be straightened out before they can be answered.”
What I’m saying is that our work in data science and statistics requires a significant amount of room tidying before any building gets underway. Why us in particular? Because very often a data science project is the first time a business concept has required a precise definition.1 Before that moment, terms like “subscriber”, “lifetime”, “visit”, “touchpoint”, “lapsed”, “cost-per-acquisition”, etc. have meant what each person has wanted them to mean (the root cause of many a pointless meeting). If you forget to tidy the room before you get to work then you will end up building something on top of this chaos and that will please no one.
And very often the room tidying is the work or at least a substantial part of it. I have worked on more than one project where just clarifying the concepts solved the issue, just as Wittgenstein wanted it to do for philosophy. This means that to be a statistician or data scientist you’ve got to have, as Dennett puts it, a taste for this kind of work. If you like your problems delivered in neat little parcels then it probably isn’t for you.
So what strategies can I suggest for room tidying? Here are a few:
Define your terms, and I mean really define them: i.e. in such a way that they are non-tautologous and have strict boundaries. Is a customer someone who has just visited the site, or do they need to have made a purchase? When do they cease to be a customer? After how many months of inactivity? Is a customer in the real world the same as a customer on the database? Evidently not, if a single person can have more than one account. Most people are unaware of quite how difficult this activity is. (If you think any of the terms I listed above are self-evident then you’ve never tried it!) Clearly this is a task made much easier if you have an ontology in place.
Beware of dates and time windows - they are a particularly potent source of confusion. For some reason, time seems to get forgotten when defining our concepts.2 We treat things as though they are either permanent or happen in an instant. Time windows (e.g. a customer lifetime, the duration of an ad campaign, etc.) are particularly treacherous because each pair of windows can relate in several ways: A contains B, B contains A, A overlaps the beginning of B, B overlaps the beginning of A, or the two don’t overlap at all. Make sure you think about them all (there’s a little sketch of a relationship checker after this list).
Once you have a passable first draft of your key concepts, start to think about the relationships between them. There are many useful tools for doing this. Diagrams are indispensable. I suggest concept maps, causal diagrams, fishbone diagrams, Venn diagrams. Think about the relationships between measurements. If one grows should the other grow proportionally or should it tail off, or grow exponentially? Remember this is all a priori work, concept clarification - you have not even touched the data yet!
Above all bring the business (or client) with you. It is their room you are tidying, so there should be an ongoing conversation in which you test your latest revision of a definition against their knowledge of the reality. Never start from zero. A common sense definition should be your base. I can guarantee that if you approach a business person with “What is an X?” then you will come away empty-handed, but if you ask “Is X a Y or a Z?” then you will get somewhere. Use plenty of examples to refine definitions or illustrate incompatible scenarios.
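To make the time-window point concrete, here is a minimal sketch of the kind of relationship checker I have in mind. Everything in it - the function, the dates, the campaign-versus-lifetime example - is made up for illustration.

from datetime import date

def relationship(a_start: date, a_end: date, b_start: date, b_end: date) -> str:
    """Classify how time window A relates to time window B (equal windows count as containment)."""
    if a_end < b_start or b_end < a_start:
        return "A and B don't overlap"
    if a_start <= b_start and b_end <= a_end:
        return "A contains B"
    if b_start <= a_start and a_end <= b_end:
        return "B contains A"
    if a_start < b_start:
        return "A overlaps the beginning of B"
    return "B overlaps the beginning of A"

# e.g. how does an ad campaign (A) relate to a customer lifetime (B)?
print(relationship(date(2024, 1, 1), date(2024, 6, 30),
                   date(2024, 5, 1), date(2025, 4, 30)))
# -> A overlaps the beginning of B

Walking the business through a handful of these cases is exactly the kind of example-driven conversation suggested above, and it surfaces disagreements about definitions remarkably quickly.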
Do all of this, and do it in a friendly way, without condescension or finger-pointing, and you will be off to a very good start. The client will be reassured that you care about the actual problem and that you know what you are talking about.
Please do send me your questions and work dilemmas. You can DM me on Substack or email me at simon@coppelia.io.
The white stuff
Will human-produced art and entertainment - music, novels, movies, poetry - soon become a luxury product, only for the connoisseur? Will the rest of us be fed on mass-produced, machine-generated dross? Will media consumption soon be divided between the cultural equivalents of a farmer’s market and a budget supermarket? These are the questions hinted at towards the end of ChatGPT’s Poetry is Incompetent and Banal. This short but fascinating paper is itself a response to AI-generated poetry is indistinguishable from human-written poetry and is rated more favorably, in which the authors ran two sets of experiments to determine how poetry produced by ChatGPT compares to the real thing. In the first they asked their respondents to pick out genuine poems by famous poets from among ChatGPT-produced imitations; in the second they asked their respondents to rate a selection of ChatGPT and human-produced poems without knowing which was which. The result was that in the identification exercise, “participants performed below chance levels in identifying AI-generated poems (46.6% accuracy, χ2(1, N = 16,340) = 75.13, p < 0.0001)”, and that in the preference experiment they overwhelmingly preferred the AI-generated poems.
It’s a nice irony that the person leaping to the defence of poetry in the second paper, Ernest Davis, is a professor of computer science at New York University, while the champions of ChatGPT are professors within the humanities; even better, while the latter run and analyse a set of statistical experiments, Professor Davis makes his points using textual criticism. You might think I’d be team stats, but no, I agree with Davis: the ChatGPT efforts are undoubtedly crap. As he puts it:
All in all, the AI poems seem like imitations that might have been produced by a supremely untalented poet who had never read any of the poems he was tasked with imitating, but had read a one-sentence summary of what they were like.
But then why did the results in the first paper come back as they did? In short, Davis’s answer is that most people don’t understand poetry. They don’t expect it to be “difficult and spiky”. On top of this they have low expectations of AI output and so mistake the weirdness of real poetry for AI imperfections. Thus we cannot conclude from these results that there is no difference, qualitatively, between human-generated and AI-generated poetry. Davis argues that:
one could formulate reasonable, measurable, psychological and linguistic criteria under which the real poem is hands down more sophisticated, richer, thought-provoking, deeper, etc. But a preference for the cheery, shallow AI poem may be perfectly reasonable.
This is uncomfortable territory. On the one hand it might feel elitist to claim that most people just don’t get poetry; on the other, it feels reassuring that, with a bit of effort to take us beyond the banal, humans can still comfortably outrun LLMs.
Davis ends his paper by referring us to a similar experiment conducted by the literary critic I. A. Richards in the 1920s and described in detail by George Orwell. In this version, students were presented with rarely seen poems by major poets and bad poems by minor poets and asked to evaluate them. The results were of course that supposed “lovers of poetry have no more notion of distinguishing between a good poem and a bad one than a dog has of arithmetic.” This leads Davis to conclude:
I also think it is a safe bet that the idea that, one hundred years later, scientists would write that drivel generated by an automaton is “indistinguishable” from Shakespeare and Whitman would not have occurred to I.A. Richards in his darkest dreams, and would have occurred to Orwell only in his darkest dreams.
Orwell’s darkest dreams famously took shape in 1984, where novels and song lyrics were written by machines “for the benefit of the proles”. Is that where we are heading? Is that what the first paper is pointing to? I suspect things are not that serious. In selecting poetry the authors have picked a genre which, let’s face it, most people either dislike or are indifferent to. Had they picked film or music or fiction, matching their participants to genres they genuinely cared about, then I think the results would have looked different. Still, in all the discussion about AI rising to approach human intelligence, not much is said about the possibility that, numbed by its output, we might meet it halfway.
The dunghill
This month’s dunghill is based on a hunch. Once again it’s market research in the frame but this time the topic is AI-driven sample size boosting. This is quite different from the synthetic respondents issue we covered back in the July issue. There we questioned the wisdom of using LLMs as a surrogate for real world respondents. There’s no suggestion here that LLMs are involved. Rather, or so it is claimed, techniques borrowed from image processing are used to “amplify the patterns in the data”.
Let’s look at what sample size boosting says it can do. We’ll pick on Toluna as they seem to be making a lot of noise, but there are a lot of competitors offering similar services:
Toluna HarmonAIze Boost, the first product in the new suite, empowers clients to conduct deep-dive analysis on small or niche subgroups where collecting enough real-world data would traditionally be time-consuming or impossible. By amplifying patterns in existing data, Toluna HarmonAIze Boost helps unlock valuable insights without the need for additional data collection.
I’m currently at the self-doubting stage where it still seems inconceivable that so many businesses would have invested so much money in something that doesn’t do what it says it does. But still I have a vague feeling that all is not as it should be. At a high level this is based on the following:
I can’t yet find any rigorous academic research to explain or back up the big ideas behind sample size boosting.
I’m pretty sure that where information is concerned there can be no free lunch: a small data set is a small data set.
I know that everyone is under enormous pressure to produce AI-driven tools.
But I also have some more detailed concerns. I’m going to stay on Toluna as I went to the trouble of watching the introductory video to HarmonAIze Boost. There we learn that sample-size boosting will unlock insights from small subsets within our data - hard-to-reach groups, or people we didn’t know we would be interested in (you might already be screaming QRPs!). These groups can be as small as 50 and they will be magnified to around a thousand, after which they can be fed into familiar analytical processes such as regression models for key driver analysis, clustering algorithms and factor analysis.
To help explain the sample-size boosting process the presenter used the example of digital image upscaling, a process which uses various statistical and machine learning-based algorithms to enlarge digital photos while avoiding pixelation. Just as you can pinch and zoom on a digital image, the presenter explained, so sample-size boosting allows you to pinch and zoom on your low-resolution (small sample) data set. The impression is that those “niche subgroups” that sample size boosting will allow us to zoom in on are like heavily pixelated figures in the background of a photo. I’m not sure this analogy works. If a pinky-brown pixel stands in an image where a person’s face should be (and would have been with a higher resolution camera), then no amount of modelling is going to bring back the face. In fact it’s more likely to be the opposite - the pixel will be erased as noise. However the de-pixelation is done (smoothing, VAEs, CNNs), the principle is the same. A model of some kind is fitted, which hopefully preserves the signal and throws away noise - you get a sharper but more basic shape. What the process does not do is reveal some previously hidden intricate details.
So much for the analogy, but then perhaps it failed to capture all the cleverness. What about the actual process used? The presenter mentions that a Gaussian copula model is being fitted and that their model was inspired by work in medical imaging. From that I assume she is referring to something like the process described in this paper. If you read closely you will see that here a Gaussian copula model is being used as a feature extraction process - it just so happens that the features are passed on by sampling data from the learnt copula and feeding it into the classifier, which is a rather unorthodox way of doing so. Could someone within the market research sector have read this paper and concluded that this simulation step, rather than extracting simpler, more stable features, in fact created more detail?
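To make the mechanism concrete, here is my rough guess at what a copula-style “boost” might look like in code. To be clear, the data, the two-column survey and every step below are invented for illustration; I have no idea what Toluna’s actual pipeline does.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# A made-up "survey": 50 respondents, two correlated 0-10 ratings
n = 50
z = rng.multivariate_normal([0, 0], [[1, 0.6], [0.6, 1]], size=n)
small = stats.norm.cdf(z) * 10

# 1. Convert each column to normal scores via its empirical ranks
u = (stats.rankdata(small, axis=0) - 0.5) / n
normal_scores = stats.norm.ppf(u)

# 2. "Fit" the Gaussian copula: estimate the correlation of the normal scores
corr = np.corrcoef(normal_scores, rowvar=False)

# 3. Draw 1,000 new rows from the copula and map them back through the
#    empirical quantiles of the original 50 respondents
boosted_z = rng.multivariate_normal(np.zeros(2), corr, size=1000)
boosted_u = stats.norm.cdf(boosted_z)
boosted = np.column_stack([np.quantile(small[:, j], boosted_u[:, j]) for j in range(2)])

print(boosted.shape)  # (1000, 2) - but every value is recycled from the original 50 rows

Nothing in step 3 conjures up information that wasn’t already in the 50 rows and the estimated correlation matrix; the 1,000 rows are just a smoother restatement of it.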
As every good machine learning engineer knows, feature extraction is part of the classifier. It is not a magical step that creates a better data set prior to the learning. But I’m worried that this is what Toluna and their competitors are doing: learning a model from the original data, using that model to generate more data, and then suggesting we feed this data into more traditional models.
Why is that a problem? Well, it’s not if you bolt together the Gaussian copula model and factor analysis and call it a dimension reduction tool, or the Gaussian copula model and regression and call it a regressor in the machine-learning sense. But if you think you are somehow doing factor analysis, or regression in the classical sense (where statistical inference is used to estimate parameters) on an expanded sample, with all the joy that a higher N will bring to the certainty around your parameters, then you are sadly mistaken.
Could this all be a misunderstanding? The large sample coming out of the Gaussian copula model, despite its size, contains less detail than the original - that’s the whole point: it’s not a sample from the original population, it’s a sample from a model fitted to the data, designed to bring out its broad features. That’s why in the paper I cited above it leads to more stable predictions. But researchers are raised on the mantra that more sample equals more detail. Not in this case.
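Here, under the same caveat that this is a toy sketch and not anyone’s actual product, is what happens if you hand the boosted sample to a classical regression and believe the standard errors. (For brevity a bivariate normal stands in for the copula; the point is the same.)

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)

# The real, small sample: y depends weakly on x, n = 50 (made-up data)
n = 50
x = rng.normal(size=n)
y = 0.3 * x + rng.normal(size=n)

def slope_se(x, y):
    """Standard error of the slope from an ordinary least squares fit."""
    return sm.OLS(y, sm.add_constant(x)).fit().bse[1]

# "Boost" to 1,000 rows by sampling from a model fitted to the 50
mean = np.array([x.mean(), y.mean()])
cov = np.cov(x, y)
boost = rng.multivariate_normal(mean, cov, size=1000)

print(f"slope SE on the real n=50:      {slope_se(x, y):.3f}")
print(f"slope SE on the boosted n=1000: {slope_se(boost[:, 0], boost[:, 1]):.3f}")

The second standard error comes out roughly sqrt(50/1000) times the first, yet no new respondents were interviewed. That extra certainty is exactly the joy of a higher N that I’m claiming you are not entitled to.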
As I said at the beginning, I’m going to be a coward at this point and hedge wildly. I’ve only just started looking at this. If I’m wrong please educate me.
If you have some particularly noxious bullshit that you would like to share then I’d love to hear from you. DM me on Substack or email me at simon@coppelia.io.
From Coppelia
Out with the old, in with the new
The toolbox has had a spring clean this month with three very successful substitutions:
Polars for pandas: I wholeheartedly recommend this one. Pandas has served us well but it is time. It’s a particular joy to be free from pandas indexes (no more reset_index), but I’m also enjoying the tidyverse-style chaining and the overall simpler syntax (a quick taste below this list).
Cursor for VS Code: I was a bit late to the party here, but I’m glad I showed up. As everyone says, the collaboration with LLMs is almost frictionless. You can feel your brain changing!
Mkdocs for Sphinx: I wanted to go from Python docstrings to package documentation written in markdown. Sphinx, my go-to, seemed to struggle, which led me to mkdocs. Fantastic, with lots of marketplace add-ons.
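For anyone curious about the chaining, here is a tiny, made-up example - filter, aggregate and sort in one readable pipeline, with no index to set or reset anywhere.

import polars as pl

df = pl.DataFrame({
    "customer": ["a", "a", "b", "b", "c"],
    "spend": [10.0, 25.0, 5.0, 0.0, 12.0],
})

summary = (
    df.filter(pl.col("spend") > 0)                        # keep rows with real spend
      .group_by("customer")                               # no index juggling required
      .agg(pl.col("spend").sum().alias("total_spend"))
      .sort("total_spend", descending=True)
)
print(summary)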
New year, new skills
Use your training budget effectively this year. Instead of spending it on generic, web-based courses that are forgotten the moment they’re over, let me prepare something specific for you and your team, based on problems you face right now. I cover just about any topic in AI, data science and statistics. See here (although a bit out of date) and here for some ideas.
If you’ve enjoyed this newsletter or have any other feedback, please leave a comment.
I do accept that rigorous definitions are needed to construct databases, applications, etc. but once that has happened, a maddeningly imprecise language grows up around them.
To give a particularly extreme example, I recently reviewed documentation for a model optimising the flow of traffic through a network in which time was not mentioned once!