In this month’s issue:
We take a deeper look into the draw of synthetic respondents and ask whether it’s a pseudo-science in the making.
Some bullshit of your very own in The dunghill.
A fascinating paper that should change the way you work in The white stuff.
And finally simsets: my new API for generating explainable data.
To start things off …
Synthetic survey respondents: A revolution in research methods or the worst idea ever?
Here’s a money-saving tip for research agencies. Instead of blowing all your cash on an expensive survey panel, spend it on just one person - an actor. Ask the actor to do a series of impersonations (“You are a middle-aged woman of average income living in a rural area.”) and with each impersonation ask the actor to complete the survey. Gather the results and proceed as usual. Be sure though that the list of characters passed to the actor is representative of the market you are interested in. Genius!
You could save even more by sacking the actor and substituting an LLM. Then the whole thing could be achieved programmatically by randomly generating characters for the LLM to impersonate, based on the distribution of demographic characteristics in the population.
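For the curious, here is roughly what that looks like in code. It is a minimal sketch, not a description of any vendor's product: the demographic categories and weights are invented, I sample them independently (a real panel spec would use joint distributions), and call_llm is a placeholder for whichever model you would actually use.

```python
# A minimal sketch of the "programmatic actor" idea. The demographic
# categories and weights are invented, they are sampled independently
# (a real panel spec would use joint distributions), and call_llm() is
# a placeholder for whichever model you actually use.
import random

DEMOGRAPHICS = {
    "age band": (["18-34", "35-54", "55+"], [0.30, 0.35, 0.35]),
    "income":   (["low", "middle", "high"], [0.30, 0.50, 0.20]),
    "location": (["urban", "suburban", "rural"], [0.40, 0.40, 0.20]),
}

def sample_persona():
    """Draw one synthetic respondent from the marginal distributions."""
    return {k: random.choices(levels, weights=w)[0]
            for k, (levels, w) in DEMOGRAPHICS.items()}

def build_prompt(persona, question):
    traits = ", ".join(f"{k}: {v}" for k, v in persona.items())
    return (f"You are a survey respondent ({traits}). "
            f"Stay in character and answer the question below.\n{question}")

def call_llm(prompt):
    raise NotImplementedError("plug in your LLM of choice here")

question = "How likely are you to switch energy supplier in the next 12 months?"
panel = [sample_persona() for _ in range(500)]
# answers = [call_llm(build_prompt(p, question)) for p in panel]
```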
This, in essence, is the idea behind synthetic respondents - and yes the reaction of the average person on the street is “Are you out of your mind?”
Inside the marketing community, however, some are already touting synthetic respondents as a possible complete replacement for survey-based research. Why might this be a bad idea?
The actor example seems absurd because we rely on the actor to accurately recreate the opinions of real-life respondents - which seems a stretch. If we are to take synthetic respondents seriously then we need some guarantee that they can faithfully recreate members of a real population. The standard response to this is that they are the product of an algorithm trained on an almost unimaginable amount of data. This is true; however, unlike traditional statistical methods where we know exactly how the size of a data set impacts the certainty of results, there is no theory for understanding how synthetic respondents relate to the training data, nor is there any information about how representative the training data is of real populations. What we need (and don’t have) is a measure of something the authors of this paper call algorithmic fidelity:
We define algorithmic fidelity as the degree to which the complex patterns of relationships between ideas, attitudes, and socio-cultural contexts within a model accurately mirror those within a range of human subpopulations.
Without a direct measure of algorithmic fidelity, we can only fall back on empirical tests such as whether the results of a real survey agree with one that is identical but synthetic - however this means running the real survey, placing us back where we started. But perhaps good fidelity on previously validated studies could be used to guarantee future performance. This might work for the more generic survey questions, but the most interesting questions are usually specific and detailed, and are therefore unlikely to have precedents. Vendors might supply an overall score for validation success (“I mean, it was 95% correlation.”) but that’s no good to us since it bundles the generic (easy to get right but obvious) in with the specific (hard to get right and interesting). My bet is that surveys involving synthetic respondents will perform worst on the questions that matter most.
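To see why a single headline correlation is so unhelpful, here is a toy calculation with invented numbers: three generic questions that match up nicely between the real and synthetic surveys, three specific ones that don't, and an overall figure that quietly blends the two.

```python
# Toy numbers only: mean agreement scores for six questions, real survey
# versus synthetic. The generic questions line up, the specific ones
# don't, and the overall correlation hides the difference.
import numpy as np

real      = np.array([4.1, 3.8, 4.4, 2.2, 3.1, 1.9])
synthetic = np.array([4.0, 3.9, 4.3, 3.4, 2.0, 3.0])
kind      = np.array(["generic"] * 3 + ["specific"] * 3)

def corr(x, y):
    return np.corrcoef(x, y)[0, 1]

for label in ("generic", "specific"):
    mask = kind == label
    print(f"{label:8s}: {corr(real[mask], synthetic[mask]):+.2f}")
print(f"overall : {corr(real, synthetic):+.2f}")
```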
Finally, I’d just like to note that I successfully polled some synthetic inhabitants of Middle Earth on several important policy questions (you can find the results here). The results make perfect sense, and leave me wondering how in the world of synthetic respondents we can ever be sure we haven’t drifted over into fantasy.
So what about a more tempered response? Synthetic respondents as a supplement or preliminary to old-fashioned research; or as something entirely different - a thinking tool to help with testing assumptions, generating hypotheses, and starting conversations. I spoke to some level-headed industry experts and they all agreed there was something in this:
My feeling is that there's a place for it sitting in between proper consumer research and personas that have been basically made-up, either explicitly when a group of people sat around a table and made them up, or when an agency has done something very, very limited like an internal staff survey and used it to build profiles. - Neil Charles (Sequence Analytics)
My optimistic view is that the most capable users will use it to get to the 'obvious' answer quickly and then use their resources working on how to be different (they will plan away from the average). - Simeon Duckworth (Melt)
It’s hard to argue with this; I can imagine some useful and interesting tools emerging. The big question, however, is whether sectors and industries not known for being overly cautious will have the self-restraint needed to stick to the valid use-cases.
In fact, there’s so much temptation around synthetic it almost feels like entrapment. For one thing, it is wide open to researcher degrees of freedom (see The dunghill below). I spoke to Scott Thompson, Insight Director at Publicis Media, who put it perfectly:
I remember one [vendor of synthetic respondents] cheerfully saying that if something goes wrong in “traditional” research, you can’t do anything about it once it’s in the field - but with synthetic respondents if you’ve phrased a question badly you can just keep on asking different questions until you get the answer you want. Which is just… saying the quiet bit out loud! If you’re plugging away until you get the answer you want, what happens if the answer you need isn’t the same thing? I see data that aligns with expectations going unquestioned, while numbers that don’t say what they think the client wants to hear get interrogated to death.
Fast and cheap as they are, there’s a good argument that synthetic respondents are inevitable, and we can only do our best to push things in the right direction. My fear is that synthetic respondents could fit easily into that particular category of bad science where the following are true: first, whether the results are correct will never be known unless they are checked; second, there is no incentive (in fact there is often a disincentive) to check the results; third, producing the results has some economic (or status) benefit for both the producer and the recipient.
This happened before in academia (in sociology and psychology as revealed by the replication crisis); it happened with some of the more disreputable techniques for measuring marketing and advertising effectiveness. It could easily happen again. We now have a very brief window before bad behaviours get normalised. So if you have any concerns it’s worth speaking up.
The white stuff
In this month’s dunghill (below) we talk about the Rashomon effect as introduced by Breiman in his well-known Two Cultures paper. Rashomon is a Japanese film in which multiple witnesses to a crime give contradictory testimonies, each of which fits the facts. Breiman saw an analogy with the situation in which many models do an equally good job of explaining the data and yet lead to different conclusions.
A Rashomon set is the set of almost equally accurate models for a particular data set and a particular function class.
This fascinating paper uses the idea of a Rashomon set to reach an intuitive conclusion that has immediate practical implications for anyone working in machine learning. The argument uses two premises:
When many different algorithms perform similarly on a particular problem, this is strongly correlated with a large Rashomon set (established empirically by the authors).
A large Rashomon set implies the existence of a simpler yet accurate model.
The obvious conclusion: if you find that many different algorithms (say random forests, logistic regression, gradient-boosted trees) are all producing comparable results, then it’s worth hunting for something simpler, given that simpler usually means easier to interpret and implement.
I can see this becoming a natural step in the search for a good machine learning solution.
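As a rough sketch of what that extra step might look like (using scikit-learn and a stock dataset purely as a stand-in for your own problem, and nothing from the paper itself): fit a few of the usual suspects, and if their cross-validated scores bunch together, check how close a deliberately simple model gets.

```python
# A rough sketch of the "many models agree, so hunt for something
# simpler" step, using scikit-learn's breast cancer dataset as a
# stand-in for your own problem.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

flexible_models = {
    "random forest":       RandomForestClassifier(n_estimators=200, random_state=0),
    "gradient boosting":   GradientBoostingClassifier(random_state=0),
    "logistic regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000)),
}

for name, model in flexible_models.items():
    print(f"{name:20s}: {cross_val_score(model, X, y, cv=5).mean():.3f}")

# If those scores bunch together, treat it as a hint of a large Rashomon
# set and see how close a deliberately simple model gets.
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0)
print(f"{'shallow tree':20s}: {cross_val_score(shallow_tree, X, y, cv=5).mean():.3f}")
```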
The dunghill
Every bullshitter knows the value of nouns. A noun makes something tenuous feel substantial - it names an object not an action, and if someone has gone to the trouble of creating a noun, then it must be an object worth taking seriously. The nouns that hit the big time are buzzwords. I don’t need to tell you what they are.
So when objectionable things happen in the world of data science it would be good to have our own nouns. Sometimes just being able to say “This is clearly a case of X” will do the trick. It will certainly do a lot better than a long-winded, increasingly desperate explanation to an audience primed for buzzwords.
So here is a short list of nouns and noun phrases that are on your side. Apologies to those of you who have heard them from me many times before. That’s because they are so good.
Researcher degrees of freedom refers to all the little choices a researcher (or their superior) gets to make along the way when conducting a piece of analysis, the choices that don’t seem to matter much individually but add up to one huge chunk of bias. Some examples are: deciding which variables to put into a model; deciding which assumptions to test; deciding which data points are outliers. It is as much a problem in business as it is in academic science. The original paper is here.
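If you want a feel for how quickly those little choices multiply, count them. The choice sets below are invented, but four modest-looking decisions already give you dozens of defensible analyses of the same data:

```python
# Counting the forking paths. The choice sets are invented, but four
# modest-looking decisions already multiply into dozens of analyses,
# each of which could be presented as "the" analysis.
from itertools import product

choices = {
    "outlier rule":      ["keep all", "drop beyond 2 sd", "drop beyond 3 sd"],
    "control variables": ["none", "age", "age + income"],
    "model":             ["OLS", "log-linear", "robust regression"],
    "outcome metric":    ["revenue", "profit"],
}

paths = list(product(*choices.values()))
print(f"{len(paths)} defensible-looking analyses of the same data")
```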
The Rashomon effect is a term invented by personal hero Leo Breiman to describe a situation where many different models are equally good at explaining the data (see The white stuff above). The crime here is to pick the model that suits your needs and pretend that the others do not exist (best done by not even looking for them).
Hark-ing¹ is the less famous sibling of p-hacking. It stands for Hypothesising After Results are Known, the problem being that you might shape your hypotheses around random patterns in the data and then point to those patterns as evidence supporting your hypothesis.
Double dipping - not a faux pas at the salad bar, it’s the practice of using the same data to both select and evaluate a model (or train and test an algorithm). No machine learning engineer worth their salt is going to make this mistake but it is not given enough attention by those building statistical models.
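Here is double dipping in miniature - a toy example with pure noise features, so any accuracy much above 50% is just the selection process flattering itself:

```python
# Double dipping in miniature. The features are pure noise, so an honest
# accuracy estimate should sit near 50%.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))       # 50 noise features
y = rng.integers(0, 2, size=200)     # labels unrelated to the features

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
candidates = [LogisticRegression(C=c, max_iter=1000) for c in (0.01, 0.1, 1, 10)]

# Double dipping: select and score the "best" model on the same data.
dipped = max(m.fit(X, y).score(X, y) for m in candidates)

# The honest version: select on the training split, report on the test split.
best = max(candidates, key=lambda m: m.fit(X_train, y_train).score(X_train, y_train))
honest = best.score(X_test, y_test)

print(f"double-dipped accuracy: {dipped:.2f}")   # flattering
print(f"held-out accuracy:      {honest:.2f}")   # close to chance
```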
The ecological fallacy is the mistake of thinking that group properties apply to individuals. In business, its most stubborn representative is the “average customer”. Consider bimodal distributions for quantities like age (not uncommon, especially when a product is used both by those who do not yet have children at home and by those who are no longer of working age). In such cases, the average age - a group property - will be the worst choice for representing a typical individual.
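A quick made-up illustration of just how unrepresentative that average can be:

```python
# A made-up bimodal customer base: younger customers and retirees. The
# average age lands in the one place almost nobody actually is.
import numpy as np

rng = np.random.default_rng(1)
younger = rng.normal(24, 3, size=5000)
older   = rng.normal(68, 4, size=5000)
ages = np.concatenate([younger, older])

mean_age = ages.mean()
share_near_mean = np.mean(np.abs(ages - mean_age) <= 5)

print(f"average age: {mean_age:.0f}")
print(f"customers within 5 years of it: {share_near_mean:.1%}")
```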
QRPs - yes, we can do acronyms too! It stands for questionable research practices and it’s all of the above, plus the regulars (correlation is not causation, biased samples, etc.). Ask an agency or supplier for details of the checks that are in place to mitigate against QRPs.
The chrysalis effect - A recent discovery but one I love. This is the process by which the ugly initial findings are slowly morphed through QRPs until they emerge as a beautiful, graceful, pleasing set of slides.
Aleatoric and epistemic uncertainty are useful terms not for calling out bad practice but for helping you handle unreasonable requests. Aleatoric uncertainty is irreducible uncertainty. It is always going to be there despite your best efforts. Examples include the uncertainty in your estimate that is due to natural variations in physical processes or to measurement error or to sampling. Epistemic uncertainty is due to lack of knowledge. We don’t know whether we have chosen the right model or whether we have the right data. Epistemic uncertainty can in theory be sorted out, usually at a price. Aleatoric uncertainty is non-negotiable.
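If you like, here is the distinction in one toy regression (invented numbers): more data shrinks the uncertainty in the estimated slope, but the scatter around the line is going nowhere.

```python
# Epistemic versus aleatoric in one toy regression. More data shrinks
# the standard error of the estimated slope (epistemic), but the scatter
# around the line (aleatoric) stays where it is.
import numpy as np

rng = np.random.default_rng(2)
true_slope, noise_sd = 2.0, 5.0

for n in (50, 500, 5000):
    x = rng.uniform(0, 10, size=n)
    y = true_slope * x + rng.normal(0, noise_sd, size=n)
    slope, intercept = np.polyfit(x, y, 1)
    residual_sd = np.std(y - (slope * x + intercept))
    slope_se = residual_sd / (np.sqrt(n) * np.std(x))  # std error of the slope
    print(f"n={n:5d}   slope se: {slope_se:.3f}   residual sd: {residual_sd:.2f}")
```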
There, you are armed with your own bullshit of sorts (only it happens to be true). Use it wisely.
If you have some particularly noxious bullshit that you would like to share then I’d love to hear from you. DM me on Substack or email me at simon@coppelia.io.
From Coppelia
Simsets
Meanwhile I’ve been working on another kind of synthetic data. A bit more old school but useful I hope. The Simsets web API, which I launched a few weeks ago, generates simulated data for some common scenarios. Why is this useful? Because if you created the data then you know the true parameters of the data generating process - something you will never know in real life.
I find this kind of data enormously useful. It is like having the answer sheet at the back of the book. I first used it to stress test some of the less believable claims that were being made about marketing mix models. I’ve since used it in training material, for testing code and for setting interview questions.
So I thought I’d save you the trouble of producing this data. The API gives you the data and the answers. There are only two endpoints so far but more to follow. Check out:
In both cases, to access the data programmatically use output_type=json as in this example. Note that the JSON includes the LaTeX for the model.
The code is available on GitHub, and there are examples of using the API in Jupyter for the explainable time series and the Simflix viewing data.
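If it helps, here is a rough sketch of what a call looks like from Python. The host and endpoint names below are placeholders (follow the links above for the real ones); the output_type=json parameter is the bit that matters.

```python
# A rough sketch of calling the API from Python. The host and endpoint
# below are placeholders (use the real links in the post); the
# output_type=json parameter is as described above.
import requests

BASE_URL = "https://<simsets-host>"            # placeholder
ENDPOINT = "<explainable-time-series>"         # placeholder

resp = requests.get(f"{BASE_URL}/{ENDPOINT}", params={"output_type": "json"})
resp.raise_for_status()
payload = resp.json()

# The JSON bundles the simulated data with the LaTeX for the data
# generating model, so the "answer sheet" travels with the data.
print(list(payload))
```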
Untangling deep learning in Python
I was very confused. I drew a concept map and now I’m less confused. Thought I’d pass it on. Circles are Python packages.
And here’s the mermaid script for the diagram.
If you’ve enjoyed this newsletter or have any other feedback, please leave a comment.
¹ Before you say anything, a gerund is a noun, as in “I hate p-hacking”.
(the middle-earth example is brilliant btw)
I thought the Counting Stuff post on synthetic panel data was a good take: "Since user testing is specifically about being surprised by how users react to a given product, using an LLM to simulate a generalized average experience rather misses the point. It's cheaping out on data collection in a way that undermines the purpose of the data collection."
https://www.counting-stuff.com/the-call-of-llms-is-strong/