Discussion about this post

Dominic Bates:

Some interesting topics! I think most online models (e.g. ChatGPT) are fairly detectable because we specifically train them not to respond like normal humans, e.g. in the fine-tuning / RLHF steps, and we also almost never sample from the probability distributions completely randomly (i.e. temperature = 1, top_p = 1).
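For anyone unfamiliar with those two knobs, here is a minimal numpy sketch of what they do at the sampling step (my own illustration, not any particular library's implementation):

```python
import numpy as np

def sample_token(logits, temperature=1.0, top_p=1.0, rng=None):
    """Sample one token id from raw logits with temperature scaling and top-p truncation."""
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())  # numerically stable softmax
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                   # tokens by descending probability
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1   # smallest nucleus covering top_p
    keep = order[:cutoff]
    return rng.choice(keep, p=probs[keep] / probs[keep].sum())

# With temperature = 1 and top_p = 1 this draws straight from the unmodified
# softmax, which is the "completely random" case mentioned above.
print(sample_token([2.0, 1.0, 0.1, -1.0], temperature=1.0, top_p=1.0))
```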

I did quite a bit of research in this area a couple of years ago (admittedly a bit out of date now), but I think if you took a large base model without the fine-tuning steps and sampled from the softmax completely randomly, a lot of these measured stylistic differences might disappear (especially around lower variation and word choice). The sketch below shows the kind of experiment I mean. Although I guess most people are probably just using the default ChatGPT model and settings, so perhaps we don't need to worry too much!
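A concrete version of that experiment, assuming a Hugging Face setup, with GPT-2 standing in for "a large base model without fine-tuning":

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 is a base model with no instruction tuning / RLHF applied.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The most surprising thing about", return_tensors="pt")
# do_sample=True with temperature=1.0 and top_p=1.0 draws directly from the
# unmodified softmax, i.e. the "completely random" sampling described above.
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=1.0,
    top_p=1.0,
    max_new_tokens=50,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```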

Dominic Bates:

It was primarily a literature review, so I only did a small amount of playing around with building detectors myself, but I did read almost all the literature that was around at the time. Most studies were using ChatGPT output or similar, so they will have included the fine-tuning / RLHF bits, but most didn't go into much detail on their dataset creation and didn't compare models, so there were no particularly robust results.

Yeah, I also suspect we prefer some kind of simplicity, which is reinforced by the RLHF and fine-tuning steps. But I have no evidence to support this.

Yeah, good question. I would suspect it's just a slightly dodgy black-box detector that was trained on a limited dataset? So it works fairly well generally, but occasionally the probabilities are wildly wrong for models / topics / writing styles that are not part of the training data (at least that was the case 2 years ago). It must be almost impossible to get a complete like-for-like human / AI dataset with all possible styles, models, sampling parameters, etc.
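Purely as an illustration of what such a detector might look like under the hood, here is a toy sklearn sketch (the corpora are placeholders, and real detectors are presumably more sophisticated):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Placeholder corpora: a real detector would be trained on large collections,
# but typically drawn from a narrow set of models, topics, and styles.
human_texts = ["an example of human writing", "another human-written passage"]
ai_texts = ["an example of model output", "another model-generated passage"]

detector = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
detector.fit(human_texts + ai_texts, [0] * len(human_texts) + [1] * len(ai_texts))

# predict_proba is only meaningful for text resembling the training data;
# for unseen models, topics, or styles the probabilities can be wildly wrong.
print(detector.predict_proba(["some new text to score"])[0, 1])
```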
