I seriously doubt that data set poisoning will be a real limiter in model performance.
For one, if your website/book is poisoned, who is going to trust it for anything at all, much less for training models?
For two, all the major AI labs hire or contract for subject matter experts to create curated data sets, evaluate model performance, etc.
Unless they hire malicious experts, this will provide a growing, high quality data set that should drown out any poisoned pretraining data.
There's a post every other month where some dude who put nonsense information online celebrates because it actually ended up in some frontier model's weights.
If it's easy enough that some randos can do it for fun, what do you think happens when there's commercial interest behind it?
Obviously companies are going to try nudging AI towards recommending whatever they're selling. It's a logical extension of SEO - and that's a 100 billion USD industry.
Additionally, if I believed myself to be in some sort of spending - err - AI race, I'd try to poison the data sets of my competitors by putting crap out there for others to ingest.
It's not really a problem. We're out of natural tokens anyway. The future is synthetic verifiable traces (already the way we train coding agents).
> synthetic verifiable traces
What does that mean? Is it like when somebody uses a coding agent to develop a feature, and later the input prompts and the resulting PR are used for training, on the presumption that the final PR was a correct implementation of the prompt?
Yeah, it’s rejection sampling: you have an agent, you take a verifiable problem (people use lots of different verification signals, but say unit tests) and have the agent attempt it K times. You accept the trajectories (all context, tool use, etc., the entire log) that are positively verified and use these as training examples.
The trick is to find the examples that fall between too difficult and too easy for the existing agent; these give the strongest training signal.
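Roughly, in Python. This is a minimal sketch of the idea, not anyone's actual pipeline: `attempt` and `verify` are hypothetical callables standing in for the agent and the verifier (e.g. a unit-test run), and the `min_rate`/`max_rate` thresholds are made-up numbers for "not too easy, not too hard".

```python
from typing import Callable, List, Tuple

Trajectory = str  # stand-in for the full log (context, tool use, output)


def rejection_sample(
    problems: List[str],
    attempt: Callable[[str], Trajectory],       # hypothetical: runs the agent once
    verify: Callable[[str, Trajectory], bool],  # hypothetical: e.g. unit tests pass
    k: int = 8,
    min_rate: float = 0.2,  # assumed cutoff: below this the problem is "too hard"
    max_rate: float = 0.8,  # assumed cutoff: above this the problem is "too easy"
) -> List[Tuple[str, Trajectory]]:
    """Return (problem, trajectory) pairs to use as training examples."""
    accepted: List[Tuple[str, Trajectory]] = []
    for problem in problems:
        # Have the agent attempt the same problem K times.
        trajectories = [attempt(problem) for _ in range(k)]
        # Keep only the attempts the verifier accepts.
        passed = [t for t in trajectories if verify(problem, t)]
        pass_rate = len(passed) / k
        # Skip problems the agent always solves or (almost) never solves.
        if min_rate <= pass_rate <= max_rate:
            accepted.extend((problem, t) for t in passed)
    return accepted
```

A problem with a pass rate of 0 contributes nothing either way, since there are no verified trajectories to keep; the filter mostly matters for dropping the problems the agent already solves every time.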
There are so many better data sources that AI labs can use here that this argument really holds no water at all.
Peer reviewed journals, textbooks, in-house teams of experts, trusted news publications, etc.
The whole idea of scraping large swaths of the internet for training data has always been pretty dubious due to the variable data quality.
I mean, just look at the early Google models that told people to put glue in their pizza due to a joke in the training set. Garbage in, garbage out.
This is one of the first and most obvious problems all of these labs have run into, and countermeasures are only going to improve.
But they don’t, generally. Which is why it is a great argument: it’s easy to falsify, and you can see that it is what is actually happening.
Also, those other sources are getting buried in AI slop too.
The question is not whether it has happened or will continue to happen. Of course it will always be a problem to some extent.
Your original claim is that this will be enough of a problem to prevent models from improving in expert level knowledge. I completely disagree with this premise.
If the models fail to improve, it will likely be due to limitations in the transformer architecture rather than poisoned training data.
And even then, I doubt that the transformer is the best architecture we will ever come up with.
Clearly it doesn’t learn or think like a human does, since humans don’t need many gigabytes of text samples to learn to talk, so there is some room for improvement.
https://arstechnica.com/science/2025/01/its-remarkably-easy-...
Great, an article about Llama 2 from early 2025. That doesn’t at all invalidate what I said.
While completely ignoring the fundamental reason. Whoosh.
Not sure what point you’re trying to make.
Do you have examples of such celebrations?
They already are. It has become a real problem on Reddit, especially with the latest pseudo-science crap like peptides.
I think you underestimate just how much money is being poured into LLM SEO at the moment. It's real quiet because they don't want to draw attention and countermeasures from the frontier labs, but this is getting huge investment, and they will have a monomaniac focus on juicing product results whereas the attention of the labs necessarily has to be spread out.
Data curation is important and expensive, and frontier labs can afford to do it right. Natural data isn't the limitation; we are already literally out of tokens. It doesn't matter how much you poison things, it's not going to stop the progress train.
Who's doing LLM SEO right now? How does that work when you only get feedback every few months when a new model is out?
I'm pretty sure the Optimization part is just ... not present at all.
This is how we get LLM summaries presenting something mentioned once by some nutjob in a reddit thread as bona fide FACT
Look at G2.com - they found their website is heavily referenced by AIs and they are leaning into it hard.
Pretty easy to display one thing to verified browsers (just latest few user-agents from the 10ish different mainstream browsers on the 3 main OSes) and another to anything else.
Yes AI scrapers can easily spoof user-agent, but they fall out of date as the browser updates.
Bit harder to catch them in tarpits and then serve nonsense to whoever triggered the tarpit.
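Something like this, as a rough Python sketch of both ideas. Everything named here is an assumption for illustration: the minimum versions in `MIN_MAJOR`, the `/trap/` path, and identifying clients by a `client_id` string (an IP address, say). It only looks at two browsers and ignores the OS part of the user-agent.

```python
import re

# Hypothetical minimum major versions that count as "latest few"; a real list
# would have to be updated on every browser release.
MIN_MAJOR = {"Chrome": 129, "Firefox": 131}

# Hypothetical tarpit path, linked only where a human would never click.
TARPIT_PREFIX = "/trap/"

UA_RE = re.compile(r"\b(Chrome|Firefox)/(\d+)")

# Clients that have requested a tarpit URL at least once.
flagged_clients = set()


def is_recent_browser(user_agent: str) -> bool:
    """True if the user-agent claims a known browser at a recent major version."""
    match = UA_RE.search(user_agent)
    if match is None:
        return False
    name, major = match.group(1), int(match.group(2))
    return major >= MIN_MAJOR[name]


def choose_content(client_id: str, path: str, user_agent: str) -> str:
    """Return "real" for plausible browsers and "decoy" for everything else."""
    if path.startswith(TARPIT_PREFIX):
        flagged_clients.add(client_id)
    if client_id in flagged_clients or not is_recent_browser(user_agent):
        return "decoy"
    return "real"
```

Once a client hits the tarpit it keeps getting the decoy on every later request, whatever user-agent it sends.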
>Yes AI scrapers can easily spoof user-agent, but they fall out of date as the browser updates.
It’s a hell of a lot easier for a company to ensure that its scrapers all report the latest user agent string than it is to get everyone and their mother to update their browsers in a timely fashion.
Yeah, but they only bother keeping it current if everyone is checking the version; if it's just a handful of websites checking it, they don't.
and browsers forcibly auto-update