The question is not whether it has happened or will continue to happen. Of course it will always be a problem to some extent.
Your original claim was that this will be enough of a problem to prevent models from improving in expert-level knowledge. I completely disagree with that premise.
If the models fail to improve, it will likely be due to limitations in the transformer architecture rather than poisoned training data.
And even then, I doubt that the transformer is the best architecture we will ever come up with.
Clearly it doesn’t learn or think the way a human does: humans don’t need many gigabytes of text samples to learn to talk, so there is room for improvement.
https://arstechnica.com/science/2025/01/its-remarkably-easy-...
Great, an article about Llama 2 from early 2025. That doesn’t invalidate what I said at all.
While completely ignoring the fundamental reason. Whoosh.
Not sure what point you’re trying to make.