HN via remix.js for vilnius.js

by brokencode 4 hours ago

There are so many better data sources that AI labs can use here that this argument really holds no water at all.

Peer reviewed journals, textbooks, in-house teams of experts, trusted news publications, etc.

The whole idea of scraping large swaths of the internet for training data has always been pretty dubious due to the variable data quality.

I mean, just look at the early Google models that told people to put glue on their pizza because of a joke in the training set. Garbage in, garbage out.

This is one of the first and most obvious problems all of these labs have run into, and countermeasures are only going to improve.

lazide 4 hours ago

But they don’t, generally. Which is why it is a great argument: it’s easy to falsify, and you can see that it is what is actually happening.

Also, those other sources are getting buried in AI slop too.

brokencode 4 hours ago

The question is not whether it has happened or will continue to happen. Of course it will always be a problem to some extent.

Your original claim was that this will be enough of a problem to prevent models from improving in expert-level knowledge. I completely disagree with this premise.

If the models fail to improve, it will likely be due to limitations in the transformer architecture rather than poisoned training data.

And even then, I doubt that the transformer is the best architecture we will ever come up with.

Clearly it doesn’t learn or think like a human does, since humans don’t need many gigabytes of text samples to learn to talk, so there is some room for improvement.

lazide 4 hours ago

https://arstechnica.com/science/2025/01/its-remarkably-easy-...

brokencode 4 hours ago

Great, an article about Llama 2 from early 2025. That doesn’t at all invalidate what I said.

lazide 2 hours ago

While completely ignoring the fundamental reason. Whoosh.

brokencode an hour ago

Not sure what point you’re trying to make.