HN via remix.js for vilnius.js

by chmod775 7 hours ago

There's a post every other month where some dude who put nonsense information online celebrates because it actually ended up in some frontier model's weights.

If it's easy enough that some randos can do it for fun, what do you think happens when there's commercial interest behind it?

Obviously companies are going to try nudging AI towards recommending whatever they're selling. It's a logical extension of SEO - and that's a 100 billion USD industry.

Additionally, if I believed myself to be in some sort of spending - err - AI race, I'd try to poison the data sets of my competitors by putting crap out there for others to ingest.

aspenmartin 6 hours ago | [-2 more]

It's not really a problem. We're out of natural tokens anyway. The future is synthetic verifiable traces (already the way we train coding agents).

maxnevermind 5 hours ago | [-1 more]

> synthetic verifiable traces

What does that mean? Is it like when somebody uses a coding agent to develop a feature, and later the input prompts and the resulting PR are used for training, on the presumption that the final PR was a correct implementation of the prompt?

aspenmartin 3 hours ago | [-0 more]

Yeah, it’s rejection sampling. You have an agent, you take a verifiable problem (people use lots of different verification signals, but say unit tests), and you have the agent attempt it K times. You accept the trajectories (all context, tool use, etc.; the entire log) that are positively verified and use these as training examples.

The trick is to find the examples that sit right between too difficult and too easy for the existing agent; these have the strongest training signal.
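
A rough sketch of that loop, in case it helps. Everything here is a toy assumption for illustration: the "problems" are additions, `toy_agent` is a hypothetical stand-in for a real agent, `verify` stands in for unit tests, and the `min_rate`/`max_rate` cutoffs are made-up numbers for "not too easy, not too hard", not anything a lab has published.

```python
from typing import Callable, Dict, List, Tuple

Problem = Tuple[int, int]        # toy task: add the two numbers
Trajectory = Dict[str, object]   # stand-in for the full agent log


def verify(problem: Problem, trajectory: Trajectory) -> bool:
    """Toy verifier standing in for unit tests."""
    a, b = problem
    return trajectory["answer"] == a + b


def rejection_sample(
    problems: List[Problem],
    attempt: Callable[[Problem, int], Trajectory],
    k: int,
    min_rate: float = 0.2,   # assumed cutoff: below this the problem is too hard
    max_rate: float = 0.8,   # assumed cutoff: above this the problem is too easy
) -> List[Trajectory]:
    """Attempt each problem k times and keep only verified trajectories
    from problems whose pass rate falls inside [min_rate, max_rate]."""
    kept: List[Trajectory] = []
    for problem in problems:
        trajectories = [attempt(problem, i) for i in range(k)]
        passed = [t for t in trajectories if verify(problem, t)]
        rate = len(passed) / k
        if min_rate <= rate <= max_rate:
            kept.extend(passed)
    return kept


def toy_agent(problem: Problem, attempt_index: int) -> Trajectory:
    """Hypothetical agent: always right on small sums, right on every
    other try for medium sums, never right on large sums."""
    a, b = problem
    total = a + b
    if total < 10:
        answer = total
    elif total < 100:
        answer = total if attempt_index % 2 == 0 else total + 1
    else:
        answer = total - 1
    log = f"attempt {attempt_index}: {a}+{b}={answer}"
    return {"problem": problem, "log": log, "answer": answer}


problems = [(1, 2), (20, 30), (500, 600)]
training_examples = rejection_sample(problems, toy_agent, k=4)
```

With these toy inputs only the middle problem survives: the easy one passes every time, the hard one never passes, and the medium one passes on half of its attempts, so its verified trajectories are the ones kept for training.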

brokencode 4 hours ago | [-6 more]

There are so many better data sources that AI labs can use here that this argument really holds no water at all.

Peer reviewed journals, textbooks, in-house teams of experts, trusted news publications, etc.

The whole idea of scraping large swaths of the internet for training data has always been pretty dubious due to the variable data quality.

I mean, just look at the early Google models that told people to put glue in their pizza due to a joke in the training set. Garbage in, garbage out.

This is one of the first and most obvious problems all of these labs have run into, and countermeasures are only going to improve.

lazide 4 hours ago | [-5 more]

But they don’t, generally. Which is why it’s a great argument: it’s easy to falsify, and you can see that it is what’s actually happening.

Also, those other sources are getting buried in AI slop too.

brokencode 4 hours ago | [-4 more]

The question is not whether it has happened or will continue to happen. Of course it will always be a problem to some extent.

Your original claim is that this will be enough of a problem to prevent models from improving in expert-level knowledge. I completely disagree with this premise.

If the models fail to improve, it will likely be due to limitations in the transformer architecture rather than poisoned training data.

And even then, I doubt that the transformer is the best architecture we will ever come up with.

Clearly it doesn’t learn or think like a human does, since humans don’t need many gigabytes of text samples to learn to talk, so there is some room for improvement.

lazide 4 hours ago | [-3 more]

https://arstechnica.com/science/2025/01/its-remarkably-easy-...

brokencode 4 hours ago | [-2 more]

Great, an article about Llama 2 from early 2025. That doesn’t at all invalidate what I said.

lazide 2 hours ago | [-1 more]

While completely ignoring the fundamental reason. Whoosh.

brokencode 2 hours ago | [-0 more]

Not sure what point you’re trying to make.

jurgenaut23 6 hours ago | [-0 more]

Do you have examples of such celebrations?

Shitty-kitty 5 hours ago | [-0 more]

They already are. It has become a real problem on Reddit, especially with the latest pseudo-science crap like peptides.