by lazide 9 hours ago

I don’t think they will improve, there is too much incentive to poison the datasets going forward.

A lot of the models up to this point have benefited - like Google did - from an essentially ‘pre-SEO’ internet.

Now the same tools are being used to generate nigh infinite good sounding bullshit, which poisons the dataset in all sorts of hard to detect ways.

To add insult to injury, the human experts are also not as naive, and have many incentives to poison their own input in subtle ways too.

brokencode 8 hours ago | [-21 more]

I seriously doubt that data set poisoning will be a real limiter in model performance.

For one, if your website/book is poisoned, who is going to trust it for anything at all, much less for training models?

For two, all the major AI labs hire or contract for subject matter experts to create curated data sets, evaluate model performance, etc.

Unless they hire malicious experts, this will provide a growing, high quality data set that should drown out any poisoned pretraining data.

chmod775 7 hours ago | [-12 more]

There's a post every other month where some dude who put nonsense information online celebrates because it actually ended up in some frontier model's weights.

If it's easy enough that some randos can do it for fun, what do you think happens when there's commercial interest behind it?

Obviously companies are going to try nudging AI towards recommending whatever they're selling. It's a logical extension of SEO - and that's a 100 billion USD industry.

Additionally, if I believed myself to be in some sort of spending - err - AI race, I'd try to poison the data sets of my competitors by putting crap out there for others to ingest.

aspenmartin 6 hours ago | [-2 more]

It's not really a problem. We're out of natural tokens anyway. The future is synthetic verifiable traces (already the way we train coding agents).

maxnevermind 5 hours ago | [-1 more]

> synthetic verifiable traces

What does that mean? Is it like when somebody uses a coding agent to develop a feature, and the input prompts and the resulting PR are later used for training, on the presumption that the final PR was a correct implementation of the prompt?

aspenmartin 3 hours ago | [-0 more]

Yea it’s rejection sampling: you have an agent, you take a verifiable problem (people use lots of different verification signals, but say unit tests, etc.) and have the agent attempt it K times. You accept the trajectories (all context, tool use, etc., the entire log) that are positively verified and use these as training examples.

The trick is to find the examples that sit right between too difficult and too easy for the existing agent; these have the strongest training signal.
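A minimal sketch of that loop, assuming a toy setup: `toy_agent` is a hypothetical stand-in that adds two numbers and is sometimes wrong, the verifier is a single equality check standing in for unit tests, and the 0.2 to 0.8 pass-rate band used to pick "in between" problems is an arbitrary illustrative choice, not something any lab has published.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Trajectory:
    log: List[str]  # the entire record: task, tool use, final output
    answer: int


def toy_agent(a: int, b: int, rng: random.Random) -> Trajectory:
    """Hypothetical stand-in for an agent: adds two numbers, sometimes wrongly."""
    answer = a + b + rng.choice([0, 0, 1, -1])
    return Trajectory(log=[f"task: {a} + {b}", f"answer: {answer}"], answer=answer)


def rejection_sample(
    a: int, b: int, verify: Callable[[int], bool], k: int, rng: random.Random
) -> Tuple[List[Trajectory], float]:
    """Attempt the problem k times and keep only the verified trajectories."""
    attempts = [toy_agent(a, b, rng) for _ in range(k)]
    accepted = [t for t in attempts if verify(t.answer)]
    return accepted, len(accepted) / k


def is_informative(pass_rate: float, low: float = 0.2, high: float = 0.8) -> bool:
    """True when the problem is neither always solved nor never solved.

    The thresholds are illustrative assumptions.
    """
    return low <= pass_rate <= high


if __name__ == "__main__":
    rng = random.Random(0)
    accepted, rate = rejection_sample(2, 3, lambda x: x == 5, k=16, rng=rng)
    print(len(accepted), rate, is_informative(rate))
```

The accepted trajectories would become training examples; the pass rate over the K attempts is one simple proxy for whether a problem is too easy or too hard for the current agent.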

brokencode 4 hours ago | [-6 more]

There are so many better data sources that AI labs can use here that this argument really holds no water at all.

Peer reviewed journals, textbooks, in-house teams of experts, trusted news publications, etc.

The whole idea of scraping large swaths of the internet for training data has always been pretty dubious due to the variable data quality.

I mean, just look at the early Google models that told people to put glue in their pizza due to a joke in the training set. Garbage in, garbage out.

This is one of the first and most obvious problems all of these labs have run into, and countermeasures are only going to improve.

lazide 4 hours ago | [-5 more]

But they don’t, generally. Which is why it is a great argument: it’s easy to falsify, and you can see that it is what is actually happening.

Also, those other sources are getting buried in AI slop too.

brokencode 4 hours ago | [-4 more]

The question is not whether it has happened or will continue to happen. Of course it will always be a problem to some extent.

Your original claim is that this will be enough of a problem to prevent models from improving in expert level knowledge. I completely disagree with this premise.

If the models fail to improve, it will likely be due to limitations in the transformer architecture rather than poisoned training data.

And even then, I doubt that the transformer is the best architecture we will ever come up with.

Clearly it doesn’t learn or think like a human does, since humans don’t need many gigabytes of text samples to learn to talk, so there is some room for improvement.

lazide 4 hours ago | [-3 more]
brokencode 4 hours ago | [-2 more]

Great, an article about Llama 2 from early 2025. That doesn’t at all invalidate what I said.

lazide 2 hours ago | [-1 more]

While completely ignoring the fundamental reason. Whoosh.

brokencode 2 hours ago | [-0 more]

Not sure what point you’re trying to make.

jurgenaut23 6 hours ago | [-0 more]

Do you have examples of such celebrations?

Shitty-kitty 5 hours ago | [-0 more]

They already are. It has become a real problem on Reddit. Especially with the latest in pseudo-science crap like peptides.

Analemma_ 8 hours ago | [-4 more]

I think you underestimate just how much money is being poured into LLM SEO at the moment. It's real quiet because they don't want to draw attention and countermeasures from the frontier labs, but this is getting huge investment, and they will have a monomaniacal focus on juicing product results, whereas the attention of the labs necessarily has to be spread out.

aspenmartin 6 hours ago | [-0 more]

Data curation is important and expensive and frontier labs can afford to do it right. Natural data isn't the limitation, we are already literally out of tokens. It doesn't matter how much you poison things it's not going to stop the progress train.

tayo42 7 hours ago | [-2 more]

Who's doing LLM SEO right now? How does that work when you only get feedback every few months when a new model is out?

natebc 6 hours ago | [-0 more]

I'm pretty sure the Optimization part is just ... not present at all.

This is how we get LLM summaries presenting something mentioned once by some nutjob in a reddit thread as bona fide FACT

DougN7 6 hours ago | [-0 more]

Look at G2.com - they found their website is heavily referenced by AIs and they are leaning into it hard.

microgpt 8 hours ago | [-2 more]

Pretty easy to display one thing to verified browsers (just latest few user-agents from the 10ish different mainstream browsers on the 3 main OSes) and another to anything else.

Yes AI scrapers can easily spoof user-agent, but they fall out of date as the browser updates.

Bit harder to catch them in tarpits and then serve nonsense to whoever triggered the tarpit.
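A rough sketch of the user-agent check described here, under stated assumptions: the version numbers in `LATEST_MAJOR` are hypothetical placeholders, only Chrome and Firefox are handled, and `looks_current` and `choose_page` are illustrative names rather than any real library's API.

```python
import re
from typing import Dict

# Hypothetical "current" major versions; a real deployment would have to
# keep this table up to date as browsers release.
LATEST_MAJOR: Dict[str, int] = {"Chrome": 126, "Firefox": 127}

_UA_PATTERN = re.compile(r"(Chrome|Firefox)/(\d+)")


def looks_current(user_agent: str, window: int = 2) -> bool:
    """True if the user-agent claims one of the latest few major versions."""
    match = _UA_PATTERN.search(user_agent)
    if match is None:
        return False  # unknown client, e.g. a scripted HTTP library
    browser, major = match.group(1), int(match.group(2))
    latest = LATEST_MAJOR[browser]
    return latest - window <= major <= latest


def choose_page(user_agent: str) -> str:
    """Pick which variant to serve based only on the claimed user-agent."""
    return "real" if looks_current(user_agent) else "decoy"
```

The check only inspects what the client claims, so a scraper that sends a current user-agent string passes it.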

thfuran 8 hours ago | [-1 more]

>Yes AI scrapers can easily spoof user-agent, but they fall out of date as the browser updates.

It’s a hell of a lot easier for a company to ensure that its scrapers all report the latest user agent string than it is to get everyone and their mother to update their browsers in a timely fashion.

microgpt 4 hours ago | [-0 more]

yeah but unless everyone is checking the version, if it's just a handful of websites checking it, the scrapers don't bother keeping it current.

and browsers forcibly auto-update

rvnx 8 hours ago | [-10 more]
something98 8 hours ago | [-7 more]

This is a very misleading statement; most of those physicians are using LLMs to transcribe notes from visits and/or for billing purposes (e.g., proper billing codes).

kjellsbells 7 hours ago | [-3 more]

The problem isn't LLMs per se, it is the shift to trusting the output of the machine coupled with a decline in verifying that the output is reasonable. It's basically what your teachers warned you about with Wikipedia in eighth grade, except applied to all areas of life, including medicine. Dictation is already high-stakes and LLMs do not automatically reduce that risk.

Here is an example. My provider sent me this note. I'm quoting verbatim here from my MyChart record:

"Your liver enzymes are high, I would like to order acetaminophen containing medication like Tylenol, I would like to order liver ultrasound I placed ultrasound order in the system, make an appointment for radiology, I would like you to get hepatitis panel lab work done, obtain blood work order, please schedule a well visit to get it done"

When I queried it, this is what I got back. It was a dictation error. You could almost hear the panic in the message:

"Sorry for wrong message earlier, I was dictated message- so could not realize that it was written to take Tylenol type of medicines- I DO NOT RECOMMEND ACETAMINOPHEN CONTAINING MEDICINE - LIKE TYLENOL AND ALCOHOL DUE TO ELEVATED LIVER ENZYMES."

Again the problem is not dictation, or LLMs. The problem is humans ignoring their responsibility to check the output of a machine.

ethbr1 6 hours ago | [-2 more]

> Again the problem is not dictation, or LLMs. The problem is humans ignoring their responsibility to check the output of a machine.

100%. Also, management.

I wish someone would go ahead and coin an AI version of Amdahl's law stating that the work speedup from AI depends on the amount of unverified AI output used.

Iow, if you 1:1 verified everything, there would be no time savings.

Ergo, you get management saying (1) we demand time savings due to AI & (2) we demand you fully check anything you use AI for.

End result? People skip (2) to hit (1).

Then management burns anyone at the stake whenever inevitable mistakes happen.
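One way to make that claim concrete, as a sketch rather than an established law: normalize the all-human time to 1 and let the time with AI be (1 - p) + p * (1/s + v * c). All four parameters and the function name `ai_speedup` are assumptions introduced for illustration: `p` is the fraction of work handed to AI, `s` is how much faster the AI produces output than a person, `v` is the fraction of AI output that gets verified, and `c` is the cost of verifying a unit of output relative to producing it by hand.

```python
def ai_speedup(p: float, s: float, v: float, c: float) -> float:
    """Overall speedup relative to doing all of the work by hand.

    p: fraction of work handed to AI (0..1)
    s: generation speedup of AI over a human (> 0)
    v: fraction of AI output that is verified (0..1)
    c: verification cost relative to doing the work by hand (>= 0)
    """
    if not (0.0 <= p <= 1.0 and 0.0 <= v <= 1.0 and s > 0.0 and c >= 0.0):
        raise ValueError("parameters out of range")
    time_with_ai = (1.0 - p) + p * (1.0 / s + v * c)
    return 1.0 / time_with_ai


if __name__ == "__main__":
    # Verify everything at full cost (v = 1, c = 1): no time savings.
    print(ai_speedup(p=1.0, s=10.0, v=1.0, c=1.0))
    # Verify nothing: the full generation speedup shows up.
    print(ai_speedup(p=1.0, s=10.0, v=0.0, c=1.0))
```

With v = 1 and c = 1 the time with AI is 1 + p/s, which is never below 1, matching the "1:1 verified" case; savings appear only when v or c drops.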

lazyasciiart 6 hours ago | [-1 more]

But that’s trivially false. There is an entire category of work where it is hard to come up with an answer and easy to verify the answer, which means that if you verified everything there would still be a large time savings.
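A toy illustration of that asymmetry, not a claim about typical LLM workloads: finding a nontrivial factorization by trial division takes many steps, while checking a proposed factorization takes one multiplication. The function names are hypothetical.

```python
from typing import Tuple


def find_factors(n: int) -> Tuple[int, int]:
    """Slow search: trial division up to the square root of n."""
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d, n // d
        d += 1
    return 1, n  # no nontrivial factor found


def verify_factors(n: int, a: int, b: int) -> bool:
    """Cheap check: one multiplication and two comparisons."""
    return a > 1 and b > 1 and a * b == n


if __name__ == "__main__":
    n = 10007 * 10009
    a, b = find_factors(n)          # thousands of loop iterations
    print(verify_factors(n, a, b))  # a single multiplication
```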

ethbr1 5 hours ago | [-0 more]

I would question whether that holds in the practical LLM automation space.

Can you think of any real life examples where an LLM is likely to be used?

I think in practice what you're saying is there are problems where there exist efficient deterministic verification methods, and I'm sure that's true.

But that's not the bulk of everyday work LLMs are being asked to do nowadays across industry.

girvo 5 hours ago | [-0 more]

Which is itself a problem, as (in my partner’s evaluations as an optometrist) LLMs used for clinical notes have a bad habit of dropping clinically important information, and the biggest providers don’t give you a copy of the raw transcript or a recording.

Which means she ends up spending just as much time as if she’d done it herself as it needs to be verified for accuracy every time…

brokencode 8 hours ago | [-1 more]

OpenEvidence is specifically meant to help clinicians make evidence-based decisions in the diagnosis and treatment of patients, not note transcription.

sxg 8 hours ago | [-0 more]
sarchertech 8 hours ago | [-0 more]

Ignoring the fact that this number comes from a company press release, it doesn’t say anything about the number of doctors using it to diagnose, just that they use it.

If a physician uses Google to search for a dosage chart for some drug they rarely prescribe, you wouldn’t say they are using Google to diagnose the patient. You wouldn’t say that either if they used Google to search for the most recent studies on a topic.

sambellll 8 hours ago | [-0 more]

To me this is like a good software engineer using AI.

The fact that they use it doesn't make the result any worse or less trustworthy - arguably it makes it better.

It only becomes a problem if they offload all of the thinking to AI.