by cyanmagenta 5 hours ago

I am having trouble understanding the distinction you’re trying to make here. The computer has the same pixel information that humans do and can spend its time analyzing it in any way it wants. My four-year-old can count the legs of the dog (and then say “that’s silly!”), whereas LLMs have an existential crisis because five-legged dogs aren’t sufficiently represented in the training data. I guess you can call that perception if you want, but I’m comfortable saying that my kid is smarter than LLMs when it comes to this specific exercise.

FeepingCreature 4 hours ago

Your kid, it should be noted, has a massively bigger brain than the LLM. I think the surprising thing here maybe isn't that the vision models don't work well in corner cases but that they work at all.

Also, my bet would be that video-capable models are better at this.