by vunderba 9 hours ago

Anything that requires overriding concepts that are disproportionately represented in the training data is going to give these models a hard time.

Try generating:

- A spider missing one leg

- A 9-pointed star

- A 5-leaf clover

- A man with six fingers on his left hand and four fingers on his right

You'll be lucky to get a 25% success rate.
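
If you want to sanity-check that number yourself, a minimal harness is easy to write: generate each prompt a handful of times and score the outputs by hand (pass/fail here really needs human eyes). The sketch below uses the OpenAI Images API purely as a stand-in for whatever model you're testing; the model name, attempt count, and file paths are placeholders.

```python
# Minimal sketch: generate each "anti-prior" prompt several times and save the
# results for manual pass/fail scoring. The OpenAI Images API is only a
# stand-in for whichever model you actually want to test.
import base64
from pathlib import Path

from openai import OpenAI  # pip install openai

PROMPTS = {
    "spider_7_legs": "a spider missing one leg, so it has exactly seven legs",
    "star_9_points": "a nine-pointed star",
    "clover_5_leaves": "a five-leaf clover",
    "hands_6_and_4": "a man with six fingers on his left hand and four fingers on his right",
}
ATTEMPTS = 8  # per prompt; success rate = hand-counted passes / ATTEMPTS

client = OpenAI()  # reads OPENAI_API_KEY from the environment
out_dir = Path("anti_prior_tests")
out_dir.mkdir(exist_ok=True)

for name, prompt in PROMPTS.items():
    for i in range(ATTEMPTS):
        result = client.images.generate(
            model="gpt-image-1",  # placeholder; swap in the model under test
            prompt=prompt,
            size="1024x1024",
            n=1,
        )
        image_bytes = base64.b64decode(result.data[0].b64_json)
        (out_dir / f"{name}_{i:02d}.png").write_bytes(image_bytes)
```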

The last one is particularly ironic given how much work went into FIXING the old SD 1.5 issues with hand anatomy... to the point where I'm seriously considering incorporating it as a new test scenario on GenAI Showdown.

moonu 8 hours ago
vunderba 8 hours ago

Some good examples there. The octopus one is at an angle, so I can't really call that one a pass (unless the goal is "VISIBLE" tentacles).

Other than the five-leaf clover, the rest of the images (dog, spider, person's hands) all required a human in the loop to invoke the "Image-to-Image" capabilities of NB Pro after it got them wrong. That's a bit different, since you're actively correcting the model.
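
For anyone unfamiliar with what that human-in-the-loop step looks like: it's the standard image-to-image pattern, where you feed the model its own flawed output back with a corrective prompt and let it re-denoise from there. The sketch below is not NB Pro's actual interface, just the generic version using Hugging Face diffusers with an SD 1.5 checkpoint; the file names, strength, and guidance values are placeholders.

```python
# Generic image-to-image correction pass (not NB Pro's API): start from the
# model's flawed first attempt and re-denoise it with a corrective prompt.
# Sketched with Hugging Face diffusers; checkpoint and settings are placeholders.
import torch
from diffusers import AutoPipelineForImage2Image
from PIL import Image

pipe = AutoPipelineForImage2Image.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # placeholder checkpoint
    torch_dtype=torch.float16,
).to("cuda")

# Hypothetical first attempt that came out with the wrong leaf count.
init_image = Image.open("clover_attempt_0.png").convert("RGB").resize((512, 512))

corrected = pipe(
    prompt="a clover with exactly five leaves, botanical illustration",
    image=init_image,
    strength=0.6,        # how far to move away from the flawed original
    guidance_scale=7.5,
).images[0]
corrected.save("clover_corrected.png")
```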

XenophileJKO 7 hours ago

It mostly depends on "how" the models work. Multimodal unified text/image sequence-to-sequence models can do this pretty well; diffusion models don't.

vunderba 3 hours ago

Multimodal certainly helps, but "pretty well" is a stretch. I'd be curious to know which multimodal model in particular you've tried that can consistently handle generative prompts of the above nature (without human-in-the-loop corrections).

For example, ChatGPT's image generation is, to my knowledge, unified, and I can guarantee it can't handle something like a seven-legged spider.