I'm curious about the current limitations of image generators. Specifically, why don't they have a reasoning stage that allows them to review what they've created? It seems like it could be useful for these generators to recognize when an image doesn't match the user's request and potentially make adjustments.
7 Answers
Every time I generate a Rube Goldberg machine, it looks like one, but without a logical sequence of actions. The model can even explain why its design wouldn't work, and then it just repeats the same mistake with a new design, which is really frustrating.
It's like a waiter and chef scenario. The language model (the waiter) understands your request, but if something gets lost when it's handed to the image model (the chef), the output misses the mark. Even if the reasoning module can identify the problems, that doesn't mean the image-generation module can actually act on them.
Most image generators are diffusion models rather than autoregressive transformers, which limits how much reasoning can happen during the generation process itself.
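For context, a diffusion sampler is basically a fixed loop of denoising steps, so there's no natural point where the model could pause, look at what it has so far, and check it against the request. Here's a minimal sketch of that loop; every component (text_encoder, unet, scheduler, init_noise) is a placeholder I'm assuming for illustration, not a real library API:

```python
def sample(prompt, text_encoder, unet, scheduler, init_noise, steps=50):
    """Illustrative diffusion sampling loop with placeholder components.
    Each step just removes a bit of predicted noise; nothing in here
    'reads' the partial image and compares it to the prompt."""
    cond = text_encoder(prompt)               # turn the prompt into conditioning vectors
    x = init_noise                            # start from pure noise
    for t in scheduler.timesteps[:steps]:
        noise_pred = unet(x, t, cond)         # predict the noise present at this step
        x = scheduler.step(noise_pred, t, x)  # subtract a fraction of it
        # note: no semantic check of x against the prompt anywhere in the loop
    return x                                  # a real pipeline would decode this with a VAE
```

Adding a "review" stage means bolting a separate model onto the outside of this loop, which is exactly the extra cost people are talking about.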
Yeah, but recently most of these models have been shifting to transformers. It's interesting how quickly things are evolving in AI.
Honestly, it's just a matter of resources. Even though these big tech labs have access to loads of computing power, they still prioritize efficiency and cost, leading to these inconsistencies in AI behavior.
I think this type of reasoning has been demonstrated, like by Google at some point. However, it's time-intensive and requires a lot of computing power. Plus, the problem is that if a model keeps modifying the same image, it can compound its errors instead of fixing them.
Generating images and doing reasoning both take a lot of resources. Plus, if the AI keeps creating new images on its own, the result can drift away from what the user actually had in mind. A workaround could be to refine the prompt after the first generation instead of starting fresh every time.
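To sketch that workaround: a simple generate-then-critique loop where a separate model reviews the image and rewrites the prompt, rather than touching the image itself. All of the function names here (generate_image, critique, refine_prompt) are hypothetical placeholders, not any particular API:

```python
def generate_with_review(user_prompt, generate_image, critique, refine_prompt, max_rounds=3):
    """Hypothetical generate -> critique -> re-prompt loop.
    The image model never edits its own output; only the prompt gets rewritten."""
    prompt = user_prompt
    image = None
    for _ in range(max_rounds):
        image = generate_image(prompt)          # e.g. a diffusion model call
        issues = critique(image, user_prompt)   # e.g. a vision-language model listing mismatches
        if not issues:                          # nothing left to fix, stop early
            break
        prompt = refine_prompt(prompt, issues)  # fold the critique back into the prompt
    return image
```

The point is that every round checks against the original user request, so the loop is less likely to drift from what was actually asked.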
Or maybe they could focus on inpainting specific areas that need improvement instead of redoing the whole image, which could save time.
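A rough variant of the same idea that only regenerates a masked region; again, locate_problem_region and inpaint are placeholder names standing in for a grounding/segmentation model and an inpainting model, not a specific library:

```python
def fix_with_inpainting(user_prompt, generate_image, critique, locate_problem_region, inpaint, max_rounds=3):
    """Hypothetical repair loop: keep the good parts, redo only the flagged region."""
    image = generate_image(user_prompt)
    for _ in range(max_rounds):
        issues = critique(image, user_prompt)        # same critique step as before
        if not issues:
            break
        mask = locate_problem_region(image, issues)  # e.g. a segmentation or grounding model
        image = inpaint(image, mask, user_prompt)    # regenerate only the masked area
    return image
```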
The divide between reasoning and image generation mostly comes down to technical limitations. Different models take different approaches, and that affects how they handle tasks like generating complex sequences of actions, which are probably harder in part because there's so little training data for them.
There's a model called Bagel that can reason before generating an image; it can also take image inputs and make adjustments to them, but I'm not sure how well it performs in practice.
It's totally annoying! If you know how AI works, you'd expect some of these hiccups, but you're left wondering why they can't just get it right on the first try.