I've been enjoying using GPT and other large language models for various tasks, including some quirky projects like archiving my coffee bean purchases over the years. I saved the empty bags, took some not-so-great pictures of them, and ran the images through several AIs, including GPT-4, GPT-3, and Gemini 2.5 Pro Exp. I specifically asked them to extract the information accurately, not to make anything up, and to leave a field blank rather than guess when they weren't sure. GPT-4 was really disappointing: it missed bags, misspelled basic names, and even invented tasting notes. When I tried to correct it, it only got worse and introduced more errors. Gemini, in contrast, performed exceptionally well, retrieving details quickly and accurately, though it too included some fabricated tasting notes. I'm perplexed by why there's such a huge difference in performance on a task that seems to boil down to OCR and image interpretation. Any insights?
2 Answers
Honestly, the issue might be related to how you’re prompting the AIs. If the prompts are vague or leave room for interpretation, models like GPT-4 tend to fill in those gaps with their own fabrications. Specifying exactly what you want can minimize these errors, but it seems challenging with complex tasks like yours!
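One way to make the prompt less open to interpretation is to ask for a fixed output schema with an explicit "unknown" value instead of letting the model improvise. Here's a minimal sketch using the OpenAI Python SDK; the model name, field names, prompt wording, and file path are just placeholders for illustration, not what the original poster used:

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "Extract the roaster, bean name, origin, roast date, and tasting notes "
    "from this photo of a coffee bag. Respond with JSON only, using exactly "
    "these keys: roaster, bean_name, origin, roast_date, tasting_notes. "
    'If you cannot read a field, set it to "unknown" -- do not guess.'
)

def extract_bag_info(image_path: str) -> str:
    # Encode the photo as base64 so it can be sent inline with the request.
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; swap in whichever vision model you're testing
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": PROMPT},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                    },
                ],
            }
        ],
        temperature=0,  # keep the output as deterministic as possible between runs
    )
    return response.choices[0].message.content

print(extract_bag_info("coffee_bag_01.jpg"))
```

Pinning the keys and spelling out the "unknown" rule doesn't eliminate hallucinations, but it usually makes them easier to spot and compare across models.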
It sounds like you've run into a pretty common issue! I've noticed that GPT-4 and similar models can start hallucinating more frequently after updates. Fine-tuning and model revisions can shift behavior in ways that make outputs more erratic and mistake-prone than they used to be, and it's something the vendors are still working out.
That's a good point! It seems like when handling multiple images at once, the model got overwhelmed and started making stuff up. I tried uploading one image at a time, something like the short loop sketched below, and that made a noticeable difference. Less confusion, more accurate recognition!
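For what it's worth, a one-image-per-request loop is easy to script. This is a rough sketch that reuses the `extract_bag_info` helper from the answer above; the folder name and output file are made up for illustration:

```python
import json
from pathlib import Path

# One request per photo keeps each extraction independent, so one hard-to-read
# bag can't contaminate the results for the others.
results = []
for image_path in sorted(Path("coffee_bags").glob("*.jpg")):
    raw = extract_bag_info(str(image_path))  # helper sketched in the answer above
    try:
        results.append({"file": image_path.name, **json.loads(raw)})
    except json.JSONDecodeError:
        # Keep the raw text so a malformed response is easy to spot and retry.
        results.append({"file": image_path.name, "raw_response": raw})

Path("coffee_bags.json").write_text(json.dumps(results, indent=2))
```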