Why Do Top LLMs Struggle with a Simple Comparison Question?

Asked By CleverBanana99 On

I recently tested three leading language models, O3, Sonnet 3.7, and Gemini 2.5 Pro, with a straightforward question: "Which is larger, 9.9 or 9.11?" Surprisingly, one of them had a tough time giving the right answer. This raises an interesting question about how different LLMs interpret simple numerical comparisons and why they can struggle with them. In this case, the models seem to read 9.11 differently: some treat it as a version number rather than a decimal, so the comparison stops being purely mathematical. I wrote a blog post that dives deeper into the topic. You can check it out [here](https://tryaii.com/blog/llms-decimal-comparison-problem) and try it yourself with the models [here](https://tryaii.com/compare?prompt=Which+number+is+greater%2C+9.9+or+9.11%3F&models=claude-3-7-sonnet-20250219%2Co3%2Cgemini-2.5-flash-preview-04-17). In short, Claude Sonnet 3.7 seems to have a problematic track record with this kind of question, and earlier Gemini and OpenAI models also trip up at times.
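To make the two readings concrete, here's a quick Python sketch of my own (not from the blog post) that compares the numbers as decimals versus as dot-separated version parts:

```python
def version_tuple(s: str) -> tuple[int, ...]:
    """Split a dot-separated version string into integer parts, e.g. "9.11" -> (9, 11)."""
    return tuple(int(part) for part in s.split("."))

# Decimal reading: 9.9 is larger, since 0.9 > 0.11
print(9.9 > 9.11)                                     # True

# Version-style reading: "9.11" outranks "9.9", since minor part 11 > 9
print(version_tuple("9.11") > version_tuple("9.9"))   # True
```

Both comparisons print True, which is exactly the ambiguity: under the version reading, 9.11 really is the "larger" one.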

7 Answers

Answered By PleasedReader63 On

Hey, I just read your blog, and it's really well done! You laid out the complexities of LLMs without getting too technical, which is a tough balance to strike. A friendly suggestion: adding some technical detail to your Reddit posts could help you stand out, since many self-proclaimed 'AI experts' miss the mark on key concepts.

Answered By CriticalObserver77 On

Your blog misses the point. These reasoning models are mostly geared towards multi-step challenges, not a one-step comparison like this! It's still surprising that they struggle, though, since these numbers tokenize quite simply.

Answered By SnarkyPenguin45 On

Wow, 9.9 vs. 9.11 is clearly a thrilling conundrum! Joking aside, it shows how much context matters with AI: 'larger' can mean different things without clear direction on what you're asking, and models struggle with ambiguous queries.

Answered By CriticalThinker56 On

If you ask LLMs really basic or generic questions, you might end up with unsatisfactory answers. It’s all about how specific you can be with context, right? Many issues with current models stem from their inability to tackle these types of open-ended prompts effectively.

Answered By ThoughtfulSage01 On

The fact that these models don't consistently answer correctly doesn't really reflect their overall capabilities. It's interesting that some of them read the question as being about versioning, reasoning that version 9.11 comes after 9.9, while others treat it strictly as a mathematical comparison.

ChattyBee47 -

Yeah, exactly! One of the models even broke it down mathematically instead of considering it as a versioning issue. It's fascinating how their training data can influence these interpretations.

Answered By TechieGuru88 On

Honestly, I think Sonnet 3.7 is the most robust across various tasks, though! It's great for coding and does well in many other scenarios.

CasualCoder99 -

I agree, especially for coding tasks! It's still my go-to for those.

Answered By CuriousMind22 On

This discussion has been around for quite a while. Interpretability research has shown that circuits which respond to superficially similar questions can get mixed up. The circuits trained on things like Bible verse references (where 9:11 comes after 9:9) might get activated by a question like this, so the model defaults to a versioning reading rather than a purely mathematical one.

GeeWhiz88 -

I guess I thought with all the advancements, models would handle this better. It's still surprising!
