I'm curious about the benchmarks everyone uses to evaluate the performance of different large language models (LLMs). I recently went back to using ChatGPT after trying Claude for a bit, and I've noticed there are tons of models available now. When you're looking to compare their abilities, especially for tasks like coding or writing, what benchmarks do you typically refer to?
2 Answers
I've heard good things about LiveBench and Aider Polyglot. They seem to be solid choices for coding tasks, especially for complex scenarios!
Honestly, I don't put much stock in benchmarks. What matters to me is how well a model performs in my specific use cases. If a model flunks some benchmarks but excels at the languages I use, I'm happy with it! I just try out models until I find one that suits me (quick sketch below).
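For anyone who wants to do the same, here's a minimal sketch of that kind of side-by-side test. It assumes the `openai` Python package pointed at an OpenAI-compatible endpoint with an API key set in the environment; the model names and prompts are just placeholders, so swap in whatever matches your own use cases.

```python
# Minimal side-by-side test: run the same prompts against several models
# and eyeball the outputs. Assumes `pip install openai` and OPENAI_API_KEY
# set in the environment; model names below are placeholder examples.
from openai import OpenAI

client = OpenAI()

# Swap these for the models and prompts that match your own use cases.
models = ["gpt-4o-mini", "gpt-4o"]
prompts = [
    "Write a Python function that merges two sorted lists.",
    "Summarize the trade-offs between REST and gRPC in three bullets.",
]

for prompt in prompts:
    print(f"=== Prompt: {prompt}")
    for model in models:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        print(f"--- {model}:\n{response.choices[0].message.content}\n")
```

Nothing fancy, but comparing raw outputs on your own prompts tells you more than a leaderboard rank ever will.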
Exactly! Benchmarks can be misleading if they don't align with what you actually need.
Right? Performance on your specific tasks is what really counts.