I'm curious about the benchmarks everyone uses to evaluate the performance of different large language models (LLMs). I recently went back to using ChatGPT after trying Claude for a bit, and I've noticed there are tons of models available now. When you're looking to compare their abilities, especially for tasks like coding or writing, what benchmarks do you typically refer to?
3 Answers
I usually check MMLU for general knowledge (multiple-choice questions across a wide range of subjects) and GSM8K for math reasoning (grade-school word problems). For coding, HumanEval is a solid starting point: it scores models on completing short Python functions against unit tests. It really depends on what skills you're looking for in a model, though!
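To give a feel for what HumanEval actually measures, here's a made-up task in the same style (not one of the real problems from the benchmark): the model is shown the signature and docstring, writes the body, and the harness scores the completion by running hidden unit tests, reported as pass@k.

```python
# A made-up, HumanEval-style task (illustrative only, not an actual benchmark problem).
# The model receives the signature + docstring and must complete the body;
# the harness then runs unit tests it never showed the model.

def count_vowels(text: str) -> int:
    """Return the number of vowels (a, e, i, o, u) in `text`, case-insensitive."""
    return sum(1 for ch in text.lower() if ch in "aeiou")  # a model-written completion

# The kind of checks the harness would run against the completion:
assert count_vowels("Hello World") == 3
assert count_vowels("xyz") == 0
```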
Yeah, I focus on coding as well and I've found HumanEval super helpful in that regard.
I've heard good things about LiveBench and Aider Polyglot. From what I understand, LiveBench regularly refreshes its questions to limit training-data contamination, and Aider Polyglot tests code editing across several programming languages, so both seem like solid choices for more realistic, complex coding scenarios.
Honestly, I don't put much stock in benchmarks. What matters to me is how well a model performs on my specific use cases. If a model flunks some benchmarks but excels in the languages I use, I'm happy with it. I just try out models until I find one that suits me.
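If you want to make the "just try it on your own tasks" approach a bit more repeatable, here's a rough sketch of a tiny personal eval script. The `query_model` function is a placeholder (swap in whichever API or local runner you actually use), and the prompts and checks are purely illustrative.

```python
# Minimal personal "benchmark": a handful of your own prompts plus simple checks.
# query_model is a stand-in; results only mean something once a real model is wired in.

def query_model(prompt: str) -> str:
    # Placeholder so the script runs end to end; replace with your provider's API call.
    return "def add(a, b):\n    return a + b"

# (prompt, check) pairs drawn from tasks you actually care about.
cases = [
    ("Write a Python function add(a, b) that returns the sum.",
     lambda out: "def add" in out and "return" in out),
    ("Reply with exactly the word OK.",
     lambda out: out.strip() == "OK"),
]

passed = 0
for prompt, check in cases:
    output = query_model(prompt)
    ok = check(output)
    passed += ok
    print(f"{'PASS' if ok else 'FAIL'}: {prompt[:50]}")

print(f"{passed}/{len(cases)} personal checks passed")
```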
Right? Performance on your own tasks is what really counts.
Exactly! Benchmarks can be misleading if they don't align with what you actually need.
Totally agree, those benchmarks give a nice overview! But I also think trying the models myself reveals a lot about how well they'll fit my needs.