The question to ask when choosing an LLM is ‘what’s the right model for the task?’. There are two basic aspects that might contribute to this decision:
- Fundamental model attributes
  - Parameters
  - Context length
  - Pricing
  - Also consider
    - Rate limits
    - Speed
    - Latency
    - Licence - e.g. commercial restrictions
      - Usage restrictions
      - Revenue caps
      - Legal agreements
- Benchmarks
  - Performance metrics
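Pricing is usually quoted per million input and output tokens, so a rough monthly figure is simple arithmetic. The sketch below is illustrative only: the `Pricing` class, the `estimate_monthly_cost` function and the prices themselves are hypothetical and do not correspond to any real provider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical prices in US dollars per million tokens."""
    input_per_million: float
    output_per_million: float


def estimate_monthly_cost(pricing, requests_per_day, input_tokens, output_tokens, days=30):
    """Estimate spend from the average token counts of a single request."""
    per_request = (
        input_tokens * pricing.input_per_million
        + output_tokens * pricing.output_per_million
    ) / 1_000_000
    return per_request * requests_per_day * days


# Made-up prices: $1 per million input tokens, $4 per million output tokens.
example = Pricing(input_per_million=1.0, output_per_million=4.0)
cost = estimate_monthly_cost(example, requests_per_day=1000, input_tokens=2000, output_tokens=500)
print(f"Estimated monthly cost: ${cost:.2f}")
```

Running the same calculation for each candidate model, using the token counts of a typical request for the task, gives a like-for-like cost comparison.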
Approach
- Develop a short list
  - Identify those that you can actually use - e.g. consider pricing
  - Chat vs reasoning vs hybrid models
  - Knowledge cutoff dates (these determine what is in the model’s training data, and so indicate whether additional information will need to be provided to the model)
- Look at model information and benchmark results
  - Sources
    - Model cards published by providers
    - Leaderboards
    - Arenas
  - Capabilities
    - Coding
    - Language understanding
    - Reasoning
- Sources
A good place to look for benchmark information is https://artificialanalysis.ai/evaluations:
- Humanity’s Last Exam - expert-level questions across many academic subjects, intended to sit at the frontier of human knowledge
- GPQA Diamond Leaderboard - difficult questions across physics, chemistry and biology (‘Google-proof’)
- MMLU Pro - language understanding
- AIME - mathematical problem-solving ability, based on competition maths questions
- LiveCodeBench - the model’s ability to write code based on specific problems
- MuSR - a model’s ability to ‘think its way through a problem’
Note that benchmarks have limitations. They can suffer from training data contamination, for example.
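The shortlisting step under Approach can be sketched as a filter over the hard constraints, applied before any benchmark comparison. This is a minimal sketch, not a definitive method: the `Candidate` fields, the `shortlist` function, the model names and all of the figures are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record of the attributes used for shortlisting."""
    name: str
    context_length: int            # maximum context window, in tokens
    input_price: float             # USD per million input tokens
    commercial_use_allowed: bool   # does the licence permit commercial use?


def shortlist(candidates, min_context, max_input_price, need_commercial):
    """Keep only models that meet every hard constraint, cheapest first."""
    keep = []
    for c in candidates:
        if c.context_length < min_context:
            continue
        if c.input_price > max_input_price:
            continue
        if need_commercial and not c.commercial_use_allowed:
            continue
        keep.append(c)
    return sorted(keep, key=lambda c: c.input_price)


# Made-up candidates; none of these names or numbers describe real models.
candidates = [
    Candidate("model-a", 128_000, 3.0, True),
    Candidate("model-b", 8_000, 0.2, True),
    Candidate("model-c", 200_000, 1.0, False),
    Candidate("model-d", 32_000, 0.5, True),
]

usable = shortlist(candidates, min_context=16_000, max_input_price=5.0, need_commercial=True)
print([c.name for c in usable])
```

Only the models that survive this filter then need to be compared on model cards, leaderboards and arenas for the capabilities the task requires.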