What Makes a Good AI Benchmark?
This brief presents a novel assessment framework for evaluating the quality of AI benchmarks and scores 24 benchmarks against the framework.
Key Takeaways
- The rapid advancement and proliferation of AI systems, including foundation models, have catalyzed the widespread adoption of AI benchmarks—yet very little research to date has evaluated the quality of those benchmarks in a structured manner.
- We reviewed benchmarking literature and interviewed expert stakeholders to define what makes a high-quality benchmark, and developed a novel assessment framework for evaluating AI benchmarks based on 46 criteria across five benchmark life-cycle phases.
- In scoring 24 AI benchmarks, we found large quality differences between them, including among benchmarks widely relied on by developers and policymakers. Most benchmarks score highest at the design stage and lowest at the implementation stage.
- Policymakers should encourage developers, companies, civil society groups, and government organizations to articulate benchmark quality when conducting or relying on AI model evaluations, and to consult best practices for minimum quality assurance.