Hugging Face embeds results from 200+ benchmarks directly on model cards
Model pages now display Every Eval Ever results inline, consolidating performance data across 200+ academic and community benchmarks without requiring practitioners to hunt across external leaderboards.

Hugging Face model cards now display Every Eval Ever benchmark results inline, bringing performance data from more than 200 academic and community evaluations into a single consolidated view. Every Eval Ever is a community-driven benchmark aggregator that runs models against tasks spanning reasoning, instruction-following, code generation, and multimodal understanding. The integration means practitioners no longer need to hunt across scattered leaderboards or reproduce evals locally to compare candidates.
Results appear as a dedicated section on the model card, with breakdowns by task category and links to detailed metric tables. Models that haven't been evaluated yet show a prompt to request a run. Every Eval Ever itself is maintained by independent researchers who publish eval code and results openly, drawing from both established academic sets like MMLU and GSM8K and newer community-contributed tasks that test edge cases or domain-specific capabilities. Hugging Face is hosting the integration but does not control which benchmarks are included or how they're weighted.
The system is designed to scale as new evals are added; coverage will expand to include vision-language and audio tasks in the coming months. The next step is tighter filtering: users want to sort models by performance on specific task subsets or compare only within a parameter-count band. That functionality is on the roadmap but not yet live, so for now the card view is display-only, with no sorting or filtering. Watch for updates as the eval catalog grows and filtering options roll out over the next quarter.
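Until filtering ships, the kind of comparison users are asking for can be approximated locally on scores copied from the cards by hand. The Python sketch below is only an illustration under stated assumptions: the `ModelScore` record, the `rank_in_band` helper, the model names, and every number in it are hypothetical, and it does not call any Hugging Face or Every Eval Ever API. It keeps models inside a parameter-count band, drops any model missing a score on the chosen task subset, and ranks the rest by their mean score on that subset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelScore:
    """Hypothetical record of scores copied by hand from a model card."""
    model_id: str   # placeholder name, not a real repository
    params_b: float  # parameter count, in billions
    scores: dict    # benchmark name -> score (made-up values here)


def rank_in_band(entries, tasks, min_params_b, max_params_b):
    """Rank models in a parameter band by mean score on a task subset.

    Models missing any of the requested tasks are skipped, so every
    mean is computed over the same set of benchmarks.
    """
    ranked = []
    for entry in entries:
        if not (min_params_b <= entry.params_b <= max_params_b):
            continue  # outside the parameter-count band
        if any(task not in entry.scores for task in tasks):
            continue  # not evaluated on the full subset
        mean = sum(entry.scores[task] for task in tasks) / len(tasks)
        ranked.append((entry.model_id, mean))
    # Highest mean first.
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Made-up example data; none of these are real models or real results.
EXAMPLE_ENTRIES = [
    ModelScore("example/model-a", 7.0, {"MMLU": 60.0, "GSM8K": 40.0}),
    ModelScore("example/model-b", 8.0, {"MMLU": 70.0, "GSM8K": 50.0}),
    ModelScore("example/model-c", 70.0, {"MMLU": 80.0, "GSM8K": 90.0}),
    ModelScore("example/model-d", 7.0, {"MMLU": 65.0}),
]

if __name__ == "__main__":
    for model_id, mean in rank_in_band(EXAMPLE_ENTRIES, ["MMLU", "GSM8K"], 6.0, 9.0):
        print(f"{model_id}: {mean:.1f}")
```

Skipping models with missing scores, rather than averaging over whatever is available, keeps the ranking comparable across candidates at the cost of excluding models that have not been evaluated on every task in the subset.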




