This website is a collection of two things:
- HumanEvalPro: A benchmark for evaluating the performance of Large Language Models (LLMs) on the task of code generation.
- EvalProSearch: A tool to assist researchers and developers in finding the appropriate dataset for a model given the task it was designed for.
📊HumanEvalPro
HumanEvalPro is an enhanced foundation for the family of benchmarks built around the original HumanEval benchmark. It resolves the known issues of the original dataset, so that corrected variants can be generated from this repository. This is crucial, as variants derived from the original HumanEval generally inherit the following problems:
- Variants that cover multiple programming languages reproduce the original dataset's issues.
- Variants that added tests used the original, incorrect canonical solutions to generate the expected outputs.
- Variants based on human corrections or translations are inconsistent.
This new version, HumanEvalPro, provides improved documentation, correct canonical solutions, and substantially better test coverage. Many problems are now also more complex: the revised problem descriptions and tests require models to reason about edge cases, much as a careful software engineer would.
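To make the edge-case point concrete, here is a minimal, hypothetical sketch of what a HumanEvalPro-style problem could look like. The task, canonical solution, and tests below are invented for illustration and are not taken from the dataset itself:

```python
# Hypothetical example: a precise docstring, a correct canonical solution,
# and tests that probe edge cases the original benchmark's tests often missed.

def second_largest(nums: list[int]) -> int | None:
    """Return the second-largest distinct value in nums, or None if it does not exist.

    Duplicates count once: second_largest([3, 3, 2]) == 2.
    Lists with fewer than two distinct values return None.
    """
    distinct = sorted(set(nums), reverse=True)
    return distinct[1] if len(distinct) >= 2 else None


def check(candidate):
    assert candidate([1, 5, 3]) == 3          # typical case
    assert candidate([]) is None              # edge case: empty input
    assert candidate([7]) is None             # edge case: single element
    assert candidate([4, 4, 4]) is None       # edge case: no second distinct value
    assert candidate([3, 3, 2]) == 2          # edge case: duplicates count once
    assert candidate([-2, -5, -1]) == -2      # edge case: negative values


check(second_largest)
```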
🔎EvalProSearch
EvalProSearch is a sophisticated search engine that helps researchers and developers find the appropriate dataset for a model given the task it was designed for.
It supports search by task, similarity to other papers, and more.
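As a purely illustrative sketch of the kind of query this enables, the snippet below combines a task filter with a similar-paper filter. The endpoint, parameter names, and response fields are assumptions made for illustration and are not the actual EvalProSearch interface:

```python
# Hypothetical example only: the URL, parameters, and response shape are
# assumptions for illustration, not the real EvalProSearch API.
import requests

query = {
    "task": "code generation",          # filter datasets by the task they target
    "similar_to": "arXiv:2107.03374",   # rank by similarity to a given paper (here, HumanEval)
    "limit": 5,
}
response = requests.get("https://evalprosearch.example/api/search", params=query)
for hit in response.json().get("results", []):
    print(f"{hit.get('dataset')}: {hit.get('task')}")
```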