This website is a collection of two things:
- HumanEvalPro: A benchmark for evaluating the performance of Large Language Models (LLMs) on the task of code generation.
- EvalProSearch: A tool to assist researchers and developers in finding the appropriate dataset for a model given the task it was designed for.
📊HumanEvalPro
HumanEvalPro is an enhanced foundation for the family of benchmarks built around the original HumanEval benchmark. It resolves the known issues of the original dataset, so that corrected variants can be generated from this repository. This is crucial, as variants derived from the original HumanEval generally inherit the following problems:
- Variants that cover multiple programming languages reproduce the original dataset's issues.
- Variants that added tests used the original, incorrect canonical solutions to generate the expected outputs.
- Variants based on human corrections or translations are inconsistent.
This new version, HumanEvalPro, provides improved documentation, correct canonical solutions, and substantially better test coverage. Many problems are now also more complex: the revised problem descriptions and tests require models to reason about edge cases, much as a careful software engineer would.
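To make the edge-case point concrete, here is a minimal, hypothetical sketch of what a HumanEvalPro-style problem could look like. The task, canonical solution, and tests below are invented for illustration and are not taken from the dataset itself:

```python
# Hypothetical example: a precise docstring, a correct canonical solution,
# and tests that probe edge cases the original benchmark's tests often missed.

def second_largest(nums: list[int]) -> int | None:
    """Return the second-largest distinct value in nums, or None if it does not exist.

    Duplicates count once: second_largest([3, 3, 2]) == 2.
    Lists with fewer than two distinct values return None.
    """
    distinct = sorted(set(nums), reverse=True)
    return distinct[1] if len(distinct) >= 2 else None


def check(candidate):
    assert candidate([1, 5, 3]) == 3          # typical case
    assert candidate([]) is None              # edge case: empty input
    assert candidate([7]) is None             # edge case: single element
    assert candidate([4, 4, 4]) is None       # edge case: no second distinct value
    assert candidate([3, 3, 2]) == 2          # edge case: duplicates count once
    assert candidate([-2, -5, -1]) == -2      # edge case: negative values


check(second_largest)
```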
🔎EvalProSearch
EvalProSearch is a sophisticated search engine that helps researchers and developers find the appropriate dataset for a model given the task it was designed for.
It supports search by task, similarity to other papers, and more.
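As a purely illustrative sketch of the kind of query this enables, the snippet below combines a task filter with a similar-paper filter. The endpoint, parameter names, and response fields are assumptions made for illustration and are not the actual EvalProSearch interface:

```python
# Hypothetical example only: the URL, parameters, and response shape are
# assumptions for illustration, not the real EvalProSearch API.
import requests

query = {
    "task": "code generation",          # filter datasets by the task they target
    "similar_to": "arXiv:2107.03374",   # rank by similarity to a given paper (here, HumanEval)
    "limit": 5,
}
response = requests.get("https://evalprosearch.example/api/search", params=query)
for hit in response.json().get("results", []):
    print(f"{hit.get('dataset')}: {hit.get('task')}")
```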