Ada Eval
Framework for evaluating LLM based tools for Ada/SPARK use cases.
Setup
To get started with this project, you will need the tools described below installed on your system.
You will also need an Ada toolchain and GNATprove installed on your system. Some options for doing this are covered in the ANCR, Alire, and manual setup sections below.
Install uv
uv should not be installed in the project's virtual environment. We can use pipx to install it globally, while still keeping it isolated in its own virtual environment. For alternative methods, see the official docs.
# Using official script:
curl -LsSf https://astral.sh/uv/install.sh | sh
# Using pipx
pipx install uv
# Using brew
brew install uv
# Using cargo
cargo install --git https://github.com/astral-sh/uv uv
uv can update itself by running:
uv self update
Optionally, you can also install completions for uv and uvx:
# Determine your shell (e.g., with `echo $SHELL`), then run one of:
# For bash
echo 'eval "$(uv generate-shell-completion bash)"' >> ~/.bashrc
echo 'eval "$(uvx --generate-shell-completion bash)"' >> ~/.bashrc
# For zsh
echo 'eval "$(uv generate-shell-completion zsh)"' >> ~/.zshrc
echo 'eval "$(uvx --generate-shell-completion zsh)"' >> ~/.zshrc
# For fish
echo 'uv generate-shell-completion fish | source' >> ~/.config/fish/config.fish
echo 'uvx --generate-shell-completion fish | source' >> ~/.config/fish/config.fish
# For elvish
echo 'eval (uv generate-shell-completion elvish | slurp)' >> ~/.elvish/rc.elv
echo 'eval (uvx --generate-shell-completion elvish | slurp)' >> ~/.elvish/rc.elv
See docs for more info about uv.
ANCR Setup
Note that this won't work on macOS systems.
If you haven't already, you will need to clone ANCR. You should then run:
./bin/ancr all mcp
./bin/ancr shell mcp
code ../ada-eval
uv sync
You should now have a working Ada toolchain and GNATprove installed and in your environment.
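If you want to double-check, the standard version queries for the GNAT tools should all succeed from inside the ANCR shell (these are plain toolchain commands, not part of ada-eval):
# Sanity check: confirm the Ada toolchain and GNATprove are on PATH
gnat --version
gprbuild --version
gnatprove --version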
Alire Setup
If you're using macOS, you can use Alire to set up your environment by running:
alr install gnat_native gnatprove gprbuild
You must ensure that ~/.alire/bin/ is in your PATH.
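For example, assuming the default zsh shell on macOS, adding something like the following to your shell profile is enough (adjust the file name for your shell):
# Put Alire-installed tools (gnat, gnatprove, gprbuild) on PATH
echo 'export PATH="$HOME/.alire/bin:$PATH"' >> ~/.zshrc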
Manual Setup
Figure out how to install everything yourself.
Per-clone setup
Do this when you first clone the project. Use uv to install the Python dependencies (this will automatically create a virtual environment):
uv sync
By default, uv will create a virtual environment named .venv in your project. If you're using VS Code, you can use Python: Select Interpreter to select Python from the virtual environment generated by uv. This will enable code navigation.
If you don't use VS Code, you can use the standard source .venv/bin/activate (or the equivalent for your shell) to activate the virtual environment.
Alternatively, you can run commands in the project's virtual environment without activating it by running uv run ... (e.g., uv run pytest).
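For example, either of the following runs the testsuite from a fresh clone (after uv sync):
# Option 1: activate the virtual environment created by uv, then run tools directly
source .venv/bin/activate
pytest
# Option 2: let uv run the command inside the environment without activating it
uv run pytest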
Project Development Info
Project structure
At a high level, the project is structured as follows:
.
├── data                 # Data directory containing various datasets
│   ├── base             # Base data directory
│   │   ├── compacted    # Contains datasets in single jsonl files
│   │   └── expanded     # Contains expanded datasets. Much easier to read and modify vs the jsonl files
│   ├── evaluated        # Datasets that contain generated completions and corresponding evaluation results
│   └── generated        # Datasets that contain generated completions
├── src                  # Source code for ada-eval
├── tests                # Tests for ada-eval
├── tools                # Tools used by ada-eval during generation
├── Makefile             # Contains various commands for convenience
├── pyproject.toml       # Defines project dependencies and configuration
├── README.md            # This file
└── uv.lock              # Lock file for dependencies managed by uv
Common Commands
To format your code, run:
make format
To run linter/type checker, run:
make check
To run the testsuite run:
make test
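A typical pre-push check simply chains these targets:
# Format, lint/type-check, and test in one go
make format && make check && make test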
Adding or Updating Python Dependencies
To add a new Python dependency using uv, run
uv add <package-name>
uv sync
For more info see docs.
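For example (the package names below are illustrations only, not project dependencies):
# Add a runtime dependency
uv add requests
# Add a development-only dependency
uv add --dev pytest-cov
uv sync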
Our GitLab CI uses an internal package registry for Python dependencies, so the following additional steps are required if the new package(s) are not already in our registry.
- For each file which needs to be added to our package registry, find its URL (which will start with https://files.pythonhosted.org/packages) in uv.lock, and download it to a temporary directory with curl -L -O --output-dir <temp_directory> <url>
- Upload the files with TWINE_USERNAME=oauth2 TWINE_PASSWORD=<personal_access_token> uvx twine upload --repository-url https://<gitlab_host>/api/v4/projects/<project_id_or_percent_encoded_path>/packages/pypi --skip-existing <temp_directory>/*
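A sketch of the whole mirroring step, assuming the wheel URLs appear as https://files.pythonhosted.org/packages/... entries in uv.lock and that you substitute the placeholders for your package and GitLab instance:
# Locate the download URL(s) for the new package in uv.lock
grep 'files.pythonhosted.org/packages' uv.lock | grep '<package-name>'
# Download each file to a temporary directory
TMP_DIR=$(mktemp -d)
curl -L -O --output-dir "$TMP_DIR" <url>
# Upload everything to the internal registry
TWINE_USERNAME=oauth2 TWINE_PASSWORD=<personal_access_token> \
  uvx twine upload \
  --repository-url https://<gitlab_host>/api/v4/projects/<project_id_or_percent_encoded_path>/packages/pypi \
  --skip-existing "$TMP_DIR"/*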
Usage
The project has a simple CLI, as well as a makefile that contains some useful commands for your convenience.
You can see the CLI options by running:
uv run ada-eval --help
If you look at the make target generate-spark-claude, you will see an example of how to generate completions for our existing challenges, using Claude Code.
make generate-spark-claude
Adding a new Sample (Challenge/Problem)
In data/base/expanded, you will find directories that each represent a dataset, each containing a number of samples (challenges/problems). A sample's structure depends on the type of its dataset. Currently we only have SPARK datasets, though there is some support for Explain and Ada datasets too.
The general idea of each of these types is:
- Ada: Will require the modification of some Ada code to solve the challenge/problem
- SPARK: Will require the modification of some SPARK code to solve the challenge/problem. This could include adding new SPARK contracts, modifying existing ones, or changing the implementation to satisfy the given challenge.
- Explain: Contains some Ada/SPARK code and a question about it. The goal is to generate an explanation of the code that answers the question.
The rough structure of a SPARK sample is as follows:
example_sample
├── base           # The project for the challenge. This will be duplicated somewhere for a tool to attempt to modify to solve the challenge
│   ├── main.gpr
│   └── src
├── comments.md    # Comments about the sample for humans to better understand it
├── other.json     # Contains additional metadata for the sample. For SPARK samples, this will include the name of the subprogram of interest, and the relative path to the file that contains its specification.
├── prompt.md      # The prompt that is provided to the tool to specify the challenge/problem
├── solution       # A solution to the challenge/problem. For SPARK samples, this will be a project that can be built, passes the tests, and gnatprove is happy with.
│   ├── main.gpr
│   ├── main.adc
│   └── src
└── tests          # Unit test project for the sample. Used in addition to gnatprove to verify that the solution is correct.
    ├── src
    └── tests.gpr
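If you want to poke at a sample by hand, the solution and tests can also be exercised directly with the GNAT tools. A sketch, assuming the layout above and the tests/bin/tests convention described below (the make targets in the next section remain the supported workflow):
# Build and prove a sample's solution, then run its unit tests
cd data/base/expanded/<dataset>/<sample>/solution
gprbuild -P main.gpr
gnatprove -P main.gpr --report=all
cd ../tests
gprbuild -P tests.gpr
./bin/tests && echo "tests passed"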
Adding a new SPARK sample
- Create a new directory in data/base/expanded/your_dataset_of_choice with the name of the sample.
- Add the following:
  - base/: A project that contains the code that needs to be modified to solve the challenge/problem
  - solution/: A complete project that contains the modified code with the solved problem
  - tests/: A project that contains unit tests for the sample. It should compile to an executable at tests/bin/tests which returns a zero exit code if (and only if) the tests pass.
  - comments.md: If the challenge isn't obvious, consider including a more detailed description of the challenge/problem in this file
  - prompt.md: A prompt that describes the challenge/problem. This will be provided to the tool to generate a solution.
  - other.json: A file that contains additional metadata about the sample. For SPARK samples, this should look like:

    {
      "location": {
        "path": "src/relative/path/to/file.ads",      // Should be relative to the base/solution directory of the sample
        "subprogram_name": "Name_Of_Subprogram"       // The name of the subprogram that is the focus of the challenge/problem
      },
      "required_checks": [
        {
          "rule": "VC_POSTCONDITION",                 // Rule name of a check that must be proved
          "entity_name": "Integer_Utils.Increment",   // Optional name of the entity the check must be attached to
          "src_pattern": "Increment'Result = X \\+ 1;" // Optional regex pattern which must match the proved source code
        },
        ...
      ]
    }

    The "path" from the "location" is used as the argument to gnatprove, so only the corresponding unit, and any it (recursively) depends on, are analysed during evaluation.

    The "required_checks" list is optional, and unnecessary for most samples, since the proof and/or tests will simply fail unless the problem is solved. However, for checks like postconditions, a desperate agent may "fix" the proof failure by simply removing or relaxing the condition. Where the proof of a postcondition or similar is a key part of a challenge/problem, a corresponding required_checks entry should be added to ensure that such "fixes" are not erroneously evaluated as correct. The "src_pattern" must match starting from the location reported by gnatprove (run gnatprove on the solution with --report=all for an output which includes the line and column numbers of these locations).
- Verify that the solution builds, proves, etc. by running make evaluate-canonical to populate other.json, and then make check-datasets to verify everything is correct (see the sketch after this list).
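Putting it together, a minimal sketch for starting a new sample (the dataset and sample names are placeholders; the projects, prompt, and metadata still need to be written by hand):
# Create the skeleton for a new SPARK sample
SAMPLE=data/base/expanded/<your_dataset>/<your_sample>
mkdir -p "$SAMPLE"/base/src "$SAMPLE"/solution/src "$SAMPLE"/tests/src
touch "$SAMPLE"/prompt.md "$SAMPLE"/comments.md "$SAMPLE"/other.json
# Once the sample is filled in, populate other.json and check the dataset
make evaluate-canonical
make check-datasets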
Generating new completions
To generate new completions for one or multiple datasets, you can use the generate command of the CLI. For example, to generate completions for all datasets using Claude Code, you can run:
uv run ada-eval generate --tool shell_script --tool-config-file tools/configs/claude_code_no_mcp.json
This is also available via the shortcut make generate-spark-claude.
Currently you have to specify the type of tool you want to use and a configuration file for that tool. The configuration files provide a place for modifying settings; at the moment there are only shell-script tools, and the only settings are the location of the script and how long it is allowed to run.
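To see what such a configuration looks like in practice, inspect the example used by the make shortcut:
# The shell-tool configuration referenced by generate-spark-claude
cat tools/configs/claude_code_no_mcp.json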
This interface is not final.
Evaluating completions
To evaluate completions that were generated as described in the previous section, you can run
uv run ada-eval evaluate
which will run all available evaluations on the generations in data/generated/, and save the results to data/evaluated/. This is also available via the shortcut make evaluate.
For more options, such as specifying which evaluations are run, see the output of
uv run ada-eval evaluate --help
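For example, a complete generate-then-evaluate run using the existing Claude Code configuration (or the equivalent make shortcuts):
# Generate completions with Claude Code, then evaluate them
uv run ada-eval generate --tool shell_script --tool-config-file tools/configs/claude_code_no_mcp.json
uv run ada-eval evaluate
# Equivalent make shortcuts
make generate-spark-claude
make evaluate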
This interface is not final.