Ada Eval

Framework for evaluating LLM-based tools for Ada/SPARK use cases.

Setup

To get started with this project, you will need uv installed on your system (see Install uv below).

You will also need an Ada toolchain and GNATprove. Some options for setting these up:

  1. ANCR
  2. Alire
  3. Manual

Install uv

uv should not be installed in the project's virtual environment. One option is pipx, which installs uv globally while still keeping it in its own isolated virtual environment; some other common methods are shown below. For alternatives, see the official docs.

# Using official script:
curl -LsSf https://astral.sh/uv/install.sh | sh

# Using pipx
pipx install uv

# Using brew
brew install uv

# Using cargo
cargo install --git https://github.com/astral-sh/uv uv

uv can update itself by running:

uv self update

Optionally, you can also install completions for uv and uvx:

# Determine your shell (e.g., with `echo $SHELL`), then run one of:

# For bash
echo 'eval "$(uv generate-shell-completion bash)"' >> ~/.bashrc
echo 'eval "$(uvx --generate-shell-completion bash)"' >> ~/.bashrc

# For zsh
echo 'eval "$(uv generate-shell-completion zsh)"' >> ~/.zshrc
echo 'eval "$(uvx --generate-shell-completion zsh)"' >> ~/.zshrc

# For fish
echo 'uv generate-shell-completion fish | source' >> ~/.config/fish/config.fish
echo 'uvx --generate-shell-completion fish | source' >> ~/.config/fish/config.fish

# For elvish
echo 'eval (uv generate-shell-completion elvish | slurp)' >> ~/.elvish/rc.elv
echo 'eval (uvx --generate-shell-completion elvish | slurp)' >> ~/.elvish/rc.elv

See docs for more info about uv.

ANCR Setup

Note that this won't work on macOS systems.

If you haven't already, you will need to clone ANCR. You should then run:

./bin/ancr all mcp
./bin/ancr shell mcp
code ../ada-eval
uv sync

You should now have a working Ada toolchain and GNATprove installed and in your environment.
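
As a quick optional sanity check (not part of the ANCR flow itself), you can confirm that GNATprove is visible in the resulting shell:

# Optional check that the toolchain is on your PATH
which gnatprove
gnatprove --version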

Alire Setup

If you're using macOS, you can use Alire to set up your environment by running:

alr install gnat_native gnatprove gprbuild

You must ensure that ~/.alire/bin/ is in your PATH.
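
For example, with a bash shell you could append it in ~/.bashrc (a sketch only; adjust the file and syntax for your shell):

# Make Alire-installed tools available in new shells (bash example)
echo 'export PATH="$HOME/.alire/bin:$PATH"' >> ~/.bashrc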

Manual Setup

Figure out how to install everything yourself πŸ‘

Per-clone setup

Do this when you first clone the project. Use uv to install Python dependencies (this will automatically create a virtual environment):

uv sync

By default, uv will create a virtual environment named .venv in your project. If you're using VS Code, you can use Python: Select Interpreter to select Python from the virtual environment generated by uv. This will enable code navigation.

If you don't use VS Code, you can use the standard source .venv/bin/activate (or your shell's equivalent) to activate the virtual environment.

Alternatively, you can run commands in the project's virtual environment without activating it by running uv run ... (e.g., uv run pytest).
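
For example (both commands appear elsewhere in this README, and uv run works whether or not the environment is activated):

# Run the test suite and the project CLI through uv
uv run pytest
uv run ada-eval --help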

Project Development Info

Project structure

At a high level, the project is structured as follows:

.
β”œβ”€β”€ data  # Data directory containing various datasets
β”‚   β”œβ”€β”€ base  # Base data directory
β”‚   β”‚   β”œβ”€β”€ compacted  # Contains datasets in single jsonl files
β”‚   β”‚   └── expanded  # Contains expanded datasets. Much easier to read and modify vs the jsonl files
β”‚   β”œβ”€β”€ evaluated  # Datasets that contain generated completions and corresponding evaluation results
β”‚   └── generated  # Datasets that contain generated completions
β”œβ”€β”€ src  # Source code for ada-eval
β”œβ”€β”€ tests  # Tests for ada-eval
β”œβ”€β”€ tools  # Tools used by ada-eval during generation
β”œβ”€β”€ Makefile  # Contains various commands for convenience
β”œβ”€β”€ pyproject.toml  # Defines project dependencies and configuration
β”œβ”€β”€ README.md  # This file
└── uv.lock  # Lock file for dependencies managed by uv

Common Commands

To format your code, run:

make format

To run linter/type checker, run:

make check

To run the test suite, run:

make test

Adding or Updating Python Dependencies

To add a new Python dependency using uv, run

uv add <package-name>
uv sync

For more info see docs.

Our GitLab CI uses an internal package registry for Python dependencies, so the following additional steps are required if the new package(s) are not already in our registry.

  1. For each file which needs to be added to our package registry, find its URL (which will start with https://files.pythonhosted.org/packages) in uv.lock (a search sketch follows this list), and download it to a temporary directory with
    curl -L -O --output-dir <temp_directory> <url>
    
  2. Upload the files with
    TWINE_USERNAME=oauth2
    TWINE_PASSWORD=<personal_access_token>
    uvx twine upload --repository-url https://<gitlab_host>/api/v4/projects/<project_id_or_percent_encoded_path>/packages/pypi --skip-existing <temp_directory>/*
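
To locate the URLs mentioned in step 1, one option is to search uv.lock directly. This is a rough sketch; it assumes the URLs appear double-quoted in uv.lock, and you will still need to pick out the entries for the package(s) you added:

# List all pythonhosted.org download URLs recorded in the lock file
grep -o 'https://files.pythonhosted.org/packages[^"]*' uv.lock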
    

Usage

The project has a simple CLI, as well as a Makefile that contains some useful commands for your convenience.

You can see the CLI options by running:

uv run ada-eval --help

If you look at the make target generate-spark-claude, you will see an example of how to generate completions for our existing challenges, using Claude Code.

make generate-spark-claude
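
To see exactly what that shortcut runs, you can look at the target's definition in the Makefile, for example (grep with context lines is just one way to do this; the amount of context shown is arbitrary):

# Show the generate-spark-claude target and the lines that follow it
grep -A 5 'generate-spark-claude:' Makefile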

Adding a new Sample (Challenge/Problem)

In data/base/expanded, you will find directories that each represent a dataset; each dataset contains a number of samples (challenges/problems). A sample's structure depends on the type of the dataset. Currently we only have SPARK datasets, though there is some support for Explain and Ada datasets too.

The general idea of each of these types is:

  • Ada: Will require the modification of some Ada code to solve the challenge/problem
  • SPARK: Will require the modification of some SPARK code to solve the challenge/problem. This could include adding new SPARK contracts, modifying existing ones, or changing the implementation to satisfy the given challenge.
  • Explain: Contains some Ada/SPARK code and a question about it. The goal is to generate an explanation of the code that answers the question.

The rough structure of a SPARK sample is as follows:

example_sample
β”œβ”€β”€ base  # The project for the challenge. This will be duplicated somewhere for a tool to attempt to modify to solve the challenge
β”‚   β”œβ”€β”€ main.gpr
β”‚   └── src
β”œβ”€β”€ comments.md  # Comments about the sample for humans to better understand it
β”œβ”€β”€ other.json  # Contains additional metadata for the sample. For SPARK samples, this will include the name of the subprogram of interest, and the relative path to the file that contains its specification.
β”œβ”€β”€ prompt.md  # The prompt that is provided to the tool to specify the challenge/problem
β”œβ”€β”€ solution  # A solution to the challenge/problem. For SPARK samples, this will be a project that builds, passes the tests, and is accepted by gnatprove.
β”‚   β”œβ”€β”€ main.gpr
β”‚   β”œβ”€β”€ main.adc
β”‚   └── src
└── tests  # Unit test project for the sample. Used in addition to gnatprove to verify that the solution is correct.
    β”œβ”€β”€ src
    └── tests.gpr

Adding a new SPARK sample

  1. Create a new directory in data/base/expanded/your_dataset_of_choice with the name of the sample.
  2. Add the following:
    • base/: A project that contains the code that needs to be modified to solve the challenge/problem

    • solution/: A complete project that contains the modified code with the solved problem

    • tests/: A project that contains unit tests for the sample. It should compile to an executable at tests/bin/tests which returns a zero exit code if (and only if) the tests pass.

    • comments.md: If the challenge isn't obvious, consider including a more detailed description of the challenge/problem in this file

    • prompt.md: A prompt that describes the challenge/problem. This will be provided to the tool to generate a solution.

    • other.json: A file that contains additional metadata about the sample. For SPARK samples, this should look like:

      {
          "location": {
              "path": "src/relative/path/to/file.ads",  // Should be relative to the base/solution directory of the sample
              "subprogram_name": "Name_Of_Subprogram"  // The name of the subprogram that is the focus of the challenge/problem
          },
          "required_checks": [
              {
                  "rule": "VC_POSTCONDITION",  // Rule name of a check that must be proved
                  "entity_name": "Integer_Utils.Increment",  // Optional name of the entity the check must be attached to
                  "src_pattern": "Increment'Result = X \\+ 1;"  // Optional regex pattern which must match the proved source code
              },
              ...
          ]
      }
      

      The "path" from the "location" is used as the argument to gnatprove, so that only the corresponding unit, and any units it (recursively) depends on, are analysed during evaluation.

      The "required_checks" list is optional, and unnecessary for most samples, since the proof and/or tests will simply fail unless the problem is solved. However, for checks like postconditions, a desperate agent may "fix" the proof failure by simply removing or relaxing the condition. Where the proof of a postcondition or similar is a key part of a challenge/problem, a corresponding required_check should be added to ensure that such "fixes" are not erroneously evaluated as correct. The "src_pattern" must match starting from the location reported by gnatprove (run gnatprove on the solution with --report=all for an output which includes the line and column numbers of these locations; a manual-check sketch follows this list).

  3. Verify that the solution builds, proves, etc. by running
    make evaluate-canonical
    
    to populate other.json and then
    make check-datasets
    
    to verify everything is correct.
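
In addition to the make targets above, you can exercise the step 2 requirements by hand. The commands below are a sketch only: the gnatprove flags beyond --report=all are assumptions, the exact invocation used during evaluation may differ, and the tests executable must already have been built at tests/bin/tests.

# Manual spot checks, run from the sample's directory
cd data/base/expanded/your_dataset_of_choice/your_sample
gnatprove -P solution/main.gpr --report=all   # output includes the line/column locations used for src_pattern
./tests/bin/tests && echo "tests passed"      # must exit 0 if and only if the tests pass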

Generating new completions

To generate new completions for one or multiple datasets, you can use the generate command of the CLI. For example, to generate completions for all datasets using Claude Code, you can run:

uv run ada-eval generate --tool shell_script --tool-config-file tools/configs/claude_code_no_mcp.json

This is also available via the shortcut make generate-spark-claude.

Currently you have to specify the type of tool you want to use and the configuration file for that tool. The configuration files provide a place for modifying settings; at the moment only shell-script tools exist, and their configuration values specify where the script is located and how long it is allowed to run.
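
To see the concrete keys for the shell-script tool, you can simply look at the example configuration referenced above (this is just inspecting an existing file, not a schema definition):

# View the example Claude Code configuration
cat tools/configs/claude_code_no_mcp.json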

This interface is not final.

Evaluating completions

To evaluate completions that were generated as described in the previous section, you can run

uv run ada-eval evaluate

which will run all available evaluations on the generations in data/generated/, and save the results to data/evaluated/. This is also available via the shortcut make evaluate.

For more options, such as specifying which evaluations are run, see the output of

uv run ada-eval evaluate --help

This interface is not final.
