[ci-scan] Add evaluation infrastructure for the CI failure agentic workflows #130087

Draft
kotlarmilos wants to merge 1 commit into dotnet:main from kotlarmilos:kotlarmilos/ci-workflow-evals

kotlarmilos (Member) commented Jul 1, 2026

Description

This PR stands up the infrastructure for evaluating the three CI outer-loop agentic workflows (ci-failure-scan, ci-failure-fix, and ci-failure-scan-feedback) using Vally. It does not change the workflows themselves and does not wire the eval into required CI; it adds a harness that a maintainer runs on demand.

The intent is two-fold. The part this PR implements is a pre-merge check: when we change one of the prompts or switch the model behind it, a maintainer comments /ci-scan eval, /ci-fix eval, /ci-feedback eval, or /ci-eval on the PR, and ci-eval.yml checks out the PR head, runs the affected Vally spec against the Copilot-SDK executor, and posts a single pass/fail report back. Each spec runs its prompt against a staged offline fixture so the run is deterministic, then grades the artifact the agent produces with three static regex checks on the content and three binary LLM-judge dimensions, at a 0.7 threshold so a single grader can miss without failing the run. The graders look for substance rather than format alone: a scan producing a single stable KBE grounded in a log excerpt that exists in the fixture, a fix emitting a real PR shape or a reasoned comment that never mutes a test, and feedback regenerating a quantified, period-over-period KPI tracker driven by a real maintainer signal.
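To make the 0.7 threshold concrete, here is a minimal sketch of the arithmetic. It assumes, without confirmation from the PR, that a run is scored as the unweighted fraction of graders that pass. `runPasses` and the grader names are hypothetical and are not Vally's API. Under that assumption, one miss out of six graders scores 5/6 (about 0.83) and passes, while two misses score 4/6 (about 0.67) and fail.

```javascript
// Hypothetical helper: not Vally's API, only the arithmetic behind the threshold.
// Assumes the run score is the unweighted fraction of graders that passed.
function runPasses(graderResults, threshold = 0.7) {
  const results = Object.values(graderResults);
  if (results.length === 0) {
    throw new Error("at least one grader result is required");
  }
  const passedCount = results.filter(Boolean).length;
  const score = passedCount / results.length;
  return { score, passed: score >= threshold };
}

// Three static regex checks plus three binary LLM-judge dimensions.
// The names are illustrative placeholders.
const oneMiss = {
  regexA: true, regexB: true, regexC: true,
  judgeA: true, judgeB: false, judgeC: true,
};
const twoMisses = { ...oneMiss, regexC: false };

console.log(runPasses(oneMiss).passed);   // true  (5 of 6 graders passed)
console.log(runPasses(twoMisses).passed); // false (4 of 6 graders passed)
```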

The second part, comparing what the workflows generate against the ground truth of the failures and KBEs that actually exist, is not implemented here. This iteration is minimal and only sets up the structure it will hang off: the spec layout, the report path, and the per-workflow fixture directories.

The next steps, in a follow-up, are tooling that scrapes the ground truth (the real recorded failures, plus the KBEs and fixes they produced) and a comparison that scores the workflow output against it on the same inputs.

…rkflows

Stand up an on-demand Vally eval harness for the ci-failure-scan,
ci-failure-fix, and ci-failure-scan-feedback workflows. A maintainer
runs it from a PR comment to check whether a prompt change or model
switch keeps the output good before merging. Each spec runs the prompt
against a staged offline fixture and grades the artifact with static
regex checks plus binary LLM-judge dimensions at a 0.7 threshold.
Comparison against scraped ground truth is left as a follow-up; this
iteration only sets up the harness, spec layout, and fixture slots.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

@dotnet-policy-service (Contributor)

Tagging subscribers to this area: @dotnet/runtime-infrastructure
See info in area-owners.md if you want to be subscribed.

Copilot AI (Contributor) left a comment

Pull request overview

Adds an on-demand GitHub Actions harness plus Vally eval specs to evaluate the three CI outer-loop agent workflows (scan / fix / feedback) against offline fixtures, producing a pass/fail report comment on the triggering PR.

Changes:

  • Adds ci-eval.yml, an issue_comment-triggered workflow that selects eval specs from /ci-* eval commands, runs vally lint + vally eval, and posts/updates a single report comment.
  • Adds three Vally eval specs defining fixture staging, prompts, and graders for scan/fix/feedback.
  • Adds Vally config + ignore rules, and placeholder fixture directories.
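The command-to-spec selection described in the first bullet could look roughly like the sketch below. The spec file names come from the PR; `SPECS` and `selectSpecs` are hypothetical, and two behaviours are assumptions: that a command must sit alone on its own line, and that `/ci-eval` selects all three specs.

```javascript
// Hypothetical mapping from slash command to eval spec file.
const SPECS = {
  scan: "ci-failure-scan.eval.yaml",
  fix: "ci-failure-fix.eval.yaml",
  feedback: "ci-failure-scan-feedback.eval.yaml",
};

// Returns the spec files selected by a PR comment body, or [] if it holds no command.
// Assumptions: a command sits alone on its line, and /ci-eval selects every spec.
function selectSpecs(commentBody) {
  const selected = new Set();
  for (const rawLine of commentBody.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === "/ci-eval") {
      Object.values(SPECS).forEach((spec) => selected.add(spec));
      continue;
    }
    const match = /^\/ci-(scan|fix|feedback)\s+eval$/.exec(line);
    if (match) {
      selected.add(SPECS[match[1]]);
    }
  }
  return [...selected];
}

console.log(selectSpecs("/ci-fix eval"));        // [ 'ci-failure-fix.eval.yaml' ]
console.log(selectSpecs("/ci-eval").length);     // 3
console.log(selectSpecs("please run the eval")); // []
```

Collecting into a `Set` means a comment that repeats a command still runs each spec once.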

Summary per file:

| File | Description |
| --- | --- |
| .github/workflows/ci-eval.yml | New slash-command-driven eval runner workflow that executes Vally and posts a report comment back to the PR. |
| .github/workflows/evals/ci-failure-scan.eval.yaml | Vally eval spec for the scanner workflow prompt + graders (file regex + LLM judges). |
| .github/workflows/evals/ci-failure-fix.eval.yaml | Vally eval spec for the fixer workflow prompt + graders enforcing tier policy and “no muting”. |
| .github/workflows/evals/ci-failure-scan-feedback.eval.yaml | Vally eval spec for the feedback/KPI tracker regeneration prompt + graders for grounding/completeness. |
| .github/workflows/evals/.vally.yaml | Vally configuration pointing at the eval spec directory and filename pattern. |
| .github/workflows/evals/.gitignore | Ignores Vally workspaces/output directories produced by local runs. |
| .github/workflows/evals/fixtures/scan/.gitkeep | Placeholder to ensure the scan fixture directory exists in-repo. |
| .github/workflows/evals/fixtures/fix/.gitkeep | Placeholder to ensure the fix fixture directory exists in-repo. |
| .github/workflows/evals/fixtures/feedback/.gitkeep | Placeholder to ensure the feedback fixture directory exists in-repo. |

Copilot's findings

  • Files reviewed: 6/9 changed files
  • Comments generated: 2

Comment on lines +7 to +11
permissions:
  contents: read
  pull-requests: write
  issues: read

Comment on lines +121 to +122
const { data: comments } = await github.rest.issues.listComments({ owner, repo, issue_number });
const existing = comments.find(c => c.body && c.body.includes(marker));
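For context on the two lines above: a single `listComments` call returns one page of results (30 comments by default), so a marker lookup can miss the existing report on a long thread. Below is a hedged sketch of a find-or-update step that walks every page through Octokit's `paginate`. `upsertReport`, `makeFakeGithub`, and the marker string are hypothetical, and the in-memory client exists only so the sketch runs without network access; this is not the PR's actual script.

```javascript
// Sketch of a find-or-update step, assuming an Octokit-style client such as
// the `github` object that actions/github-script provides.
async function upsertReport(github, { owner, repo, issue_number, marker, body }) {
  // paginate() follows every page; a single listComments call returns one page only.
  const comments = await github.paginate(github.rest.issues.listComments, {
    owner, repo, issue_number, per_page: 100,
  });
  const existing = comments.find((c) => c.body && c.body.includes(marker));
  const fullBody = `${marker}\n${body}`;
  if (existing) {
    await github.rest.issues.updateComment({ owner, repo, comment_id: existing.id, body: fullBody });
    return { action: "updated", id: existing.id };
  }
  const { data } = await github.rest.issues.createComment({ owner, repo, issue_number, body: fullBody });
  return { action: "created", id: data.id };
}

// Hypothetical in-memory stand-in for the client, for illustration only.
function makeFakeGithub(initialComments = []) {
  const store = [...initialComments];
  const issues = {
    listComments: async () => ({ data: store }),
    updateComment: async ({ comment_id, body }) => {
      store.find((c) => c.id === comment_id).body = body;
      return { data: { id: comment_id } };
    },
    createComment: async ({ body }) => {
      const comment = { id: store.length + 1, body };
      store.push(comment);
      return { data: comment };
    },
  };
  return {
    store,
    rest: { issues },
    paginate: async (method, params) => (await method(params)).data,
  };
}

(async () => {
  const github = makeFakeGithub([{ id: 1, body: "unrelated comment" }]);
  // The marker value is a placeholder, not the one used in ci-eval.yml.
  const args = { owner: "o", repo: "r", issue_number: 1, marker: "<!-- ci-eval-report -->" };
  console.log(await upsertReport(github, { ...args, body: "first run" }));  // action: "created"
  console.log(await upsertReport(github, { ...args, body: "second run" })); // action: "updated"
})();
```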

defaults:
  judge_model: claude-opus-4.6
  model: claude-opus-4.6

Member

Change to 4.7 or 4.8; 4.6 is discontinued.

3 participants