[ci-scan] Add evaluation infrastructure for the CI failure agentic workflows by kotlarmilos · Pull Request #130087 · dotnet/runtime

kotlarmilos · 2026-07-01T15:50:25Z

Description

This PR stands up the infrastructure for evaluating the three CI outer-loop agentic workflows (ci-failure-scan, ci-failure-fix, and ci-failure-scan-feedback) using Vally. It does not change the workflows themselves and does not wire the eval into required CI; it adds a harness that a maintainer runs on demand.

The intent is twofold. The part this PR implements is a pre-merge check: when we change one of the prompts or switch the model behind it, a maintainer comments /ci-scan eval, /ci-fix eval, /ci-feedback eval, or /ci-eval on the PR. ci-eval.yml then checks out the PR head, runs the affected Vally spec against the Copilot-SDK executor, and posts a single pass/fail report back. Each spec runs its prompt against a staged offline fixture so the run is deterministic, then grades the artifact the agent produces with three static regex checks on the content and three binary LLM-judge dimensions. The pass threshold is 0.7, so a single grader can miss without failing the run. The graders look for substance rather than format alone: a scan produces a single stable KBE grounded in a log excerpt that exists in the fixture; a fix emits a real PR shape or a reasoned comment and never mutes a test; and feedback regenerates a quantified, period-over-period KPI tracker driven by a real maintainer signal.
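To make the 0.7 threshold concrete, here is a minimal sketch of the arithmetic. It assumes each of the six graders counts equally and that the run score is the fraction of graders that pass. That aggregation rule is an assumption for illustration only; this description does not show how Vally combines grader results, and `runPasses` is a hypothetical helper, not part of Vally.

```javascript
// Hypothetical helper, not Vally's API.
// Assumption: the score is the unweighted fraction of graders that pass.
function runPasses(graderResults, threshold = 0.7) {
  const passed = graderResults.filter(Boolean).length;
  const score = passed / graderResults.length;
  return { score, pass: score >= threshold };
}

// Three regex checks followed by three binary LLM-judge dimensions.
const oneMiss = runPasses([true, true, true, true, true, false]);
const twoMisses = runPasses([true, true, false, true, true, false]);

console.log(oneMiss.pass, twoMisses.pass); // prints "true false"
```

Under that assumption, five of six passing graders score about 0.83 and clear the threshold, while four of six score about 0.67 and fail, which matches the "a single grader can miss" wording.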

The second part, comparing what the workflows generate against the ground truth of the failures and KBEs that actually exist, is not implemented here. This iteration is minimal and only sets up the structure that comparison will build on: the spec layout, the report path, and the per-workflow fixture directories.

The next steps, in a follow-up, are the tooling that scrapes the ground truth (the real recorded failures and the KBEs and fixes they produced) and a comparison that scores the workflow output against it on the same inputs.

…rkflows

Stand up an on-demand Vally eval harness for the ci-failure-scan, ci-failure-fix, and ci-failure-scan-feedback workflows. A maintainer runs it from a PR comment to check whether a prompt change or model switch keeps the output good before merging. Each spec runs the prompt against a staged offline fixture and grades the artifact with static regex checks plus binary LLM-judge dimensions at a 0.7 threshold. Comparison against scraped ground truth is left as a follow-up; this iteration only sets up the harness, spec layout, and fixture slots.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

dotnet-policy-service · 2026-07-01T15:51:42Z

Tagging subscribers to this area: @dotnet/runtime-infrastructure
See info in area-owners.md if you want to be subscribed.

Copilot

Pull request overview

Adds an on-demand GitHub Actions harness plus Vally eval specs to evaluate the three CI outer-loop agent workflows (scan / fix / feedback) against offline fixtures, producing a pass/fail report comment on the triggering PR.

Changes:

Adds ci-eval.yml, an issue_comment-triggered workflow that selects eval specs from /ci-* eval commands, runs vally lint + vally eval, and posts/updates a single report comment.
Adds three Vally eval specs defining fixture staging, prompts, and graders for scan/fix/feedback.
Adds Vally config + ignore rules, and placeholder fixture directories.
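The spec selection described in the first change could look roughly like the sketch below. The command strings and spec file names are taken from this PR; everything else is assumed for illustration, including the helper name `selectSpecs`, the rule that a command must sit alone on its own line, and the guess that the bare /ci-eval command runs all three specs. The actual logic in ci-eval.yml is not shown here.

```javascript
// Spec file names come from the PR's file list; the mapping logic is assumed.
const SPECS = new Map([
  ["/ci-scan eval", ["ci-failure-scan.eval.yaml"]],
  ["/ci-fix eval", ["ci-failure-fix.eval.yaml"]],
  ["/ci-feedback eval", ["ci-failure-scan-feedback.eval.yaml"]],
]);
// Assumption: the bare /ci-eval command selects every spec.
SPECS.set("/ci-eval", [...SPECS.values()].flat());

// Hypothetical helper: returns the specs named by the first line of the
// comment that is exactly one of the known commands, or [] if none matches.
function selectSpecs(commentBody) {
  for (const line of commentBody.split(/\r?\n/)) {
    const specs = SPECS.get(line.trim());
    if (specs) return specs;
  }
  return [];
}

console.log(selectSpecs("/ci-fix eval"));
```

Matching whole lines rather than substrings keeps a sentence that merely mentions a command from triggering a run; whether the real workflow is that strict is not stated in the PR.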


File	Description
.github/workflows/ci-eval.yml	New slash-command-driven eval runner workflow that executes Vally and posts a report comment back to the PR.
.github/workflows/evals/ci-failure-scan.eval.yaml	Vally eval spec for the scanner workflow prompt + graders (file regex + LLM judges).
.github/workflows/evals/ci-failure-fix.eval.yaml	Vally eval spec for the fixer workflow prompt + graders enforcing tier policy and “no muting”.
.github/workflows/evals/ci-failure-scan-feedback.eval.yaml	Vally eval spec for the feedback/KPI tracker regeneration prompt + graders for grounding/completeness.
.github/workflows/evals/.vally.yaml	Vally configuration pointing at the eval spec directory and filename pattern.
.github/workflows/evals/.gitignore	Ignores Vally workspaces/output directories produced by local runs.
.github/workflows/evals/fixtures/scan/.gitkeep	Placeholder to ensure the scan fixture directory exists in-repo.
.github/workflows/evals/fixtures/fix/.gitkeep	Placeholder to ensure the fix fixture directory exists in-repo.
.github/workflows/evals/fixtures/feedback/.gitkeep	Placeholder to ensure the feedback fixture directory exists in-repo.

Copilot's findings

Files reviewed: 6/9 changed files
Comments generated: 2

+permissions:
+  contents: read
+  pull-requests: write
+  issues: read
+


+            const { data: comments } = await github.rest.issues.listComments({ owner, repo, issue_number });
+            const existing = comments.find(c => c.body && c.body.includes(marker));
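The two lines above are the lookup half of the "single report comment" behavior. As a hedged sketch of the whole pattern: find a comment carrying a hidden marker, update it if present, otherwise create one. The marker value and the helper names `planReport` and `upsertReport` are hypothetical. The sketch uses `github.paginate` because a single `listComments` call returns only one page of results (30 by default in the GitHub REST API); whether ci-eval.yml needs that depends on how long the PR conversation gets.

```javascript
// Hidden HTML marker identifying the report comment (value is hypothetical).
const MARKER = "<!-- ci-eval-report -->";

// Pure decision step: update the existing report if one carries the marker,
// otherwise create a new comment.
function planReport(comments, marker, report) {
  const body = `${marker}\n${report}`;
  const existing = comments.find(c => c.body && c.body.includes(marker));
  return existing
    ? { action: "update", comment_id: existing.id, body }
    : { action: "create", body };
}

// Wiring for actions/github-script, where `github` is an authenticated Octokit.
async function upsertReport(github, { owner, repo, issue_number }, report) {
  // paginate() walks every page instead of stopping after the first one.
  const comments = await github.paginate(github.rest.issues.listComments, {
    owner, repo, issue_number, per_page: 100,
  });
  const plan = planReport(comments, MARKER, report);
  if (plan.action === "update") {
    await github.rest.issues.updateComment({
      owner, repo, comment_id: plan.comment_id, body: plan.body,
    });
  } else {
    await github.rest.issues.createComment({
      owner, repo, issue_number, body: plan.body,
    });
  }
  return plan.action;
}
```

Keeping the decision in a pure function makes it checkable without network access; only `upsertReport` touches the API.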


vitek-karas · 2026-07-01T20:07:47Z

+
+defaults:
+  judge_model: claude-opus-4.6
+  model: claude-opus-4.6


Change to 4.7 or 4.8; 4.6 is discontinued.

Copilot AI review requested due to automatic review settings July 1, 2026 15:50

github-actions Bot added the area-Infrastructure label Jul 1, 2026

github-project-automation Bot added this to Runtime Infra Jul 1, 2026

Copilot started reviewing on behalf of kotlarmilos July 1, 2026 15:50

dotnet-policy-service Bot assigned kotlarmilos Jul 1, 2026

Copilot AI reviewed Jul 1, 2026


kotlarmilos requested review from JanKrivanek, PureWeen, ViktorHofer and vitek-karas July 1, 2026 16:00

vitek-karas reviewed Jul 1, 2026


kotlarmilos wants to merge 1 commit into dotnet:main from kotlarmilos:kotlarmilos/ci-workflow-evals
