[ci-scan] Add evaluation infrastructure for the CI failure agentic workflows #130087
Draft
kotlarmilos wants to merge 1 commit into
…rkflows Stand up an on-demand Vally eval harness for the ci-failure-scan, ci-failure-fix, and ci-failure-scan-feedback workflows. A maintainer runs it from a PR comment to check whether a prompt change or model switch keeps the output good before merging. Each spec runs the prompt against a staged offline fixture and grades the artifact with static regex checks plus binary LLM-judge dimensions at a 0.7 threshold. Comparison against scraped ground truth is left as a follow-up; this iteration only sets up the harness, spec layout, and fixture slots. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Tagging subscribers to this area: @dotnet/runtime-infrastructure
Pull request overview
Adds an on-demand GitHub Actions harness plus Vally eval specs to evaluate the three CI outer-loop agent workflows (scan / fix / feedback) against offline fixtures, producing a pass/fail report comment on the triggering PR.
Changes:
- Adds `ci-eval.yml`, an `issue_comment`-triggered workflow that selects eval specs from `/ci-* eval` commands, runs `vally lint` + `vally eval`, and posts/updates a single report comment.
- Adds three Vally eval specs defining fixture staging, prompts, and graders for scan/fix/feedback.
- Adds Vally config + ignore rules, and placeholder fixture directories.
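The command-to-spec selection summarized above could look roughly like the sketch below. It is not the code in `ci-eval.yml`: the `selectSpecs` name, the rule that only the first line of the comment is matched, and the idea that a bare `/ci-eval` selects all three specs are assumptions made for illustration. Only the spec file names come from the file summary of this PR.

```javascript
// Hypothetical mapping from slash commands to the eval specs added in this PR.
const SPECS = {
  scan: 'ci-failure-scan.eval.yaml',
  fix: 'ci-failure-fix.eval.yaml',
  feedback: 'ci-failure-scan-feedback.eval.yaml',
};

// Returns the spec files a PR comment asks for, or [] if it is not a command.
// Assumption: the command must be the whole first line of the comment.
function selectSpecs(commentBody) {
  const firstLine = commentBody.trim().split(/\r?\n/)[0].trim();
  // Assumption: a bare /ci-eval means "run every spec".
  if (firstLine === '/ci-eval') return Object.values(SPECS);
  const match = /^\/ci-(scan|fix|feedback) eval$/.exec(firstLine);
  return match ? [SPECS[match[1]]] : [];
}
```

Returning an empty list for anything that is not a recognized command lets the caller skip the run without treating ordinary PR comments as errors.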
Show a summary per file
| File | Description |
|---|---|
| .github/workflows/ci-eval.yml | New slash-command-driven eval runner workflow that executes Vally and posts a report comment back to the PR. |
| .github/workflows/evals/ci-failure-scan.eval.yaml | Vally eval spec for the scanner workflow prompt + graders (file regex + LLM judges). |
| .github/workflows/evals/ci-failure-fix.eval.yaml | Vally eval spec for the fixer workflow prompt + graders enforcing tier policy and “no muting”. |
| .github/workflows/evals/ci-failure-scan-feedback.eval.yaml | Vally eval spec for the feedback/KPI tracker regeneration prompt + graders for grounding/completeness. |
| .github/workflows/evals/.vally.yaml | Vally configuration pointing at the eval spec directory and filename pattern. |
| .github/workflows/evals/.gitignore | Ignores Vally workspaces/output directories produced by local runs. |
| .github/workflows/evals/fixtures/scan/.gitkeep | Placeholder to ensure the scan fixture directory exists in-repo. |
| .github/workflows/evals/fixtures/fix/.gitkeep | Placeholder to ensure the fix fixture directory exists in-repo. |
| .github/workflows/evals/fixtures/feedback/.gitkeep | Placeholder to ensure the feedback fixture directory exists in-repo. |
Copilot's findings
- Files reviewed: 6/9 changed files
- Comments generated: 2
Comment on lines +7 to +11

    permissions:
      contents: read
      pull-requests: write
      issues: read
Comment on lines +121 to +122

    const { data: comments } = await github.rest.issues.listComments({ owner, repo, issue_number });
    const existing = comments.find(c => c.body && c.body.includes(marker));
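For context, the lookup in lines +121 to +122 is the first half of a create-or-update pattern for the single report comment. A minimal sketch of the whole pattern follows, assuming `github` is the Octokit client that `actions/github-script` provides. The `findReportComment` and `upsertReport` names are hypothetical and this is not the code from the PR. The sketch uses `github.paginate` because `listComments` returns only one page per call (30 comments by default), so a marker on a later page would otherwise be missed.

```javascript
// Pure helper: return the comment whose body carries the marker, or null.
function findReportComment(comments, marker) {
  return comments.find(c => c.body && c.body.includes(marker)) || null;
}

// Hypothetical create-or-update step for the single report comment.
async function upsertReport(github, { owner, repo, issue_number, marker, body }) {
  // Walk every page of comments instead of only the first one.
  const comments = await github.paginate(github.rest.issues.listComments, {
    owner, repo, issue_number, per_page: 100,
  });
  const existing = findReportComment(comments, marker);
  const fullBody = `${marker}\n${body}`;
  if (existing) {
    await github.rest.issues.updateComment({
      owner, repo, comment_id: existing.id, body: fullBody,
    });
    return 'updated';
  }
  await github.rest.issues.createComment({ owner, repo, issue_number, body: fullBody });
  return 'created';
}
```

Keeping the search in a pure helper makes the marker matching testable without a live API client.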
vitek-karas reviewed Jul 1, 2026

    defaults:
      judge_model: claude-opus-4.6
      model: claude-opus-4.6

Change to 4.7 or 4.8, 4.6 is discontinued.
Description
This PR stands up the infrastructure for evaluating the three CI outer-loop agentic workflows, `ci-failure-scan`, `ci-failure-fix`, and `ci-failure-scan-feedback`, using Vally. It does not change the workflows themselves and does not wire the eval into required CI; it adds a harness a maintainer runs on demand.

The intent is two-fold. The part this PR implements is a pre-merge check: when we change one of the prompts or switch the model behind it, a maintainer comments `/ci-scan eval`, `/ci-fix eval`, `/ci-feedback eval`, or `/ci-eval` on the PR, and `ci-eval.yml` checks out the PR head, runs the affected Vally spec against the Copilot-SDK executor, and posts a single pass/fail report back. Each spec runs its prompt against a staged offline fixture so the run is deterministic, then grades the artifact the agent produces with three static regex checks on the content and three binary LLM-judge dimensions, at a 0.7 threshold so a single grader can miss without failing the run. The graders look for substance rather than format alone: a scan producing a single stable KBE grounded in a log excerpt that exists in the fixture, a fix emitting a real PR shape or a reasoned comment that never mutes a test, and feedback regenerating a quantified, period-over-period KPI tracker driven by a real maintainer signal.

The second part, comparing what the workflows generate against the ground truth of the failures and KBEs that actually exist, is not implemented here. This iteration is minimal and only sets up the structure it will hang off: the spec layout, the report path, and the per-workflow fixture directories.
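To make the 0.7 threshold concrete, here is a small worked sketch. It assumes the run score is the fraction of the six graders (three regex checks, three LLM-judge dimensions) that pass, with equal weights. That aggregation rule is an assumption for illustration, not a statement of how Vally computes scores, and `passFraction` and `runPasses` are hypothetical names.

```javascript
// Assumption: score = passed graders / total graders, all weighted equally.
function passFraction(results) {
  const passed = results.filter(Boolean).length;
  return passed / results.length;
}

// A run passes when its score reaches the threshold (0.7 in this PR).
function runPasses(results, threshold = 0.7) {
  return passFraction(results) >= threshold;
}

// Six graders with one miss: 5/6 is about 0.83, which clears 0.7.
const oneMiss = [true, true, true, true, true, false];
// Six graders with two misses: 4/6 is about 0.67, which falls short of 0.7.
const twoMisses = [true, true, true, true, false, false];

console.log(runPasses(oneMiss), runPasses(twoMisses));
```

Under this assumption the threshold tolerates exactly one failing grader out of six, which matches the stated intent that a single grader can miss without failing the run.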
The next steps, in a follow-up, are tooling that scrapes the ground truth (the real recorded failures and the KBEs and fixes they produced) and a comparison that scores the workflow output against it on the same inputs.