<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Evaluation &amp; Quality on MinimumCD Practice Guide</title><link>https://beyond.minimumcd.org/docs/agentic-cd/evaluation/</link><description>Recent content in Evaluation &amp; Quality on MinimumCD Practice Guide</description><generator>Hugo</generator><language>en</language><atom:link href="https://beyond.minimumcd.org/docs/agentic-cd/evaluation/index.xml" rel="self" type="application/rss+xml"/><item><title>AI Eval Methodology for Coding Tools</title><link>https://beyond.minimumcd.org/docs/agentic-cd/evaluation/ai-eval-methodology/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://beyond.minimumcd.org/docs/agentic-cd/evaluation/ai-eval-methodology/</guid><description>&lt;div class="pageinfo pageinfo-primary"&gt;
&lt;p&gt;AI coding tools produce non-deterministic output. Evals make that output observable and measurable using three grading layers: deterministic checks, transcript analysis, and LLM rubrics.&lt;/p&gt;
&lt;/div&gt;
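&lt;p&gt;As a concrete illustration, one task can combine all three layers in a single Promptfoo-style test. This is a minimal sketch, not taken from the dev-plugins repository: the task wording, expected strings, and grader path are hypothetical, and it assumes Promptfoo&amp;rsquo;s &lt;code&gt;contains&lt;/code&gt;, &lt;code&gt;python&lt;/code&gt;, and &lt;code&gt;llm-rubric&lt;/code&gt; assertion types.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# One task, three grading layers (illustrative values)
tests:
  - description: Rename an internal helper without changing the public API
    vars:
      task: Rename parseConfigInternal to loadConfigInternal
    assert:
      # Layer 1: deterministic check on the final outcome
      - type: contains
        value: export function parseConfig
      # Layer 2: transcript analysis via a custom grader script (hypothetical path)
      - type: python
        value: "file://graders/check_transcript.py"
      # Layer 3: LLM rubric for qualities that resist exact matching
      - type: llm-rubric
        value: The diff is minimal and touches no unrelated files.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The deterministic layer catches most regressions cheaply; the rubric is reserved for qualities that resist exact matching.&lt;/p&gt;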
&lt;p&gt;This guide is for teams building AI coding tools and platform teams providing
shared AI enablement infrastructure. For team-specific eval setup, see
&lt;a href="https://beyond.minimumcd.org/docs/agentic-cd/evaluation/team-ai-evals/"&gt;Team AI Evals&lt;/a&gt;. For platform-scale patterns, see
&lt;a href="https://beyond.minimumcd.org/docs/agentic-cd/evaluation/platform-ai-evals/"&gt;AI Evals for AI Enablement Platforms&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="terminology"&gt;Terminology&lt;/h2&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Term&lt;/th&gt;
 &lt;th&gt;Definition&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Task&lt;/td&gt;
 &lt;td&gt;A single work item given to the agent (one prompt + one fixture)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Trial&lt;/td&gt;
 &lt;td&gt;One execution of a task; multiple trials measure variance&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Grader&lt;/td&gt;
 &lt;td&gt;An automated check that scores agent output (pass/fail or 0-1)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Transcript&lt;/td&gt;
 &lt;td&gt;The full agent conversation log: tool calls, reasoning, output&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Outcome&lt;/td&gt;
 &lt;td&gt;The agent&amp;rsquo;s final output for a task&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Evaluation harness&lt;/td&gt;
 &lt;td&gt;The framework that runs tasks, collects outcomes, applies graders&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Agent harness&lt;/td&gt;
 &lt;td&gt;The runtime that executes the agent (e.g., Claude Code)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Evaluation suite&lt;/td&gt;
 &lt;td&gt;A collection of related tasks testing one capability dimension&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;In the &lt;a href="https://github.com/bailejl/dev-plugins"&gt;dev-plugins&lt;/a&gt; reference implementation:&lt;/strong&gt; Promptfoo is the evaluation harness. Claude Code is the &lt;a href="https://beyond.minimumcd.org/docs/reference/glossary/#agent-ai"&gt;agent&lt;/a&gt;
harness. YAML files in &lt;code&gt;evals/&amp;lt;plugin&amp;gt;/suites/&lt;/code&gt; are evaluation suites.&lt;/p&gt;</description></item><item><title>Team AI Evals for Coding Tools</title><link>https://beyond.minimumcd.org/docs/agentic-cd/evaluation/team-ai-evals/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://beyond.minimumcd.org/docs/agentic-cd/evaluation/team-ai-evals/</guid><description>&lt;div class="pageinfo pageinfo-primary"&gt;
&lt;p&gt;If you would notice a regression, it needs an eval. This page covers setting up eval infrastructure, writing your first positive and negative tests, choosing graders, and integrating evals into your &lt;a href="https://beyond.minimumcd.org/docs/reference/glossary/#pipeline"&gt;pipeline&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
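&lt;p&gt;As a starting point, a first positive/negative pair for a single command might look like the sketch below. The command, expected strings, and paths are hypothetical, and the &lt;code&gt;not-contains&lt;/code&gt; and &lt;code&gt;llm-rubric&lt;/code&gt; assertion types are assumed Promptfoo features rather than something lifted from dev-plugins.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# A first positive/negative pair for one plugin command (illustrative values)
tests:
  - description: positive - performs the requested scaffolding
    vars:
      task: Scaffold a health-check endpoint with a passing test
    assert:
      - type: contains
        value: GET /health
  - description: negative - stays within scope
    vars:
      task: Scaffold a health-check endpoint
    assert:
      # The agent should not rewrite unrelated CI configuration
      - type: not-contains
        value: ".github/workflows"
      - type: llm-rubric
        value: The agent limits its changes to the endpoint and its test.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The negative case is what catches scope creep that a positive check alone would never notice.&lt;/p&gt;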
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Reference implementation:&lt;/strong&gt; The &lt;a href="https://github.com/bailejl/dev-plugins"&gt;dev-plugins&lt;/a&gt; repository demonstrates these patterns with Promptfoo, Claude Code, and custom graders.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="what-needs-evals"&gt;What Needs Evals&lt;/h2&gt;
&lt;p&gt;Not every AI interaction needs an eval. Use this heuristic: &lt;strong&gt;if you would notice a
regression, it needs an eval.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Artifacts that need evals:&lt;/p&gt;</description></item><item><title>AI Evals for AI Enablement Platforms</title><link>https://beyond.minimumcd.org/docs/agentic-cd/evaluation/platform-ai-evals/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://beyond.minimumcd.org/docs/agentic-cd/evaluation/platform-ai-evals/</guid><description>&lt;div class="pageinfo pageinfo-primary"&gt;
&lt;p&gt;Platform teams build reusable AI coding tools for multiple teams. Shared eval infrastructure (base configs, grader libraries, rubric templates) eliminates duplication and enforces consistency across the plugin portfolio.&lt;/p&gt;
&lt;/div&gt;
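&lt;p&gt;In practice the shared layer can be as small as one base config that every plugin suite layers its own tests on top of. The sketch below is illustrative only: the file names and grader behavior are hypothetical, and it assumes Promptfoo&amp;rsquo;s &lt;code&gt;defaultTest&lt;/code&gt; key and &lt;code&gt;file://&lt;/code&gt; grader references behave as expected.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# shared/base-graders.yaml - defaults every plugin suite builds on (hypothetical layout)
defaultTest:
  assert:
    # Portfolio-wide deterministic check from the shared grader library
    - type: python
      value: "file://shared/graders/no_secrets_in_output.py"
    # Portfolio-wide rubric template, reused verbatim by every suite
    - type: llm-rubric
      value: The change stays within the scope of the requested task.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each plugin suite then adds only the tests specific to its own capability, while grader code and rubric wording change in one place.&lt;/p&gt;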
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Reference implementation:&lt;/strong&gt; The &lt;a href="https://github.com/bailejl/dev-plugins"&gt;dev-plugins&lt;/a&gt; repository demonstrates these patterns with Promptfoo, Claude Code, and custom graders.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="what-is-an-ai-enablement-platform"&gt;What is an AI Enablement Platform&lt;/h2&gt;
&lt;p&gt;An AI enablement platform is the team that builds reusable AI coding tools (&lt;a href="https://beyond.minimumcd.org/docs/reference/glossary/#prompt"&gt;prompts&lt;/a&gt;,
&lt;a href="https://beyond.minimumcd.org/docs/reference/glossary/#agent-ai"&gt;agents&lt;/a&gt;, plugins, skills) for multiple teams in an organization. Instead of every team
writing their own code review agent or scaffolding command, the platform team builds
these once and distributes them.&lt;/p&gt;</description></item></channel></rss>