<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Evaluation &amp; Quality on MinimumCD Practice Guide</title><link>https://beyond.minimumcd.org/docs/agentic-cd/evaluation/</link><description>Recent content in Evaluation &amp; Quality on MinimumCD Practice Guide</description><generator>Hugo</generator><language>en</language><atom:link href="https://beyond.minimumcd.org/docs/agentic-cd/evaluation/index.xml" rel="self" type="application/rss+xml"/><item><title>AI Eval Methodology for Coding Tools</title><link>https://beyond.minimumcd.org/docs/agentic-cd/evaluation/ai-eval-methodology/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://beyond.minimumcd.org/docs/agentic-cd/evaluation/ai-eval-methodology/</guid><description>&lt;div class="pageinfo pageinfo-primary"&gt;
&lt;p&gt;AI coding tools produce non-deterministic output. Evals make that output observable and measurable using three grading layers: deterministic checks, transcript analysis, and LLM rubrics.&lt;/p&gt;
&lt;/div&gt;
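&lt;p&gt;As a concrete illustration, one task can combine all three layers in a single Promptfoo-style test. This is a minimal sketch, not taken from the dev-plugins repository: the task wording, expected strings, and grader path are hypothetical, and it assumes Promptfoo&amp;rsquo;s &lt;code&gt;contains&lt;/code&gt;, &lt;code&gt;python&lt;/code&gt;, and &lt;code&gt;llm-rubric&lt;/code&gt; assertion types.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# One task, three grading layers (illustrative values)
tests:
  - description: Rename an internal helper without changing the public API
    vars:
      task: Rename parseConfigInternal to loadConfigInternal
    assert:
      # Layer 1: deterministic check on the final outcome
      - type: contains
        value: export function parseConfig
      # Layer 2: transcript analysis via a custom grader script (hypothetical path)
      - type: python
        value: "file://graders/check_transcript.py"
      # Layer 3: LLM rubric for qualities that resist exact matching
      - type: llm-rubric
        value: The diff is minimal and touches no unrelated files.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The deterministic layer catches most regressions cheaply; the rubric is reserved for qualities that resist exact matching.&lt;/p&gt;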
&lt;p&gt;This guide is for teams building AI coding tools and platform teams providing
shared AI enablement infrastructure. For team-specific eval setup, see
&lt;a href="https://beyond.minimumcd.org/docs/agentic-cd/evaluation/team-ai-evals/"&gt;Team AI Evals&lt;/a&gt;. For platform-scale patterns, see
&lt;a href="https://beyond.minimumcd.org/docs/agentic-cd/evaluation/platform-ai-evals/"&gt;AI Evals for AI Enablement Platforms&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="terminology"&gt;Terminology&lt;/h2&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Term&lt;/th&gt;
 &lt;th&gt;Definition&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Task&lt;/td&gt;
 &lt;td&gt;A single work item given to the agent (one prompt + one fixture)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Trial&lt;/td&gt;
 &lt;td&gt;One execution of a task; multiple trials measure variance&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Grader&lt;/td&gt;
 &lt;td&gt;An automated check that scores agent output (pass/fail or 0-1)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Transcript&lt;/td&gt;
 &lt;td&gt;The full agent conversation log: tool calls, reasoning, output&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Outcome&lt;/td&gt;
 &lt;td&gt;The agent&amp;rsquo;s final output for a task&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Evaluation harness&lt;/td&gt;
 &lt;td&gt;The framework that runs tasks, collects outcomes, applies graders&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Agent harness&lt;/td&gt;
 &lt;td&gt;The runtime that executes the agent (e.g., Claude Code)&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Evaluation suite&lt;/td&gt;
 &lt;td&gt;A collection of related tasks testing one capability dimension&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;In the &lt;a href="https://github.com/bailejl/dev-plugins"&gt;dev-plugins&lt;/a&gt; reference implementation:&lt;/strong&gt; Promptfoo is the evaluation harness. Claude Code is the &lt;a href="https://beyond.minimumcd.org/docs/reference/glossary/#agent-ai"&gt;agent&lt;/a&gt;
harness. YAML files in &lt;code&gt;evals/&amp;lt;plugin&amp;gt;/suites/&lt;/code&gt; are evaluation suites.&lt;/p&gt;</description></item><item><title>Team AI Evals for Coding Tools</title><link>https://beyond.minimumcd.org/docs/agentic-cd/evaluation/team-ai-evals/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://beyond.minimumcd.org/docs/agentic-cd/evaluation/team-ai-evals/</guid><description>&lt;div class="pageinfo pageinfo-primary"&gt;
&lt;p&gt;If you would notice a regression, it needs an eval. This page covers setting up eval infrastructure, writing your first positive and negative tests, choosing graders, and integrating evals into your &lt;a href="https://beyond.minimumcd.org/docs/reference/glossary/#pipeline"&gt;pipeline&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
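&lt;p&gt;As a starting point, a first positive/negative pair for a single command might look like the sketch below. The command, expected strings, and paths are hypothetical, and the &lt;code&gt;not-contains&lt;/code&gt; and &lt;code&gt;llm-rubric&lt;/code&gt; assertion types are assumed Promptfoo features rather than something lifted from dev-plugins.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# A first positive/negative pair for one plugin command (illustrative values)
tests:
  - description: positive - performs the requested scaffolding
    vars:
      task: Scaffold a health-check endpoint with a passing test
    assert:
      - type: contains
        value: GET /health
  - description: negative - stays within scope
    vars:
      task: Scaffold a health-check endpoint
    assert:
      # The agent should not rewrite unrelated CI configuration
      - type: not-contains
        value: ".github/workflows"
      - type: llm-rubric
        value: The agent limits its changes to the endpoint and its test.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The negative case is what catches scope creep that a positive check alone would never notice.&lt;/p&gt;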
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Reference implementation:&lt;/strong&gt; The &lt;a href="https://github.com/bailejl/dev-plugins"&gt;dev-plugins&lt;/a&gt; repository demonstrates these patterns with Promptfoo, Claude Code, and custom graders.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="what-needs-evals"&gt;What Needs Evals&lt;/h2&gt;
&lt;p&gt;Not every AI interaction needs an eval. Use this heuristic: &lt;strong&gt;if you would notice a
regression, it needs an eval.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Artifacts that need evals:&lt;/p&gt;</description></item><item><title>AI Evals for AI Enablement Platforms</title><link>https://beyond.minimumcd.org/docs/agentic-cd/evaluation/platform-ai-evals/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://beyond.minimumcd.org/docs/agentic-cd/evaluation/platform-ai-evals/</guid><description>&lt;div class="pageinfo pageinfo-primary"&gt;
&lt;p&gt;Platform teams build reusable AI coding tools for multiple teams. Shared eval infrastructure (base configs, grader libraries, rubric templates) eliminates duplication and enforces consistency across the plugin portfolio.&lt;/p&gt;
&lt;/div&gt;
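&lt;p&gt;In practice the shared layer can be as small as one base config that every plugin suite layers its own tests on top of. The sketch below is illustrative only: the file names and grader behavior are hypothetical, and it assumes Promptfoo&amp;rsquo;s &lt;code&gt;defaultTest&lt;/code&gt; key and &lt;code&gt;file://&lt;/code&gt; grader references behave as expected.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# shared/base-graders.yaml - defaults every plugin suite builds on (hypothetical layout)
defaultTest:
  assert:
    # Portfolio-wide deterministic check from the shared grader library
    - type: python
      value: "file://shared/graders/no_secrets_in_output.py"
    # Portfolio-wide rubric template, reused verbatim by every suite
    - type: llm-rubric
      value: The change stays within the scope of the requested task.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each plugin suite then adds only the tests specific to its own capability, while grader code and rubric wording change in one place.&lt;/p&gt;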
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Reference implementation:&lt;/strong&gt; The &lt;a href="https://github.com/bailejl/dev-plugins"&gt;dev-plugins&lt;/a&gt; repository demonstrates these patterns with Promptfoo, Claude Code, and custom graders.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="what-is-an-ai-enablement-platform"&gt;What is an AI Enablement Platform&lt;/h2&gt;
&lt;p&gt;An AI enablement platform is the team that builds reusable AI coding tools (&lt;a href="https://beyond.minimumcd.org/docs/reference/glossary/#prompt"&gt;prompts&lt;/a&gt;,
&lt;a href="https://beyond.minimumcd.org/docs/reference/glossary/#agent-ai"&gt;agents&lt;/a&gt;, plugins, skills) for multiple teams in an organization. Instead of every team
writing their own code review agent or scaffolding command, the platform team builds
these once and distributes them.&lt;/p&gt;</description></item></channel></rss>