commands/eval.md

# Eval Command

Manage eval-driven development workflow.

## Usage

`/eval [define|check|report|list] [feature-name]`

## Define Evals

`/eval define feature-name`

Create a new eval definition:

1. Create `.claude/evals/feature-name.md` with template:

```markdown
## EVAL: feature-name
Created: $(date)

### Capability Evals
- [ ] [Description of capability 1]
- [ ] [Description of capability 2]

### Regression Evals
- [ ] [Existing behavior 1 still works]
- [ ] [Existing behavior 2 still works]

### Success Criteria
- pass@3 > 90% for capability evals
- pass^3 = 100% for regression evals
```

2. Prompt user to fill in specific criteria

## Check Evals

`/eval check feature-name`

Run evals for a feature:

1. Read eval definition from `.claude/evals/feature-name.md`
2. For each capability eval:
   - Attempt to verify criterion
   - Record PASS/FAIL
   - Log attempt in `.claude/evals/feature-name.log`
3. For each regression eval:
   - Run relevant tests
   - Compare against baseline
   - Record PASS/FAIL
4. Report current status:

```
EVAL CHECK: feature-name
========================
Capability: X/Y passing
Regression: X/Y passing
Status: IN PROGRESS / READY
```

## Report Evals

`/eval report feature-name`

Generate comprehensive eval report:

```
EVAL REPORT: feature-name
=========================
Generated: $(date)

CAPABILITY EVALS
----------------
[eval-1]: PASS (pass@1)
[eval-2]: PASS (pass@2) - required retry
[eval-3]: FAIL - see notes

REGRESSION EVALS
----------------
[test-1]: PASS
[test-2]: PASS
[test-3]: PASS

METRICS
-------
Capability pass@1: 67%
Capability pass@3: 100%
Regression pass^3: 100%

NOTES
-----
[Any issues, edge cases, or observations]

RECOMMENDATION
--------------
[SHIP / NEEDS WORK / BLOCKED]
```

## List Evals

`/eval list`

Show all eval definitions:

```
EVAL DEFINITIONS
================
feature-auth      [3/5 passing] IN PROGRESS
feature-search    [5/5 passing] READY
feature-export    [0/4 passing] NOT STARTED
```

## Arguments

$ARGUMENTS:
- `define <name>` - Create new eval definition
- `check <name>` - Run and check evals
- `report <name>` - Generate full report
- `list` - Show all evals
- `clean` - Remove old eval logs (keeps last 10 runs)
feat: package as Claude Code plugin with marketplace distribution - Add .claude-plugin/plugin.json manifest for direct installation - Add .claude-plugin/marketplace.json for marketplace distribution - Reorganize skills to proper skill-name/SKILL.md format - Update hooks.json with relative paths for portability - Add new skills: continuous-learning, strategic-compact, eval-harness, verification-loop - Add new commands: checkpoint, eval, orchestrate, verify - Update README with plugin installation instructions Install via: /plugin marketplace add affaan-m/everything-claude-code /plugin install everything-claude-code@everything-claude-code 2026-01-22 04:16:39 -08:00			`# Eval Command`

			`Manage eval-driven development workflow.`

			`## Usage`

			`/eval [define\|check\|report\|list] [feature-name]`

			`## Define Evals`

			`/eval define feature-name`

			`Create a new eval definition:`

			1. Create `.claude/evals/feature-name.md` with template:

			```markdown
			`## EVAL: feature-name`
			`Created: $(date)`

			`### Capability Evals`
			`- [ ] [Description of capability 1]`
			`- [ ] [Description of capability 2]`

			`### Regression Evals`
			`- [ ] [Existing behavior 1 still works]`
			`- [ ] [Existing behavior 2 still works]`

			`### Success Criteria`
			`- pass@3 > 90% for capability evals`
			`- pass^3 = 100% for regression evals`
			```

			`2. Prompt user to fill in specific criteria`

			`## Check Evals`

			`/eval check feature-name`

			`Run evals for a feature:`

			1. Read eval definition from `.claude/evals/feature-name.md`
			`2. For each capability eval:`
			`- Attempt to verify criterion`
			`- Record PASS/FAIL`
			- Log attempt in `.claude/evals/feature-name.log`
			`3. For each regression eval:`
			`- Run relevant tests`
			`- Compare against baseline`
			`- Record PASS/FAIL`
			`4. Report current status:`

			```
			`EVAL CHECK: feature-name`
			`========================`
			`Capability: X/Y passing`
			`Regression: X/Y passing`
			`Status: IN PROGRESS / READY`
			```

			`## Report Evals`

			`/eval report feature-name`

			`Generate comprehensive eval report:`

			```
			`EVAL REPORT: feature-name`
			`=========================`
			`Generated: $(date)`

			`CAPABILITY EVALS`
			`----------------`
			`[eval-1]: PASS (pass@1)`
			`[eval-2]: PASS (pass@2) - required retry`
			`[eval-3]: FAIL - see notes`

			`REGRESSION EVALS`
			`----------------`
			`[test-1]: PASS`
			`[test-2]: PASS`
			`[test-3]: PASS`

			`METRICS`
			`-------`
			`Capability pass@1: 67%`
			`Capability pass@3: 100%`
			`Regression pass^3: 100%`

			`NOTES`
			`-----`
			`[Any issues, edge cases, or observations]`

			`RECOMMENDATION`
			`--------------`
			`[SHIP / NEEDS WORK / BLOCKED]`
			```

			`## List Evals`

			`/eval list`

			`Show all eval definitions:`

			```
			`EVAL DEFINITIONS`
			`================`
			`feature-auth [3/5 passing] IN PROGRESS`
			`feature-search [5/5 passing] READY`
			`feature-export [0/4 passing] NOT STARTED`
			```

			`## Arguments`

			`$ARGUMENTS:`
			- `define <name>` - Create new eval definition
			- `check <name>` - Run and check evals
			- `report <name>` - Generate full report
			- `list` - Show all evals
			- `clean` - Remove old eval logs (keeps last 10 runs)