Evaluation¶
How to know whether the system you've built is actually safe and useful, both before deployment and continuously after.
- Methodology: the eval stack of rubrics, non-inferiority testing, adversarial sets, and shadow mode.
- Bias & subgroup evaluation: aggregate metrics hide subgroup harm; here's how to surface it.
- Ongoing monitoring & drift: what to watch for after launch.
- Red-teaming: adversarial testing as a discipline, not a one-off.
- Minimum criteria and rollback plans: the thresholds you commit to before launch, and the kill switch you commit to before patients depend on the system.
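The subgroup point above is easy to demonstrate concretely. A minimal sketch (the `subgroup_metrics` helper, the schema, and the toy numbers are all illustrative, not from this text) of how a strong aggregate metric can coexist with poor performance in a small subgroup:

```python
from collections import defaultdict

def subgroup_metrics(records, metric):
    """Compute a metric overall and broken out per subgroup.

    records: list of (subgroup, y_true, y_pred) tuples (hypothetical schema).
    metric:  function mapping a list of (y_true, y_pred) pairs to a float.
    """
    by_group = defaultdict(list)
    for group, y_true, y_pred in records:
        by_group[group].append((y_true, y_pred))
    overall = metric([(t, p) for _, t, p in records])
    return overall, {g: metric(pairs) for g, pairs in by_group.items()}

def accuracy(pairs):
    return sum(t == p for t, p in pairs) / len(pairs)

# Toy data: subgroup A is large and easy, subgroup B is small and hard.
records = (
    [("A", 1, 1)] * 90                      # A: 90/90 correct
    + [("B", 1, 1)] * 5 + [("B", 1, 0)] * 5  # B: 5/10 correct
)

overall, per_group = subgroup_metrics(records, accuracy)
print(overall)           # 0.95  -- looks fine in aggregate
print(per_group["B"])    # 0.5   -- coin-flip performance for subgroup B
```

The aggregate 95% accuracy gives no hint that one subgroup sits at 50%; reporting the per-group breakdown (and the size of each group) is the minimum needed to surface it.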