Evaluation¶
How to know whether the system you've built is actually safe and useful, both before deployment and continuously after.
- Methodology: the eval stack of rubrics, non-inferiority testing, adversarial sets, and shadow mode.
- Bias & subgroup evaluation: aggregate metrics hide subgroup harm; here's how to surface it.
- Ongoing monitoring & drift: what to watch for after launch.
- Red-teaming: adversarial testing as a discipline, not a one-off.
- Minimum criteria and rollback plans: the thresholds you commit to before launch, and the kill switch you commit to before patients depend on the system.
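The subgroup point above is easy to demonstrate concretely. A minimal sketch (the `subgroup_metrics` helper, the schema, and the toy numbers are all illustrative, not from this text) of how a strong aggregate metric can coexist with poor performance in a small subgroup:

```python
from collections import defaultdict

def subgroup_metrics(records, metric):
    """Compute a metric overall and broken out per subgroup.

    records: list of (subgroup, y_true, y_pred) tuples (hypothetical schema).
    metric:  function mapping a list of (y_true, y_pred) pairs to a float.
    """
    by_group = defaultdict(list)
    for group, y_true, y_pred in records:
        by_group[group].append((y_true, y_pred))
    overall = metric([(t, p) for _, t, p in records])
    return overall, {g: metric(pairs) for g, pairs in by_group.items()}

def accuracy(pairs):
    return sum(t == p for t, p in pairs) / len(pairs)

# Toy data: subgroup A is large and easy, subgroup B is small and hard.
records = (
    [("A", 1, 1)] * 90                      # A: 90/90 correct
    + [("B", 1, 1)] * 5 + [("B", 1, 0)] * 5  # B: 5/10 correct
)

overall, per_group = subgroup_metrics(records, accuracy)
print(overall)           # 0.95  -- looks fine in aggregate
print(per_group["B"])    # 0.5   -- coin-flip performance for subgroup B
```

The aggregate 95% accuracy gives no hint that one subgroup sits at 50%; reporting the per-group breakdown (and the size of each group) is the minimum needed to surface it.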