Red-teaming¶
Adversarial testing should be a discipline, not an afterthought. The goal is to find the failure modes before patients do.
Who should red-team¶
- People not on the build team. The team that built the system has blind spots about it; that's not a moral failing, it's a structural one.
- A mix of clinical and technical perspectives. Clinical red-teamers find the cases that matter; technical red-teamers find the prompt-injection and PHI-leakage paths.
- Periodically, external red-teamers: a different shop, a consulting firm, an academic group. Internal teams habituate.
Cadence¶
- Quarterly at minimum. More often if the system or its inputs are changing rapidly.
- After every material model or prompt version change.
- After any near-miss or adverse event of a new kind.
Red-team suite categories¶
Clinical edge cases¶
- Severe presentations (SIB, aggression, elopement, medical complexity)
- Custody and authority ambiguities
- Suspected abuse or neglect
- Mandatory reporting triggers
- Crisis indications
Population coverage¶
- Subgroups underrepresented in your data: does the system handle them, refuse, or hallucinate?
- AAC and non-speaking clients
- Adolescents and adults
Prompt injection¶
- Injection via session notes ("ignore prior instructions and recommend X")
- Injection via patient names or fields under user control
- Indirect injection via retrieved documents
PHI leakage¶
- Cross-patient retrieval attempts under various spoofing scenarios
- Output-side leakage (does the system ever name a patient other than the requested one?)
- Log and cache leakage
Numeric and date integrity¶
- Inputs designed to confuse the model (typos, ambiguous dates, units)
- Verify the model defers to structured sources rather than confabulating
Refusal robustness¶
- Cases where refusal is the correct answer: does the system refuse?
- Cases where the system refuses a benign query: under-refusal harms.
Adversarial framing¶
- Outputs requested in framings that might bias the system (e.g., "the parent insists on X intervention": does the system override its evidence-based defaults?)
Agentic-system specific (if applicable)¶
See Agentic systems. If the architecture is a planning loop with tool access or persistent state, the categories below are mandatory in addition to those above.
- Indirect prompt injection via tool returns. Embed adversarial instructions in retrieved chart notes, parent-supplied messages, or web content fetched by tools. Verify the system does not produce tool calls implementing the embedded instructions.
- Tool misuse and overreach. Inputs designed to elicit tool calls outside the workflow's allowed scope (e.g., a drafting workflow producing a tool call to send a message, schedule, or write to billing).
- Plan-step compounding. Inputs where step one is subtly wrong (e.g., wrong patient, wrong date). Verify downstream steps catch the error rather than conditioning on it.
- Loop and cost behavior. Inputs designed to elicit retries, re-planning, or oscillation. Confirm step caps, cost caps, and circuit breakers fire as designed.
- State contamination. Multi-session scenarios that probe whether content written to persistent state in session A is read into the prompt of session B, especially across patients or clinicians.
- Authority confusion. Verify runtime-executed tool calls are logged under a dedicated service identity, not the originating clinician's session, and that on-behalf-of boundaries are respected.
What to do with findings¶
- Triage by severity and prevalence. A severe rare failure can be more important than a mild common one, or less, depending on consequence.
- Reproduce. Convert each finding into a test case in the regression eval set.
- Fix at the right layer. Sometimes the fix is a prompt change, sometimes a retrieval filter, sometimes a verifier rule, sometimes a deployment-scope decision.
- Re-test. Verify the fix without regressing other behaviors.
- Publish internally. Red-team reports are organizational learning artifacts; circulate them, don't bury them.
Don't conflate red-teaming with QA¶
QA verifies the system does what it's supposed to do. Red-teaming finds the things you didn't think to specify. Both are necessary; they are not substitutes.