What "training the model" actually means¶
When leadership says "we'll have our clinicians train the model," that phrase covers a wide range of approaches with very different cost, complexity, and effect. Being precise is worth real money and real safety.
The spectrum, from most to least feasible in-house¶
| Approach | What it is | Affects model weights? | Realistic in-house? |
|---|---|---|---|
| Prompt engineering & version control | Iterating system prompts, instructions, and few-shot examples | No | Yes, start here |
| Few-shot example curation | Curated exemplar inputs/outputs inserted into prompts | No | Yes |
| RAG corpus curation | Clinicians select and version-control which protocols, papers, goal banks feed retrieval | No | Yes, high leverage |
| Output preference collection | Clinicians rate, rank, or correct outputs; data feeds future iteration | No (initially) | Yes |
| DPO / RLHF on collected preferences | Adjusts model weights based on rater preferences | Yes | Possible but expensive; limited turnkey availability for frontier models |
| Supervised fine-tuning (SFT) | Train on (input, ideal-output) pairs | Yes | Available for some models; significant data-curation cost |
| Full foundation-model fine-tune | Continued pretraining on domain data | Yes | Not realistic for most clinical orgs |
Where the value actually comes from¶
For most teams, the top three rows account for ~80% of the practical value:
- Prompt engineering and version control. Most clinical-quality issues are prompt issues. A well-iterated prompt outperforms a poorly fine-tuned model on most tasks.
- Few-shot examples. A handful of well-chosen exemplars can shape output more than weeks of fine-tuning data collection.
- RAG corpus curation. What the system retrieves shapes what it says. Clinical leadership owning the corpus is high-leverage and underrated.
When fine-tuning is actually warranted¶
Consider it when:
- You have a specific failure mode that prompt engineering and RAG can't fix (rare, in clinical generation).
- You have access to high-quality (input, output) pairs at scale, with clinical sign-off.
- You can afford the eval cost of a tuned model. Eval becomes harder, not easier, after fine-tuning, because you've changed the base distribution.
- You can manage the lifecycle: tuned models are tied to a base version and need re-tuning when the base changes.
If you're unsure, you don't need it yet.
Collecting preference data without fine-tuning¶
You can, and should, start collecting structured clinician feedback even if you have no near-term plans to fine-tune. Reasons:
- It informs prompt and corpus iteration.
- It surfaces patterns of edit and disagreement.
- It builds the dataset that might be used for future tuning.
- It demonstrates clinician engagement, which matters for governance and adoption.
What to capture:
- The original draft and the clinician's edited version (diff).
- Optional reason codes (factually wrong, missing element, wrong tone, prohibited language, etc.).
- Free-text rationale for nontrivial edits.
- Sign-off identity and timestamp.
Be honest with leadership¶
If a leader says "our clinicians are training the model," ask: which row in the table above? If the honest answer is "row 1" (prompt engineering), say so. The capability is real and valuable; the framing matters because it sets expectations for cost, timeline, and what kind of company you're building.
Be honest with payers and patients¶
When describing the system in payer materials, marketing, or patient-facing language, use language that matches what's actually happening. "Clinician-curated AI" is honest. "Trained by clinicians" usually isn't, unless you've done supervised fine-tuning or RLHF.