
Hypothesis - Defensibility in Clinical AI


Overview

Most of the early healthcare AI investment dollars have gone to ambient scribes and back-end automation. Clinical AI, which supports and automates the patient-provider interaction itself, is now undergoing a similar wave.

Based on the early startups we've seen in the space, defensible value in clinical AI is still ultimately a data access question, just one that differs by task and by the specific question being asked (table below). At lower levels, the "data" is clinical workflow knowledge (what to retrieve, when, in what context). At upper levels, it's longitudinal patient records and other patient data (CT scans, radiology reports, 'omics, etc.).

[Table: defensibility levels with representative systems — OpenEvidence; HealthBench; MedPI; AMIE; Jiang; EHRSHOT; Epic CoMET; SMB-Structure; NEP]

These levels aren't a difficulty gradient per se, but categorically different problems requiring different data and different approaches. At Level 2, an agentic RAG system beat raw GPT-5 on HealthBench's hardest clinical cases by engineering better retrieval, not by using a better model (OpenAI’s HealthBench in Action). When a triage study forced ChatGPT into single-shot answers, it looked dangerous. After OpenAI re-ran the same dataset with multi-turn conversation, the model asked clarifying questions in 80%+ of cases (Singhal, Mar 2026). The interaction design mattered more than the model. At Level 5, it's a different story. Both Epic's autoregressive transformer and SMB's JEPA architecture achieve strong performance on trajectory prediction. What matters isn't the architecture but who controls the structured longitudinal data.
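To make the Level 2 point concrete, here's a minimal sketch of the agentic RAG pattern: the system plans what to retrieve before the model answers, which is where the engineering win over a raw model comes from. The corpus, function names, and dispatch logic below are illustrative assumptions, not the DR. INFO or HealthBench implementation.

```python
# Toy clinical corpus keyed by topic; in production this would be a
# vector store or citation database.
CORPUS = {
    "heparin": "Monitor aPTT; risk of HIT with prolonged use.",
    "warfarin": "Monitor INR; interacts with many antibiotics.",
    "sepsis": "Early broad-spectrum antibiotics and fluid resuscitation.",
}

def plan_queries(question: str) -> list[str]:
    """Step 1: decompose the question into retrieval targets
    (in a real system this planning step is itself an LLM call)."""
    return [term for term in CORPUS if term in question.lower()]

def retrieve(queries: list[str]) -> list[str]:
    """Step 2: pull only the passages the plan asked for."""
    return [CORPUS[q] for q in queries]

def answer(question: str) -> str:
    """Step 3: answer grounded in retrieved context. A raw model
    skips steps 1-2 and answers from parametric memory alone."""
    context = retrieve(plan_queries(question))
    if not context:
        return "Insufficient context; ask a clarifying question."
    return " ".join(context)

print(answer("Anticoagulation with heparin in suspected sepsis?"))
```

The model in this loop is interchangeable; the retrieval plan is not, which is the sense in which workflow knowledge, not the base model, carries the value.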

At lower levels, the winners will be the companies best embedded in clinical workflow. At upper levels, the winners will be whoever controls the institutional data.

What This Means Going Forward

We believe:

Domain expertise and context engineering drive differentiation at lower levels. MedAgentBench showed clinical AI agents jumping from 69.7% to 91-98% accuracy through prompt engineering, tool design, and memory alone (MedAgentBench, 2026). DR. INFO beat GPT-5 on HealthBench Hard the same way (OpenAI's HealthBench in Action). Google's AMIE prospective trial reinforces this dynamic: 90% diagnostic accuracy came not from a better model but from a workflow where the AI gathers history, generates a differential, and hands a summary to the PCP, who verifies rather than discovers (AMIE BIDMC Study, 2026).
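The levers MedAgentBench varied can be sketched as a tiny tool-calling agent: narrow, well-specified tools plus a memory trace that later steps can condition on, with the "model" held fixed. The tool names and harness below are illustrative assumptions, not the benchmark's actual code.

```python
def lookup_lab(patient: dict, name: str) -> float:
    """A narrow, well-specified tool beats a generic EHR query."""
    return patient["labs"][name]

def order_med(patient: dict, med: str) -> str:
    """Write actions back to the record so they are auditable."""
    patient.setdefault("orders", []).append(med)
    return f"ordered {med}"

TOOLS = {"lookup_lab": lookup_lab, "order_med": order_med}

def run_agent(patient: dict, steps: list[tuple[str, str]]) -> list[str]:
    """Execute a plan step by step, appending each observation to
    memory so later steps can condition on earlier results."""
    memory: list[str] = []
    for tool_name, arg in steps:
        result = TOOLS[tool_name](patient, arg)
        memory.append(f"{tool_name}({arg}) -> {result}")
    return memory

patient = {"labs": {"creatinine": 2.1}}
trace = run_agent(patient, [
    ("lookup_lab", "creatinine"),           # observe renal function first
    ("order_med", "renal-dosed vancomycin"),  # then act on the observation
])
print(trace)
```

The accuracy gains the benchmark reports came from exactly these scaffolding choices — tool specification and memory — not from swapping in a stronger model.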

Data representation, not architecture, is the binding constraint at upper levels. Epic's CoMET is a decoder-only autoregressive transformer (architecturally a standard LLM) that reaches 0.925 AUROC on CHF by tokenizing structured medical events (ICD codes, labs, medications) instead of natural language. SMB-Structure (Standard Model Bio) uses JEPA and captures trajectory dynamics that autoregressive baselines miss. The real differentiator isn't autoregressive vs. JEPA but whether you're predicting the next word or the next clinical event. Natural language token prediction breaks at multi-turn clinical reasoning (Level 3); structured medical event prediction works for trajectory modeling (Level 5). Given how quickly architectures change, the better question isn't which one wins, but who controls the structured clinical data and how it gets tokenized.
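The word-vs-event contrast can be shown in a few lines: each structured medical event (diagnosis code, discretized lab, medication) becomes one token in a closed vocabulary, and the model predicts the next event rather than the next word. The vocabulary and bucketing scheme here are illustrative assumptions, not CoMET's actual tokenizer.

```python
from collections import defaultdict

def event_token(event: dict) -> str:
    """Map a structured medical event to a single vocabulary token."""
    if event["type"] == "dx":
        return f"DX_{event['code']}"            # e.g. ICD-10 I50.9 (CHF)
    if event["type"] == "lab":
        bucket = "HIGH" if event["value"] > event["upper"] else "NORMAL"
        return f"LAB_{event['name']}_{bucket}"  # labs are discretized
    return f"MED_{event['name']}"               # medications

# A patient timeline becomes a token sequence: a "sentence" of events.
timeline = [
    {"type": "lab", "name": "BNP", "value": 900, "upper": 100},
    {"type": "dx", "code": "I50.9"},
    {"type": "med", "name": "furosemide"},
]
tokens = [event_token(e) for e in timeline]
print(tokens)  # ['LAB_BNP_HIGH', 'DX_I50.9', 'MED_furosemide']

# A trivially small next-event "model" is just bigram counts; a
# transformer replaces this table with attention over the full history.
bigrams = defaultdict(lambda: defaultdict(int))
for prev, nxt in zip(tokens, tokens[1:]):
    bigrams[prev][nxt] += 1
print(max(bigrams["LAB_BNP_HIGH"], key=bigrams["LAB_BNP_HIGH"].get))
```

Note what the tokenizer requires: access to coded, longitudinal records. The modeling step is commodity; the mapping from institutional data to a clean event vocabulary is where the moat sits.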

Data access matters, but the scale threshold may be lower than assumed. Epic runs on 115B events from 118M patients across 1,760+ hospitals and also owns the workflow (physician views, MyChart, operational dashboards). But you don't need Epic-level scale: SMB's published paper validated trajectory prediction across 23K MSK oncology patients and 19K PE patients with open weights on HuggingFace, and their broader platform trains on 1.2M patients across 15 indications (~100x smaller than Epic's Cosmos dataset).

Today's benchmarks measure what models can do in controlled, verifiable settings, and often evaluate models released 3-6+ months ago. They don't tell a health system whether Model A or Model B will produce better outcomes for its patients. This evaluation gap will start to close, but clinical outcomes are fundamentally hard to verify: you can't rerun a patient's trajectory, results take months to prove out, and treatments change based on the predictions themselves. Independent evaluation faces the same data access problem that creates model defensibility in the first place. Our bet is that evaluation ends up owned by the vertical AI companies that can validate against their own patient populations, not by third-party horizontal governance infrastructure.

Additional Reading

ARISE State of Clinical AI Report (2026)

METR — Model Evaluation & Threat Research

HealthBench: Not yet clinically ready (PMC)

Bessemer State of Health AI 2026

Vals AI MedQA Leaderboard (2026)

NOHARM/MAST Benchmark — Stanford-Harvard ARISE (2026)

Script Concordance Testing — NEJM AI (2026)

Stanford MedHELM (35 benchmarks, 121 clinical tasks)

MedMarks (20 benchmarks, 46 models)
