Best AI passes half the chart, Epic's platform story wobbles, 124 papers built on slop
Stanford's PhysicianBench: The Best AI Completes Less Than Half of Real Clinical Work
Stanford ARISE just dropped PhysicianBench, a benchmark that does something most clinical AI evaluations don't: it tests whether LLM agents can actually do physician work inside a real EHR environment.
Not multiple choice. Not clinical vignettes. The actual job.
One hundred long-horizon tasks across 21 specialties, 670 sub-checkpoints total. Each task requires an average of 27 tool calls: pulling labs, reading prior notes across encounters, reasoning over heterogeneous clinical data, executing orders, writing documentation. All via standard FHIR APIs against real patient records.
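To make "pulling labs via FHIR" concrete, here is a minimal sketch of what one such tool call might look like as a standard FHIR search request. The base URL and patient ID are placeholders, and this is not taken from PhysicianBench's own harness; it only builds the URL and makes no network call.

```python
from urllib.parse import urlencode

# Hypothetical base URL; a real deployment would use its own FHIR endpoint.
FHIR_BASE = "https://fhir.example.org/R4"


def lab_search_url(patient_id: str, count: int = 10) -> str:
    """Build a FHIR search URL for a patient's most recent lab Observations."""
    params = {
        "patient": patient_id,       # restrict to one patient
        "category": "laboratory",    # lab results only
        "_sort": "-date",            # newest first
        "_count": count,             # page size
    }
    return f"{FHIR_BASE}/Observation?{urlencode(params)}"


print(lab_search_url("example-patient-1"))
```

An agent averaging 27 tool calls per task is chaining many requests like this one, each result feeding the next decision.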
The best model (GPT-5.5) completed 46.3% of tasks on first attempt. Open-source models topped out at 19%.
That's not a failure headline. That's the most useful number in clinical AI right now. Because it tells you exactly where the human matters: in the other 54%.
The tasks that tripped up even the best models weren't exotic. They were the bread-and-butter of clinical medicine: synthesizing information across multiple encounters, catching the medication that was discontinued three visits ago and restarted under a different name, knowing that the "normal" potassium in the 7 AM panel doesn't match the critical one drawn at 2 AM that the overnight team already acted on.
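The potassium example can be sketched in a few lines: instead of trusting only the newest result, look back over a window for anything critical. The thresholds, the 24-hour window, and the function name below are illustrative assumptions, not values from the benchmark or from any particular lab.

```python
from datetime import datetime, timedelta

# Illustrative thresholds only (mmol/L); real critical ranges are set by each lab.
CRITICAL_LOW, CRITICAL_HIGH = 2.5, 6.5


def recent_critical(results, window_hours=24):
    """Return critical results within the window before the newest result.

    `results` is a list of (timestamp, value) tuples in any order.
    """
    if not results:
        return []
    newest_time = max(t for t, _ in results)
    cutoff = newest_time - timedelta(hours=window_hours)
    return sorted(
        (t, v)
        for t, v in results
        if t >= cutoff and (v <= CRITICAL_LOW or v >= CRITICAL_HIGH)
    )


panel = [
    (datetime(2025, 1, 1, 2, 0), 6.8),  # 2 AM draw, critical
    (datetime(2025, 1, 1, 7, 0), 4.2),  # 7 AM draw, normal
]
flagged = recent_critical(panel)
print(flagged)  # the 2 AM value is surfaced even though the latest is normal
```

The hard part for an agent is not this filter. It is knowing to look, and then checking whether someone already acted on the flagged value.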
🤔 "46% is terrible. Why would anyone deploy this?" You're reading it backwards. 46% of the rote work (pulling data, cross-referencing records, drafting documentation) is automatable right now. The question isn't whether AI is ready to replace you. It's whether you're building the tools that handle the 46% so you can focus on the 54% that actually requires judgment.
🤔 "Benchmarks don't reflect real practice." This one actually does. FHIR APIs, real patient records, multi-encounter reasoning. It's not USMLE questions. Read the paper.
🤔 "Open-source at 19% means local models are useless." For autonomous EHR agents, yes, today. For focused tasks like summarizing a discharge note, flagging a drug interaction, or generating a SOAP template, local models work fine. PhysicianBench measures the hardest version of the problem.
💡 80/20: The 46% number is your pitch deck's most powerful slide. It proves AI has clinical value AND that clinical expertise is irreplaceable. Build for the junction: tools that handle the automatable 46% while surfacing the 54% that needs a human.
Epic's AI Agents Are Real. The Platform Story Isn't.
Adam Carewe, MD, published a critical analysis that separates what's real from what's marketing in Epic's AI agent rollout.
The agents themselves (Art for clinicians, Penny for prior auth, Emmie for patients) are functional. Penny cut prior auth submission time by 42% at Summit Health. 85% of Epic customers are using some form of Epic AI.
But the "platform" narrative, that Agent Factory lets health systems build and orchestrate custom AI agents, doesn't hold up under scrutiny.
The visual builder looks good in demos. The reality is more constrained: you're building within Epic's guardrails, with Epic's data model, on Epic's timeline. For a clinician-builder with a specific workflow problem, the question remains whether to build inside the walled garden or build something portable.
124 Clinical Prediction Papers Were Built on Fake Data
Two Kaggle datasets with zero data provenance, one for stroke prediction and one for diabetes, have been used to train 124 clinical prediction models published in peer-reviewed journals. At least two models built on this data are deployed in hospitals. One was cited in a medical device patent.
Retraction Watch reported the datasets contain images of Sylvester Stallone and Angelina Jolie mixed in with the "clinical" data. The research community generated 1,500 citations from datasets that can't be verified as real.
🤔 "Peer review should have caught this." Peer review doesn't audit datasets. It never has. If your product depends on published ML models, you need your own data provenance checks. That's a feature, not a nice-to-have.
🤔 "This is a Kaggle problem, not a clinical AI problem." It's a supply chain problem. Every model has a data lineage. If you can't trace it back to real patients with real consent, you don't know what you're deploying.
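A provenance check doesn't have to be elaborate to be useful. Here is a minimal sketch of a gate that refuses a dataset unless its manifest names where the data came from and the file matches a recorded checksum. The manifest field names are hypothetical; swap in whatever your own governance policy requires.

```python
import hashlib

# Hypothetical required fields; adapt to your own governance policy.
REQUIRED_FIELDS = ("source_institution", "collection_method", "consent_basis", "license")


def sha256_of(data: bytes) -> str:
    """Hex SHA-256 digest of the raw dataset bytes."""
    return hashlib.sha256(data).hexdigest()


def provenance_gaps(manifest: dict, data: bytes) -> list[str]:
    """List reasons a dataset fails the provenance check (empty list = pass)."""
    gaps = [f"missing field: {f}" for f in REQUIRED_FIELDS if not manifest.get(f)]
    if manifest.get("sha256") != sha256_of(data):
        gaps.append("checksum mismatch or missing")
    return gaps


data = b"age,glucose,outcome\n54,101,0\n"
manifest = {"source_institution": "", "license": "CC0", "sha256": sha256_of(data)}
print(provenance_gaps(manifest, data))
```

A file with a license and a valid checksum but no named source, collection method, or consent basis still fails, and that is roughly the situation described above.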
Oura Files for IPO: First Pure Wearable Since Fitbit
Oura confidentially filed a Form S-1 with the SEC. The company was valued at $11 billion after a $900M Series E in October. On track for $2B in sales this year with nearly 5 million paid members.
This is the first pure-play consumer wearable IPO since Fitbit in 2015. And unlike Fitbit, Oura has been quietly building a healthcare play: 6x engagement lift in Medicare Advantage populations, partnerships with health systems for post-surgical monitoring.
The builder angle: bulk-export your Oura data now, before the API narrows post-IPO. Every wearable company tightens data access as it approaches public markets. If you're building anything on ring data, establish your pipeline today.
🤔 "Consumer wearables aren't clinical tools." They're generating the data that clinical tools will need. The question is whether the data flows into your workflow or stays locked in an app.
💡 80/20: If you're building with wearable data, Oura's FHIR-adjacent APIs are the most builder-friendly in the market right now. Start prototyping before the IPO roadshow changes the access calculus.
The MRI Report Is Not a Diagnosis
Doug Fullington makes the case that the radiology report functions as a "fragmentation engine": technically accurate findings that get interpreted out of context, driving unnecessary referrals and patient anxiety. The report is correct. The care it generates sometimes isn't.
💡 80/20: There's a product in the gap between "what the MRI report says" and "what the patient's care team needs to know." Contextualizing imaging findings within a patient's active problem list is an LLM-shaped problem.
AI in Spine Surgery: Who Shapes the Tools?
Spinal Column's new piece asks the question every surgical subspecialty will face: will surgeons actively build their AI tools or passively adopt whatever vendors ship?
💡 80/20: The answer to "will clinicians shape the tools?" is only yes if clinicians are building. The alternative is vendor-defined AI that optimizes for what's measurable, not what matters.
Cursor Hits $2B ARR: The Coding Agent Shift Is Real
Cursor, the AI coding IDE, surpassed $2B in annualized recurring revenue with over 1 million daily active users. 30% of Cursor's own merged pull requests are now created by background AI agents. The company has raised $3.4B in total.
🎙️ From the Pods
🎙️ NEJM AI Grand Rounds: "The OpenEvidence Episode: Dr. Travis Zack"
OpenEvidence's CMO revealed that their remaining hallucination problem isn't fabricated references; that's mostly solved. The hard part is models confabulating details from papers they have incomplete access to, and reasoning failures when synthesizing across multiple sources. 700K+ US clinicians use it monthly.
🎙️ The 229 Podcast: "The Front Door Is Wide Open"
Attackers aren't breaking in through back doors anymore. They're walking in with compromised credentials. Healthcare is "over-assessed and under-remediated": organizations keep buying security assessment tools without closing the gaps they find.
🎙️ HIMSSCast: "Healthcare Without Borders"
Cross-border health data sharing fails not because of missing standards but because of semantic gaps: the same clinical concept, coded differently. AI's biggest practical contribution right now is reconciling heterogeneous clinical data across terminologies.
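The "same concept, coded differently" problem is easiest to see in code. This is a minimal sketch of the lookup-table half of reconciliation, with made-up site names and local codes. Nothing here is a real terminology, and the AI contribution described in the episode would sit in proposing mappings for the records that fall through.

```python
# Hypothetical local codes from two sites, mapped to one shared concept key.
CONCEPT_MAP = {
    ("site_a", "K"): "serum_potassium",
    ("site_b", "POT-SER"): "serum_potassium",
    ("site_a", "NA"): "serum_sodium",
}


def normalize(records):
    """Split records into mapped (shared concept attached) and unmapped."""
    mapped, unmapped = [], []
    for rec in records:
        concept = CONCEPT_MAP.get((rec["site"], rec["code"]))
        if concept is None:
            unmapped.append(rec)  # needs human or model-assisted review
        else:
            mapped.append({**rec, "concept": concept})
    return mapped, unmapped


records = [
    {"site": "site_a", "code": "K", "value": 4.1},
    {"site": "site_b", "code": "POT-SER", "value": 4.3},
    {"site": "site_b", "code": "XYZ", "value": 1.0},
]
mapped, unmapped = normalize(records)
print(len(mapped), "mapped;", len(unmapped), "left for review")
```

The unmapped pile is where the semantic gap lives: the records are valid, they just don't share a vocabulary yet.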
🧰 Builder's Tip
Weekend Project: Run PhysicianBench on Your Own Specialty
PhysicianBench is open-source on GitHub. The benchmark tasks and evaluation harness are all public. This weekend, do three things:
1. Clone the repo and read 5-10 tasks in your specialty. See what "real clinical EHR work" looks like when it's formalized as an eval.
2. Run the benchmark against a model you have access to (Claude, GPT, or a local model via Ollama). See where it passes and where it fails in your domain.
3. Write 3 tasks of your own: real cases from your last month of practice, formatted as PhysicianBench tasks. What did the AI miss that you caught?
All synthetic/de-identified data. No PHI. No BAA needed. You'll finish Sunday with three things: a working knowledge of how clinical AI evals work, a set of specialty-specific test cases you can bring to your innovation team, and a concrete opinion about where AI helps and where it doesn't in your practice.
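Before you open the repo, it can help to have a mental model of how a checkpoint-based eval scores a task. The sketch below is not PhysicianBench's actual schema or scoring code, which live in its repo. The `Task` class, the checkpoint names, and the rule that a task counts as complete only when every checkpoint passes are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """Hypothetical task record: a name plus pass/fail per checkpoint."""
    name: str
    checkpoints: dict[str, bool] = field(default_factory=dict)

    @property
    def complete(self) -> bool:
        # Assumption: a task is complete only if every checkpoint passes.
        return bool(self.checkpoints) and all(self.checkpoints.values())


def summarize(tasks):
    """Return task-level and checkpoint-level pass rates."""
    total_cp = sum(len(t.checkpoints) for t in tasks)
    passed_cp = sum(sum(t.checkpoints.values()) for t in tasks)
    done = sum(t.complete for t in tasks)
    return {
        "task_rate": done / len(tasks) if tasks else 0.0,
        "checkpoint_rate": passed_cp / total_cp if total_cp else 0.0,
    }


tasks = [
    Task("reconcile_meds", {"found_restart": True, "updated_list": True}),
    Task("overnight_labs", {"found_critical": True, "noted_prior_action": False}),
]
print(summarize(tasks))
```

Under this assumption a model can pass most checkpoints and still fail the task, which is one way a headline completion rate can look harsher than the underlying step-by-step performance.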
What are you building this week? Reply and tell me. I read every one.
- Kevin


