Best AI passes less than half the chart 🧪, Epic's platform story wobbles 🏭, 124 papers built on slop 🗑️

May 23, 2026

Stanford’s PhysicianBench: The Best AI Completes Less Than Half of Real Clinical Work

Stanford ARISE just dropped PhysicianBench, a benchmark that does something most clinical AI evaluations don’t: it tests whether LLM agents can actually do physician work inside a real EHR environment.

Not multiple choice. Not clinical vignettes. The actual job.

One hundred long-horizon tasks across 21 specialties — 670 sub-checkpoints total. Each task requires an average of 27 tool calls: pulling labs, reading prior notes across encounters, reasoning over heterogeneous clinical data, executing orders, writing documentation. All via standard FHIR APIs against real patient records.
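For a sense of what one of those tool calls looks like, here is a minimal sketch of a standard FHIR search for a patient's lab results. The base URL and patient ID are made-up placeholders, and `lab_search_url` is a hypothetical helper, not something taken from PhysicianBench. Only the search parameters (`patient`, `category`, `date`, `_sort`, `_count`) come from the FHIR search spec.

```python
from urllib.parse import urlencode


def lab_search_url(base_url: str, patient_id: str, since: str, count: int = 50) -> str:
    """Build a FHIR search URL for a patient's lab Observations, newest first.

    `since` is an ISO date (YYYY-MM-DD); the "ge" prefix means "on or after".
    """
    params = {
        "patient": patient_id,
        "category": "laboratory",
        "date": f"ge{since}",
        "_sort": "-date",  # leading "-" requests descending order
        "_count": str(count),
    }
    return f"{base_url.rstrip('/')}/Observation?{urlencode(params)}"


# Placeholder server and patient ID, for illustration only.
url = lab_search_url("https://fhir.example.org/r4", "example-patient-1", "2026-01-01")
print(url)
```

An agent would send a GET to that URL and read the Bundle that comes back. That is one call out of the roughly 27 an average task needs.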

The best model (GPT-5.5) completed 46.3% of tasks on the first attempt. Open-source models topped out at 19%.

That’s not a failure headline. That’s the most useful number in clinical AI right now. Because it tells you exactly where the human matters — in the other 54%.

The tasks that tripped up even the best models weren’t exotic. They were the bread-and-butter of clinical medicine: synthesizing information across multiple encounters, catching the medication that was discontinued three visits ago and restarted under a different name, knowing that the “normal” potassium in the 7 AM panel doesn’t match the critical one drawn at 2 AM that the overnight team already acted on.
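The potassium example is easy to state as a rule, which is part of why it stings when a model misses it. A minimal sketch, assuming a simple list of timestamped results: before trusting the latest value, look back for a critical one in the preceding 24 hours. The thresholds, the window, and the sample values are illustrative assumptions, not clinical guidance and not part of the benchmark.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Lab:
    drawn_at: datetime
    value: float  # serum potassium, mmol/L


# Illustrative thresholds only; real critical limits are set by each lab.
CRITICAL_LOW, CRITICAL_HIGH = 2.5, 6.0


def is_critical(lab: Lab) -> bool:
    return lab.value < CRITICAL_LOW or lab.value > CRITICAL_HIGH


def recent_critical_before_latest(
    labs: list[Lab], window: timedelta = timedelta(hours=24)
) -> list[Lab]:
    """Return critical results drawn within `window` before the most recent result."""
    if not labs:
        return []
    ordered = sorted(labs, key=lambda lab: lab.drawn_at)
    latest = ordered[-1]
    return [
        lab
        for lab in ordered[:-1]
        if is_critical(lab) and latest.drawn_at - lab.drawn_at <= window
    ]


# Made-up values mirroring the scenario above.
labs = [
    Lab(datetime(2026, 5, 22, 2, 0), 6.4),  # 2 AM draw, critical
    Lab(datetime(2026, 5, 22, 7, 0), 4.3),  # 7 AM panel, in range
]
flagged = recent_critical_before_latest(labs)
for lab in flagged:
    print(f"Earlier critical potassium {lab.value} mmol/L at {lab.drawn_at:%H:%M}")
```

The rule is trivial once the right results are side by side. The hard part for an agent is realizing it should pull the overnight result at all.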

😤 “46% is terrible. Why would anyone deploy this?” You’re reading it backwards. The best model already completes about 46% of these tasks end to end, and much of that is rote work: pulling data, cross-referencing records, drafting documentation. The question isn’t whether AI is ready to replace you. It’s whether you’re building the tools that handle the 46% so you can focus on the 54% that actually requires judgment.

😤 “Benchmarks don’t reflect real practice.” This one actually does. FHIR APIs, real patient records, multi-encounter reasoning. It’s not USMLE questions. Read the paper.

😤 “Open-source at 19% means local models are useless.” For autonomous EHR agents, yes, today. For focused tasks — summarizing a discharge note, flagging a drug interaction, generating a SOAP template — local models work fine. PhysicianBench measures the hardest version of the problem.

💡 80/20: The 46% number is your pitch deck’s most powerful slide. It proves AI has clinical value AND that clinical expertise is irreplaceable. Build for the junction — tools that handle the automatable 46% while surfacing the 54% that needs a human.

Epic’s AI Agents Are Real. The Platform Story Isn’t.

Adam Carewe, MD, published a critical analysis that separates what’s real from what’s marketing in Epic’s AI agent rollout.

The agents themselves — Art for clinicians, Penny for prior auth, Emmie for patients — are functional. Penny cut prior auth submission time by 42% at Summit Health. 85% of Epic customers are using some form of Epic AI.

But the “platform” narrative — that Agent Factory lets health systems build and orchestrate custom AI agents — doesn’t hold up under scrutiny.

The visual builder looks good in demos. The reality is more constrained: you’re building within Epic’s guardrails, with Epic’s data model, on Epic’s timeline. For a clinician-builder with a specific workflow problem, the question remains whether to build inside the walled garden or build something portable.

124 Clinical Prediction Papers Were Built on Fake Data

Two Kaggle datasets with zero data provenance — one for stroke prediction, one for diabetes — have been used to train 124 clinical prediction models published in peer-reviewed journals. At least two models built on this data are deployed in hospitals. One was cited in a medical device patent.

Retraction Watch reported the datasets contain images of Sylvester Stallone and Angelina Jolie mixed in with the “clinical” data. The research community generated 1,500 citations from datasets that can’t be verified as real.

😤 “Peer review should have caught this.” Peer review doesn’t audit datasets. It never has. If your product depends on published ML models, you need your own data provenance checks. That’s a feature, not a nice-to-have.

😤 “This is a Kaggle problem, not a clinical AI problem.” It’s a supply chain problem. Every model has a data lineage. If you can’t trace it back to real patients with real consent, you don’t know what you’re deploying.
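A provenance check does not have to be elaborate to be useful. Here is a minimal sketch, assuming you keep a small metadata record for every dataset behind a model you depend on. The `DatasetCard` fields and the required list are hypothetical choices for illustration, not an established standard.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DatasetCard:
    """Hypothetical metadata record for a training dataset."""
    name: str
    source_institution: Optional[str] = None
    collection_period: Optional[str] = None
    consent_or_irb_reference: Optional[str] = None
    license: Optional[str] = None


# Fields that must be filled in before a dataset is considered traceable.
REQUIRED_FIELDS = (
    "source_institution",
    "collection_period",
    "consent_or_irb_reference",
    "license",
)


def missing_provenance(card: DatasetCard) -> list[str]:
    """Return the names of required provenance fields that are empty."""
    return [name for name in REQUIRED_FIELDS if not getattr(card, name)]


# Made-up example: a public dataset that ships with a license and nothing else.
card = DatasetCard(name="example-stroke-dataset", license="CC0")
gaps = missing_provenance(card)
if gaps:
    print(f"{card.name}: missing {', '.join(gaps)}")
```

Anything that comes back with gaps gets a human review before a model trained on it goes near a workflow.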

Oura Files for IPO — First Pure Wearable Since Fitbit

Oura confidentially filed a Form S-1 with the SEC. The company was valued at $11 billion after a $900M Series E in October. It is on track for $2B in sales this year, with nearly 5 million paid members.

This is the first pure-play consumer wearable IPO since Fitbit in 2015. And unlike Fitbit, Oura has been quietly building a healthcare play — 6x engagement lift in Medicare Advantage populations, partnerships with health systems for post-surgical monitoring.

The builder angle: bulk-export your Oura data now, in case the API narrows post-IPO. Wearable companies tend to tighten data access as they approach public markets. If you’re building anything on ring data, establish your pipeline today.

😤 “Consumer wearables aren’t clinical tools.” They’re generating the data that clinical tools will need. The question is whether the data flows into your workflow or stays locked in an app.

💡 80/20: If you’re building with wearable data, Oura’s FHIR-adjacent APIs are the most builder-friendly in the market right now. Start prototyping before the IPO roadshow changes the access calculus.

The MRI Report Is Not a Diagnosis

Doug Fullington makes the case that the radiology report functions as a “fragmentation engine” — technically accurate findings that get interpreted out of context, driving unnecessary referrals and patient anxiety. The report is correct. The care it generates sometimes isn’t.

💡 80/20: There’s a product in the gap between “what the MRI report says” and “what the patient’s care team needs to know.” Contextualizing imaging findings within a patient’s active problem list is an LLM-shaped problem.

AI in Spine Surgery: Who Shapes the Tools?

Spinal Column’s new piece asks the question every surgical subspecialty will face: will surgeons actively build their AI tools or passively adopt whatever vendors ship?

💡 80/20: The answer to “will clinicians shape the tools?” is only yes if clinicians are building. The alternative is vendor-defined AI that optimizes for what’s measurable, not what matters.

Cursor Hits $2B ARR — The Coding Agent Shift Is Real

Cursor, the AI coding IDE, surpassed $2B in annualized recurring revenue with over 1 million daily active users. 30% of Cursor’s own merged pull requests are now created by background AI agents. The company has raised $3.4B in total.

🎙️ From the Pods

🎙️ NEJM AI Grand Rounds — “The OpenEvidence Episode: Dr. Travis Zack”

OpenEvidence’s CMO revealed that their remaining hallucination problem isn’t fabricated references — that’s mostly solved. The hard part is models confabulating details from papers they have incomplete access to, and reasoning failures when synthesizing across multiple sources. 700K+ US clinicians use it monthly.

🎙️ The 229 Podcast — “The Front Door Is Wide Open”

Attackers aren’t breaking in through back doors anymore. They’re walking in with compromised credentials. Healthcare is “over-assessed and under-remediated” — organizations keep buying security assessment tools without closing the gaps they find.

🎙️ HIMSSCast — “Healthcare Without Borders”

Cross-border health data sharing fails not because of missing standards but because of semantic gaps — same clinical concept, coded differently. AI’s biggest practical contribution right now is reconciling heterogeneous clinical data across terminologies.

🧰 Builder’s Tip

Weekend Project: Run PhysicianBench on Your Own Specialty

PhysicianBench is open-source on GitHub. The benchmark tasks and evaluation harness are all public. This weekend, do three things:

1. Clone the repo and read 5-10 tasks in your specialty. See what “real clinical EHR work” looks like when it’s formalized as an eval.
2. Run the benchmark against a model you have access to (Claude, GPT, or a local model via Ollama). See where it passes and where it fails in your domain.
3. Write 3 tasks of your own: real cases from your last month of practice, de-identified and formatted as PhysicianBench tasks. What did the AI miss that you caught?

All synthetic/de-identified data. No PHI. No BAA needed. You’ll finish Sunday with three things: a working knowledge of how clinical AI evals work, a set of specialty-specific test cases you can bring to your innovation team, and a concrete opinion about where AI helps and where it doesn’t in your practice.
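If you want a starting shape for those three tasks before you have read the repo, here is a rough sketch of a task as a prompt plus a list of checkpoints. The field names and the rule that a task only counts as completed when every checkpoint passes are assumptions made for illustration, not PhysicianBench's actual schema or scoring. Check the repo for the real format.

```python
from dataclasses import dataclass, field


@dataclass
class Checkpoint:
    description: str
    passed: bool = False


@dataclass
class Task:
    """Hypothetical task shape; not the benchmark's real schema."""
    specialty: str
    prompt: str
    checkpoints: list[Checkpoint] = field(default_factory=list)

    def completed(self) -> bool:
        # Assumption: a task is complete only if every checkpoint passes.
        return bool(self.checkpoints) and all(c.passed for c in self.checkpoints)


def summarize(tasks: list[Task]) -> dict[str, float]:
    """Compute task-level and checkpoint-level pass rates."""
    total_checkpoints = sum(len(t.checkpoints) for t in tasks)
    passed_checkpoints = sum(c.passed for t in tasks for c in t.checkpoints)
    return {
        "task_completion_rate": (
            sum(t.completed() for t in tasks) / len(tasks) if tasks else 0.0
        ),
        "checkpoint_pass_rate": (
            passed_checkpoints / total_checkpoints if total_checkpoints else 0.0
        ),
    }


# Made-up results for two illustrative tasks.
tasks = [
    Task("nephrology", "Review overnight labs and adjust orders.", [
        Checkpoint("Retrieved all potassium results", True),
        Checkpoint("Noted the earlier critical value", True),
        Checkpoint("Documented the plan", True),
    ]),
    Task("cardiology", "Reconcile medications across encounters.", [
        Checkpoint("Listed active medications", True),
        Checkpoint("Caught the restarted drug", False),
    ]),
]
print(summarize(tasks))
```

Tracking both rates is useful. A model can pass most checkpoints and still fail the task on the one step that mattered.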

What are you building this week? Reply and tell me — I read every one.

— Kevin

clinicians.build
