AI's real bug is the data 🧱, OpenEvidence under the hood 🔍, NVIDIA opens the floodgates? 🔓

Jun 02, 2026

Healthcare AI Doesn’t Have a Model Problem. It Has a Data Problem — and That’s Your Opening.

John Lee (an emergency physician and Epic master) made the argument cleanly this week: the thing breaking healthcare AI isn’t the model. It’s the garbage underneath it.

His example: a “Dr. Smith’s urology procedure” code, invented for one clinic’s operational convenience, now sitting in the chart as if it meant something real.

He maps it to a fault line. There’s the standards layer everyone designs for — SNOMED, LOINC, RxNorm — and the messy implementation layer where local codes, synonyms, and homonyms actually live. The shiny AI gets deployed at the top. The rot is at the bottom.

The same LLMs everyone is bolting onto the top of the stack are the best tool ever built for the boring job at the bottom: normalizing that broken semantic layer back to the standard.

And here’s the part that’s yours. A model can map a code — but it can’t tell you the code is wrong. Only the clinician who lived the workflow knows “Dr. Smith’s procedure” is junk.

That’s the domain-expertise moat in one sentence: agentic AI made writing the software cheap, so the binding constraint moved to “can you tell whether it’s right.” At the data layer, that oracle is a person who’s been in the room.
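Here is a minimal Python sketch of that split. It assumes a hypothetical upstream step (an LLM or terminology service, not shown) that proposes a standard code and a confidence score for each local code. The local codes, placeholder targets, scores, and the 0.9 threshold are all illustrative. Confident proposals are auto-accepted. A clinician’s verdict that a local code is junk always wins, because that is the judgment the model can’t make.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Proposal:
    local_code: str            # the site-specific code as it appears in the chart
    standard_code: str | None  # model's candidate standard code, or None if it found none
    confidence: float          # model's self-reported confidence, 0.0 to 1.0


@dataclass
class Triage:
    auto_mapped: dict[str, str] = field(default_factory=dict)  # accepted without review
    needs_clinician: list[str] = field(default_factory=list)   # low confidence or unmapped
    retired: list[str] = field(default_factory=list)           # clinician-flagged junk codes


def triage(proposals: list[Proposal], clinician_retired: set[str], threshold: float = 0.9) -> Triage:
    """Route each proposal: retire known junk, auto-accept confident maps, queue the rest."""
    result = Triage()
    for p in proposals:
        if p.local_code in clinician_retired:
            # A clinician already judged this code meaningless; no model confidence overrides that.
            result.retired.append(p.local_code)
        elif p.standard_code is not None and p.confidence >= threshold:
            result.auto_mapped[p.local_code] = p.standard_code
        else:
            result.needs_clinician.append(p.local_code)
    return result
```

The `retired` bucket is the one only a person who lived the workflow can fill. Everything else is the part LLMs make cheap.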

😤 “Data cleanup is a 20-year-old problem nobody’s solved.” Right — because we tried to solve it by making clinicians the unpaid normalization workforce, the most expensive clerical job in history, and the data stayed wrong anyway.

😤 “This is just terminology mapping with extra steps.” Go run it on your own problem list and tell me it’s solved.

😤 “Boring infrastructure never wins the budget.” True, and that’s the whole tragedy — it loses to a shiny Layer-1 demo every quarter. Which is also exactly why it’s wide open: nobody glamorous is fighting you for it.

“No Hallucinations” Is a Marketing Claim, Not an Architecture

A skeptical technical teardown reconstructs what’s likely under OpenEvidence’s hood: a frontier base model, LoRA fine-tuning adapters, and a RAG system over ~35 million papers with a reranker.

The sharpest point: “no hallucinations” is architecturally impossible. Every confident marketing line is a checkable architecture claim, and the “I don’t know” behavior can only live in a few specific places in the stack.

When a clinical tool promises zero hallucinations, it’s describing its ambition, not its architecture.

😤 “It works great in clinic — who cares how it’s built?” The day it’s wrong about your patient, you’ll care exactly how it’s built.

💡 80/20: For any clinical answer engine, ask the vendor one question: “What, mechanically, makes it say ‘I don’t know’?” If they can’t point to where that lives, it doesn’t.
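Here is one possible answer, as a toy sketch. None of this is OpenEvidence’s code, which isn’t public. Token overlap stands in for the retriever, Jaccard similarity stands in for a learned reranker, and the 0.3 cutoff is an arbitrary assumption. It shows one concrete place “I don’t know” can live: a gate on the best reranked score, checked before any text is generated.

```python
from __future__ import annotations

import re


def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, corpus: list[str], k: int = 3) -> list[str]:
    # Stage 1, cheap recall: the k documents sharing the most tokens with the query.
    q = tokens(query)
    return sorted(corpus, key=lambda doc: len(q & tokens(doc)), reverse=True)[:k]


def rerank_score(query: str, doc: str) -> float:
    # Stage 2, stand-in reranker: Jaccard similarity of token sets
    # (a production system would use a trained reranking model here).
    q, d = tokens(query), tokens(doc)
    union = q | d
    return len(q & d) / len(union) if union else 0.0


def answer(query: str, corpus: list[str], min_score: float = 0.3) -> str:
    scored = sorted(((rerank_score(query, d), d) for d in retrieve(query, corpus)), reverse=True)
    # The abstention gate: if even the best evidence is weak, refuse instead of generating.
    if not scored or scored[0][0] < min_score:
        return "I don't know: no sufficiently relevant evidence was retrieved."
    # A real system would now pass the top passages to the generator; here we just cite the best one.
    return f"Grounded in: {scored[0][1]}"
```

The limit is built in. A gate like this can only block weak retrieval. It can’t stop a generator from misreading a strong source, which is why “no hallucinations” stays a statement of ambition.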

The “AI Out-Reasons Doctors” Study, Read Carefully

An FDA AI advisor and practicing PCP re-read the Science study where OpenAI’s o1 beat physicians across five experiments and fooled evaluators in over 83% of cases.

His catch: the model only entered after a human had already gathered the history, the exam, and the labs. That’s not clinical reasoning — it’s differential diagnosis on a pre-solved case. The real reasoning is deciding what to ask, examine, and order, and the study skipped it. On the hardest real-world cases, the AI’s edge shrank to non-significant.

“Beats doctors at reasoning” really means “beats doctors at the easy half, after a human did the hard half.”

What the study measured isn’t the reasoning itself; it’s the reasoning after a human had already framed the case. That gap is the interesting part, not a footnote.

💡 80/20: If you’re building diagnostic AI, the moat isn’t the answer on a tidy vignette. It’s the information-gathering loop — what to ask next — that nobody is benchmarking yet.
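“What to ask next” has a classic formalization: pick the question with the highest expected information gain over the current differential. The sketch below is a toy built on loud assumptions. It uses placeholder diagnoses, invented priors and likelihoods, and yes/no findings only. It is not clinical data or a validated method, just the shape of the loop the benchmarks skip.

```python
from __future__ import annotations

import math


def entropy(dist: dict[str, float]) -> float:
    """Shannon entropy (bits) of a distribution over diagnoses."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def update(prior: dict[str, float], likelihood: dict[str, float], answer_yes: bool) -> dict[str, float]:
    """Bayes update of the differential; likelihood[dx] = P(finding present | dx)."""
    unnorm = {dx: prior[dx] * (likelihood[dx] if answer_yes else 1 - likelihood[dx]) for dx in prior}
    z = sum(unnorm.values())
    if z == 0:
        return dict(prior)  # answer impossible under this model; leave the differential unchanged
    return {dx: v / z for dx, v in unnorm.items()}


def expected_info_gain(prior: dict[str, float], likelihood: dict[str, float]) -> float:
    """Expected drop in entropy from asking one yes/no question."""
    prob_yes = sum(prior[dx] * likelihood[dx] for dx in prior)
    gain = entropy(prior)
    for answer_yes, p_answer in ((True, prob_yes), (False, 1 - prob_yes)):
        if p_answer > 0:
            gain -= p_answer * entropy(update(prior, likelihood, answer_yes))
    return gain


def best_next_question(prior: dict[str, float], questions: dict[str, dict[str, float]]) -> str:
    """Choose the question whose answer is expected to shrink uncertainty the most."""
    return max(questions, key=lambda q: expected_info_gain(prior, questions[q]))
```

A question whose answer is equally likely under every diagnosis scores zero, however natural it feels to ask. Data like that is exactly what a vignette benchmark never collects.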

Epic Is a Platform. So Is Apple. Only One Leaves the Door Open.

A former CMIO argues Epic should be read like Apple — a platform that extracts value and “Sherlocks” features out from under its ecosystem. The difference is what’s left for builders.

Apple takes a contested 30% toll but leaves a wide-open market. Epic sets a high floor to entry (contracts, security reviews, IT committees, Foundation deploys negotiated per customer) and, because its own bylaws bar it from acquiring companies, offers no exit to the vendors whose features it absorbs. Its Agent Factory is platform-shaped tooling, not a developer platform. The proof points are real, though: Sutter is live with Ask Emmie in MyChart, Rush cut billing customer-service messages 58%, and Summit trimmed prior-auth submission time 42%.

Building “on Epic” isn’t building on a platform — it’s renting a room with no key and no way out.

😤 “Then why does everyone integrate with Epic?” Because that’s where the patients are. But distribution isn’t opportunity — a high API count plus a contract negotiation is not a developer experience.

The Next Interoperability Quest Is Proxy Access

With direct patient data access finally maturing, the unsolved frontier is proxy access — letting a parent, an adult child, or a caregiver reach someone else’s record.

Real care runs through proxies: the parent for the infant, the adult child for the aging parent, the caregiver for someone with a disability. Health-system-side authorization solves it trivially by surfacing relationships that already exist; the credential-service-provider and third-party-app models handle it badly.

Almost no patient-facing tool models the caregiver — and the caregiver is who’s actually using it.

💡 80/20: If you’re building anything patient-facing, design the proxy relationship on day one. The medically-complex kid and the aging parent are your real users, not the idealized solo patient.
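This sketch shows what “design the proxy relationship on day one” can mean in code. The grant is a first-class record with a relationship, explicit scopes, and an expiry, rather than a shared password. The field names and scope strings are hypothetical and don’t follow any vendor’s schema. In FHIR terms, the closest real resources are RelatedPerson and Consent.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ProxyGrant:
    proxy_id: str                # the caregiver's own login, never the patient's
    patient_id: str              # whose record the grant reaches
    relationship: str            # e.g. "parent", "adult_child", "caregiver"
    scopes: frozenset[str]       # e.g. {"read_results", "message_care_team"}
    expires: date | None = None  # None means open-ended


def can_act(grants: list[ProxyGrant], user_id: str, patient_id: str, scope: str, today: date) -> bool:
    """True if user_id may perform `scope` on patient_id's record today."""
    if user_id == patient_id:
        return True  # self-access; this sketch does not model patient-side restrictions
    return any(
        g.proxy_id == user_id
        and g.patient_id == patient_id
        and scope in g.scopes
        and (g.expires is None or today < g.expires)
        for g in grants
    )
```

The expiry field matters for the infant case, because a parent’s access to a child’s record typically has to change as the child gets older.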

NVIDIA Opened Its Model Stack — With a Healthcare Tilt

NVIDIA dropped Nemotron 3 Ultra and the omnimodal Cosmos 3 — open weights and published training data and recipes — and previewed RTX Spark, a roughly one-petaflop personal “AI computer.” The announcement explicitly name-checks healthcare AI.

For a clinician running Ollama or LM Studio, a US-made open model the community rates “one notch below frontier,” with its training data published, is a real local option for clinical NLP that never phones home.

The most private clinical model is the one running on hardware you own — and that just got a lot more capable.

💡 80/20: When RTX-class local compute lands on a desk, the calculus flips: prototype on de-identified notes on a machine in your own office, no cloud, no BAA. Watch this rail.
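To try that rail today, the local loop is short. This minimal sketch assumes Ollama is running on its default port (11434) and that you’ve already pulled a model. The model tag below is a placeholder, not a real Nemotron tag, and the extraction prompt is illustrative. Because the request goes to localhost, the note never leaves the machine.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_request(note_text: str, model: str = "your-local-model") -> urllib.request.Request:
    """Build (but don't send) a non-streaming generate request to a local Ollama server."""
    payload = {
        "model": model,  # placeholder: use a tag that `ollama list` shows on your machine
        "prompt": "List every medication in this de-identified note:\n\n" + note_text,
        "stream": False,  # return one JSON object instead of a token stream
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_locally(note_text: str, model: str = "your-local-model") -> str:
    """Send the request to localhost and return the model's text."""
    with urllib.request.urlopen(build_request(note_text, model), timeout=120) as resp:
        return json.loads(resp.read())["response"]
```

Call `run_locally(note_text)` once the server is up. `build_request` is split out so you can inspect exactly what would be sent before anything is.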

Ultra-short:

Figma Make now edits your production code. Design-to-code AI keeps eating the gap between mockup and shipped UI — handy if you’re vibe-coding the front end of a clinical prototype. (Figma)
A widely read AI newsletter is quitting daily coverage for weekly depth. The reasoning is the signal: when execution is cheap, the scarce skill is knowing what to build, and daily news can’t teach that. (Nate’s Newsletter)
SpaceX reportedly spent $12.7B on AI in 2025 — about 3x its rocket budget. Not health, but it’s the same spend-now-measure-later reflex CFOs are now reining in everywhere. (Motley Fool)

🎙️ From the Pods

🎙️ The 229 / This Week Health — “Shadow AI, Shrinking Budgets, and the Agents Nobody Approved”

One CIO put network monitoring on his stack and found 50+ AI agents running that he’d never approved; another thought he had 25–30 and turned up well over 100, many of them arriving silently inside vendor upgrades (the Workday install that quietly grew agents). The other half of the conversation: the data-literacy paradox, “can’t we just use AI to clean the data so we can use it for AI?”
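The first CIO’s exercise reduces to a set difference. This minimal sketch assumes your network monitoring can export (agent identifier, originating system) pairs and that you keep an approved-agent list. Both formats and the example names are hypothetical. Grouping by originating system surfaces the vendor-upgrade pattern directly.

```python
from __future__ import annotations

from collections import defaultdict


def reconcile(discovered: list[tuple[str, str]], approved: set[str]) -> dict[str, list[str]]:
    """Return unapproved agents, grouped by the system they arrived with."""
    unapproved: dict[str, list[str]] = defaultdict(list)
    # set() drops repeat sightings of the same agent; sorted() makes the report deterministic.
    for agent_id, source_system in sorted(set(discovered)):
        if agent_id not in approved:
            unapproved[source_system].append(agent_id)
    return dict(unapproved)
```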

🎙️ The Heart of Healthcare — “Digital Health Download: June 2026”

Wearables, written off a decade ago as hardware nobody wanted to fund, are now one of digital health’s most durable categories — Whoop raised $575M at a $10.1B valuation, growing 103% year over year and cash-flow positive, with Oura close behind. The thing that changed: hardware plus a subscription people keep paying for.

💡 Builder take: The durable consumer-health model isn’t a one-time sale — it’s a device that earns a monthly fee by delivering an insight worth re-buying. Align the product incentive with recurring value or the wearable ends up in a drawer.

🤝 Selling It

🤝 The “We’ll Just Build It Ourselves” Conversation

This is the most common objection a clinician-builder hears in 2026 — and now the most credible one, because open-source clinical MCP primitives and FHIR-to-MCP bridges have made “we’ll just build it” sound real instead of like a bluff.

Don’t argue they can’t. Agree, then move the conversation from features to total cost of ownership: the build-it-yourself option is exactly the one the platform roll-ups are betting buyers will regret. “Some systems will build it, and some will be glad they did. The ones who tried in 2024 are why there’s now a multibillion-dollar company offering to take it off their hands.”

💡 Try this: Before your next pitch, write the three maintenance liabilities your tool absorbs that a home-built version dumps on their own staff — spec changes, model drift and safety monitoring, and key-person risk — with a name and an hours-per-quarter estimate next to each. That one page is your answer to “we’ll build it ourselves,” and it’s the only slide a CFO actually reads, because it’s about TCO, not features.
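The arithmetic behind that page is simple enough to keep in a script you rerun for each customer. This sketch uses made-up hours, owners, and hypothetical loaded hourly rates. Replace every number with your own estimates.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Liability:
    name: str
    owner: str                # who on the buyer's staff absorbs it
    hours_per_quarter: float  # your estimate
    loaded_rate: float        # fully loaded cost per staff hour, USD (assumption)

    @property
    def annual_cost(self) -> float:
        return self.hours_per_quarter * self.loaded_rate * 4


def one_pager(items: list[Liability]) -> str:
    """Render one line per liability plus the total your tool takes off their plate."""
    lines = [
        f"{i.name}: {i.owner}, {i.hours_per_quarter:g} h/qtr, ${i.annual_cost:,.0f}/yr"
        for i in items
    ]
    lines.append(f"Total absorbed by your tool: ${sum(i.annual_cost for i in items):,.0f}/yr")
    return "\n".join(lines)
```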

🧰 Builder’s Tip

Prompt Template: Make the model write a better prompt before you ask the real question

The cheapest upgrade to any clinical AI answer is fixing the prompt before you run it. Meta-prompting (having the model draft the ideal prompt first) helps catch automation bias, confirmation bias, and sycophancy before they reach a patient. Paste this, drop in your one-breath question, and review the rewrite before you let it run:

You are helping a physician run a rigorous clinical literature search.
Before answering, REWRITE my question as the ideal prompt:
- Structure it PICO-style (Population, Intervention, Comparison, Outcome)
- State that you will show your reasoning step by step
- Request a GRADE rating for the strength of evidence
- Surface at least one counterfactual / disconfirming finding
- Restrict sources to peer-reviewed literature, and REFUSE to cite
  anything you cannot ground in a real, resolvable source
Then show me the improved prompt and WAIT for my go-ahead before running it.

My rough question: [paste your one-breath clinical question here]

Runs on any model, on zero PHI. The refuse-to-fabricate clause is the part that matters: a confident answer with an invented citation is worse than no answer.
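In a chat window, the WAIT line is the review step. If you script the same flow against an API, the gate has to live in your code instead. This minimal sketch assumes `ask_model` is whatever function you already use to call a model (hypothetical here) and `approve` is a human check. Pass the template above in as `template`.

```python
from __future__ import annotations

from typing import Callable

PLACEHOLDER = "[paste your one-breath clinical question here]"


def meta_prompt_run(
    template: str,
    question: str,
    ask_model: Callable[[str], str],
    approve: Callable[[str], bool],
) -> str | None:
    """Two-pass meta-prompting with a human gate between the passes."""
    if PLACEHOLDER not in template:
        raise ValueError("template is missing the question placeholder")
    # Pass 1: the model rewrites the rough question into an improved prompt.
    improved = ask_model(template.replace(PLACEHOLDER, question))
    # Human gate: nothing runs until a person has read and accepted the rewrite.
    if not approve(improved):
        return None
    # Pass 2: run the approved prompt as the real query.
    return ask_model(improved)
```

In practice, `approve` can be as simple as printing the rewrite and reading a y/n from the terminal.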

💡 BTW

💡 BTW: Jensen Huang, whose NVIDIA just open-sourced the model stack above, got his first job at a Denny’s in Portland at age 15, working as a dishwasher, then busboy, then waiter. About 15 years later, he sketched out the company that became NVIDIA in a booth at a San Jose Denny’s.

What are you building this week? Reply and tell me — I read every one.

— Kevin

clinicians.build
