Health AI flunks the stress test 🧪, Carbon Health pays $4.4M for the org chart ⚖️, The voice-AI gold rush gets frothy 📞

Jun 30, 2026

The benchmark says ready. The stress test says no.

A new Nature Medicine paper put the flagship frontier models — GPT-5, Gemini, the usual leaderboard — through adversarial stress tests on health tasks.

The finding is unsettling in a specific way: the models could often guess the right answer with key inputs removed, then get confused by a trivial change in wording — fabricating fluent, convincing reasoning traces for the wrong conclusion.

The same authors showed that popular health benchmarks vary wildly in what they actually measure. A high score on one isn’t a high score on “is this safe in a clinic.”

Readiness was never the accuracy number. A system that aces the benchmark and breaks on a rephrase hasn’t learned medicine — it’s learned the shape of the test.
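
One way to make “breaks on a rephrase” measurable is a consistency check: ask the same case in several wordings, plus a version with a key input removed, and see whether the answer holds. What follows is a minimal sketch, not the paper’s method. The `ask_model` callable, the function name, and the toy model are all hypothetical stand-ins for whatever system you are testing.

```python
from collections import Counter
from typing import Callable, Sequence


def consistency_report(
    ask_model: Callable[[str], str],   # hypothetical: prompt in, answer out
    paraphrases: Sequence[str],        # same case, different wording
    ablated: str,                      # same case with a key input removed
) -> dict:
    """Flag answers that flip on rewording or survive a missing input."""
    answers = [ask_model(p).strip().lower() for p in paraphrases]
    top_answer, top_count = Counter(answers).most_common(1)[0]
    ablated_answer = ask_model(ablated).strip().lower()
    return {
        "majority_answer": top_answer,
        # 1.0 means every wording produced the same answer
        "agreement": top_count / len(answers),
        # True means the answer did not change when the key input was removed,
        # a hint that the model may not be using that input at all
        "answers_without_key_input": ablated_answer == top_answer,
    }


# Toy stand-in model (an assumption for the demo): it keys on a single word.
def toy_model(prompt: str) -> str:
    return "heart failure" if "edema" in prompt else "unclear"


report = consistency_report(
    toy_model,
    paraphrases=[
        "72M with dyspnea and leg edema. Diagnosis?",
        "72-year-old man, dyspnea, bilateral edema. Diagnosis?",
        "72-year-old man, short of breath, swollen legs. Diagnosis?",
    ],
    ablated="72M. Diagnosis?",
)
print(report)
```

The toy model agrees with itself on only two of three wordings, which is the kind of wobble a real harness would surface across many cases before anyone looks at an accuracy number.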

Here’s why the timing matters. A federal program, ARPA-H’s ADVOCATE, is funding autonomous agents to adjust heart-failure meds and appointments on a three-year FDA path — built on exactly these models.

And the ground underneath them is getting noisier. Ambient AI wrote the note. An AI coding tool picked the diagnosis. The next model trains on that corpus and grades itself against the thing it helped write.

A benchmark is not the territory. It’s the model exposed to our particular way of asking — and we keep mistaking the reflection for the patient. The most honest thing in the whole paper is that we don’t yet have a measurement that survives contact with a clinic. That’s not a reason to look away. It’s the most interesting place to build.

😤 “Every model fails adversarial tests. Humans do too — show me the doctor who never anchors.” True, and that’s the right comparison to demand. The difference is calibration: a good clinician usually knows the edge of their knowledge and slows down; these models narrate maximum confidence right off the cliff. The fix isn’t a smarter model, it’s an instrument that catches the overconfidence before a human signs it.

😤 “This is academics moving the goalposts. The models keep getting better.” They do. And the goalpost should move — that’s what “readiness” means in a field where being wrong has real consequences.

😤 “So your answer is wait.” No. My answer is measure the right thing, then ship the narrow piece that survives the measurement. “Wait” and “deploy autonomously” are not the only two options.

❓ What product lives in the gap between the benchmark and the bedside? Not another leaderboard — something that tells a clinician, in the moment, “this output is the kind the model is brittle on.” I think there’s a real tool in the wobble itself, and I can’t quite name it yet.

California just fined a primary-care chain $4.4M for who owned it.

The state’s first-of-its-kind settlement with Carbon Health found the MSO effectively controlled medical decisions — violating California’s ban on the corporate practice of medicine — and misled patients on billing. Carbon pays $4.4M; co-founder Eren Bali pays $100K.

If you’re building a care-delivery company with a “friendly PC” structure, the org chart is now a regulatory surface, not a legal footnote.

The “friendly PC” wrapper that everyone copies from everyone is now a thing an attorney general will read line by line.

😤 “This is a California problem.” It’s a California first. CPOM statutes exist in a lot of states and have been gathering dust; AG offices just learned the structure is enumerable. Don’t bet your cap table on dust.

💡 80/20: If your business model employs clinicians through an MSO, have someone who isn’t your incorporating lawyer answer one question: who actually controls the clinical decision? If the honest answer is “the company,” you have a Carbon problem in miniature.

Harvard trained a foundation model to predict your next diagnosis.

Circulating hard in clinician feeds this week: MGB’s DT-Transformer, a GPT-style model trained on 57 million structured EHR events that forecasts a patient’s next diagnosis — and roughly when — across nearly 900 diseases.

The build lesson isn’t the model. It’s that “read a lifetime of diagnoses as a sequence and predict the next token” is now a tractable pattern on real records, not a research toy.
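
To make the pattern concrete at toy scale: treat each patient’s history as an ordered sequence of codes and predict the next one from what usually follows. This is a minimal sketch using bigram counts, not a transformer and not DT-Transformer’s method. The histories are made up for illustration, and the ICD-10-style codes are only placeholders.

```python
from collections import Counter, defaultdict

# Made-up patient histories: each list is one patient's diagnoses in order.
histories = [
    ["E11", "I10", "N18"],
    ["E11", "I10", "I50"],
    ["I10", "N18"],
    ["E11", "I10", "N18"],
]

# Count how often each code is immediately followed by each other code.
next_counts = defaultdict(Counter)
for history in histories:
    for current, following in zip(history, history[1:]):
        next_counts[current][following] += 1


def predict_next(code: str) -> list:
    """Return (next_code, empirical_probability) pairs, most likely first."""
    counts = next_counts.get(code)
    if not counts:
        return []  # code never seen with a successor
    total = sum(counts.values())
    return [(nxt, n / total) for nxt, n in counts.most_common()]


print(predict_next("I10"))
```

A real model conditions on the whole history and on timing rather than only the previous code, but the framing is the same: sequence in, distribution over the next event out.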

❓ The validated model exists and a patient’s own data is right there in the portal export. “Wait for your cardiologist” — is that caution, or is it the same paternalism the #WeAreNotWaiting crowd already routed around once?

Connecticut’s AI disclosure mandate goes live tomorrow.

As of July 1, Connecticut’s amended privacy law requires LLM-backed chat and voice agents to disclose, in the privacy notice, that personal data may be used to train the underlying model. Meanwhile Colorado just narrowed its landmark AI Act and pushed it to January 2027.

The compliance perimeter isn’t a wall you build once — it’s a different shape in every state, and it moved twice this month.

Ultra-shorts

Graham Walker, MD (MDCalc / Offcall) — flags a Pediatrics paper on sociodemographic variability in AI-driven pediatric ED decisions, and says it finally gave him language for his unease.

Yair Saperstein, MD (AvoMD) — the #1 line his Epic inbox served up overnight: “Are you caring for this patient?” The ambiguity in that one sentence is a whole product category for clinical AI that knows who owns the patient right now.

🎙️ From the Pods

🎙️ Health Tech Nerds Radio — “The Grand Roundup”

Two agentic voice-AI companies announced big rounds on back-to-back days — and both press releases carried a customer quote calling that vendor “the only true platform” after an exhaustive RFP. Same week, opposite winners.

The unsexy question the hosts kept circling: voice AI is token-intensive, and nobody’s shown the gross margins versus a human call center once the subsidy burns off.

💡 Builder take: When two competitors both claim to be the only platform, the differentiator buyers can’t fake is unit economics. If you’re in this space, know your cost-per-resolved-call cold — that’s the number that survives the froth.
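
As a back-of-envelope sketch of that number: divide total spend by the calls actually resolved, not the calls handled, so that failures and escalations count against you. Every figure below is a placeholder assumption, not data from the episode or from any vendor.

```python
def cost_per_resolved_call(total_calls: int, resolution_rate: float,
                           cost_per_call: float, escalation_cost: float) -> float:
    """Blended cost per call that the system actually resolved.

    Unresolved calls still incur their own per-call cost and are then
    escalated to a human, so both costs are charged to the resolved ones.
    """
    resolved = total_calls * resolution_rate
    if resolved <= 0:
        raise ValueError("no resolved calls")
    unresolved = total_calls - resolved
    total_cost = total_calls * cost_per_call + unresolved * escalation_cost
    return total_cost / resolved


# Placeholder inputs (assumptions for illustration only).
ai = cost_per_resolved_call(1000, resolution_rate=0.8,
                            cost_per_call=1.00, escalation_cost=5.00)
print(f"AI cost per resolved call: ${ai:.2f}")
```

With these placeholders, a $1.00 call becomes $2.50 per resolved call once escalations are counted. Run the same function with your human call center’s numbers to get the comparison the hosts were asking for.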

🔇 Speaker Blindspot: Survivorship bias — the case studies that travel are the practices that saw a million dollars in new revenue. The clinics where the IVR jammed up the schedule don’t get a slide.

🎙️ The 229 — “Fable Goes Dark and the Demo Derby Problem”

A powerful coding model — Mythos — got pulled offline under federal pressure after someone social-engineered it past its guardrails. Drex DeFord’s line is the pearl: guardrails on generative models “are not deterministic — they are probabilistic.” The same model that audited a health system’s codebase in two days was handed to the attackers too.

💡 Builder take: If your security story is “the model has guardrails,” you don’t have a security story. Treat model guardrails as a probabilistic filter, and put deterministic controls — scopes, allow-lists, human gates — around anything that can act.
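
A minimal sketch of what “deterministic controls around anything that can act” might look like: the model proposes an action, and plain code decides whether it runs. The action names, the `ALLOWED_ACTIONS` set, and the approval flag are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass

# Hypothetical policy: which actions exist, and which need a human sign-off.
ALLOWED_ACTIONS = {"read_schedule", "draft_message", "update_medication"}
NEEDS_HUMAN_APPROVAL = {"update_medication"}


@dataclass(frozen=True)
class ProposedAction:
    name: str
    approved_by_human: bool = False


def gate(action: ProposedAction) -> str:
    """Deterministic check applied to whatever the model proposes."""
    if action.name not in ALLOWED_ACTIONS:
        return "deny"              # not on the allow-list: never runs
    if action.name in NEEDS_HUMAN_APPROVAL and not action.approved_by_human:
        return "hold_for_human"    # human gate for high-risk actions
    return "allow"


print(gate(ProposedAction("delete_records")))
print(gate(ProposedAction("update_medication")))
print(gate(ProposedAction("update_medication", approved_by_human=True)))
```

The point of the design is that no prompt can change the outcome of `gate`: the same proposed action gets the same decision every time, regardless of how the model was talked into proposing it.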

🔇 Speaker Blindspot: Appeal to common practice (tu quoque) — the vendor’s defense was “everybody else’s model has the same problem.” True, and completely irrelevant to whether yours should be wired into a hospital network.

💡 BTW

💡 BTW: Eren Bali — the Carbon Health co-founder who just personally paid $100K in that California settlement — was born to Kurdish parents in an apricot-farming village in Malatya, Turkey, where his mother taught grades one through five in a one-room schoolhouse. He taught himself math on early internet forums, won gold at the Turkish Math Olympiad and silver at the International Math Olympiad, then co-founded Udemy before Carbon. (Source: Wikipedia)

📅 Upcoming: ONC Health IT Certification Developer Roundtable (Wed Jul 1, free) — open to any developer touching an EHR, not just certified ones.

What are you building this week? Email and tell me (kevin@clinicians.build) — I read every one.

— Kevin & AI

clinicians.build
