Every AI missed the pneumothorax 🫁, Congress kills the AI prior-auth pilot 🪦, Claude goes Mythos-class 🤖

Jun 10, 2026

Every Tool Read the Film. Every Tool Missed the Pneumothorax.

Sam Ashoo, MD — EM and clinical informatics, the same doc who ran ECGs through these tools last week — fed two typical ED X-rays to 11 medical AI tools.

The chest film had a left apical pneumothorax. Every system that attempted it called the film normal — and explicitly denied a pneumothorax. Including the radiology-specific products.

The pediatric elbow told a different story. Several tools correctly called the supracondylar fracture; the same vendors that missed the pneumo got the elbow right.

Competence doesn’t transfer. A model that nails one imaging task tells you nothing about the next one — and polished output is exactly what hides the difference.

The most honest answers in the whole benchmark were refusals: HeidiHealth and Glass Health declined to interpret images as out of scope. The tools that knew their limits beat the tools that didn’t.

The same morning, Jan Beger — Global Head of AI Advocacy at GE HealthCare — published the other half of this story: every validation is a snapshot of a frozen world, and deployment is the thing that unfreezes it. Clinicians adapt to the tool, trainees learn from its output, and the conditions the validation assumed quietly dissolve.

He anchors it in a May NEJM AI randomized trial: 44 physicians — all of whom had completed AI-literacy training — saw diagnostic accuracy fall from 84.9% to 73.3% when the model’s suggestions contained planted errors. They were free to ignore the AI. They followed it anyway.

Here’s the part I find interesting rather than scary: a benchmark is a measurement of the world at the instant you froze it. The map is never the territory, and a system can’t fully validate itself from the inside. That’s not a flaw in our benchmarks — it’s a property of measurement. The question is who builds for it.

😤 “Two X-rays isn’t a benchmark. No stats, no CI, n=2 — this is an anecdote.” Correct — and that’s the point. It took exactly one film to falsify eleven marketing pages. You don’t need a power calculation to learn that “accepts image uploads” is not the same claim as “reads images.” The anecdote isn’t the evidence base; it’s the smoke detector.

😤 “Radiologists miss pneumothoraces too.” They do. And when they do, there’s an M&M, a peer review, and a name on the read. Show me the AI tool’s M&M.

😤 “The next model version will fix this.” Maybe. But “fixed” is precisely the claim a frozen-world validation can’t support — and the NEJM trial says the dangerous failure mode isn’t the model being wrong, it’s the trained human following it when it is.

Congress Just Voted to Kill Medicare’s AI Prior-Auth Pilot — Unanimously

House appropriators voted unanimously on June 9 to defund WISeR, the CMS pilot using AI plus human review to screen “wasteful” services in six states.

The model started January 1. It barely got five months of runway before the politics caught up with it — and the opposition was bipartisan from the start.

“AI that says no to Medicare patients” turned out to be the one thing both parties could agree to kill. Political durability is now a design constraint.

😤 “It’s an appropriations rider — the model isn’t actually dead yet.” True, and the markup still has to survive the full process. But a unanimous committee vote tells every model designer at CMMI exactly where the ceiling is for AI-flavored utilization management. Read it as a forecast, not a funeral.

Anthropic Shipped a “Mythos-Class” Model — and a Two-Tier Access World

Anthropic released two models days after publicly calling for an industry slowdown: Claude Fable 5, which is generally available, and Claude Mythos 5, the same underlying model with safeguards lifted in some areas and access restricted to a small group of cyberdefenders and infrastructure providers.

Part of the justification: Anthropic demonstrated the Mythos class can turn known Windows and Firefox vulnerabilities into working exploits in hours.

The capability ceiling moved up, but the precedent matters more: frontier capability is now tiered by who you are, not just what the model can do.

😤 “The safety layer will silently nerf my clinical queries.” The announced design falls back to Opus 4.8 on a small share of sessions — which makes “what happens to my query when the safeguard fires, and do I get told?” a legitimate procurement question now. Ask it. The right demand isn’t “no safeguards”; it’s “show me when they fired.”

💡 80/20: Pricing is $10/$50 per million tokens. If you built the eval harness, swapping models is an afternoon and the harness tells you whether the upgrade is real for your task. If you didn’t, you’re reading benchmarks (see the X-ray story above).
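If you haven’t built that harness yet, here is a minimal sketch of the idea in Python, not a definitive implementation. It assumes the $10/$50 figure means input/output prices per million tokens (the usual convention, but an assumption here), and every name in it (`Case`, `accuracy`, `token_cost_usd`) is hypothetical. The `model` argument stands in for whatever function you write to call your vendor; no real API is used.

```python
from dataclasses import dataclass
from typing import Callable

# Assumption: "$10/$50" means $10 per million input tokens
# and $50 per million output tokens.
INPUT_USD_PER_MTOK = 10.0
OUTPUT_USD_PER_MTOK = 50.0


def token_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one call at the assumed per-million-token prices."""
    total = input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    return total / 1_000_000


@dataclass
class Case:
    prompt: str    # synthetic vignette only, never real PHI
    expected: str  # the answer you consider correct for this case


def accuracy(model: Callable[[str], str], cases: list[Case]) -> float:
    """Fraction of cases where the model's answer matches the expected label.

    Matching is exact after trimming whitespace and lowercasing, which is
    crude; a real harness would likely need a task-specific grader.
    """
    if not cases:
        raise ValueError("need at least one case")
    hits = sum(
        model(case.prompt).strip().lower() == case.expected.strip().lower()
        for case in cases
    )
    return hits / len(cases)


if __name__ == "__main__":
    # Stub "models" for illustration; replace with wrappers around real calls.
    def old_model(prompt: str) -> str:
        return "normal"

    def new_model(prompt: str) -> str:
        return "pneumothorax" if "apical" in prompt else "normal"

    cases = [
        Case("CXR report text mentioning an apical lucency", "pneumothorax"),
        Case("CXR report text with clear lung fields", "normal"),
    ]
    print("old:", accuracy(old_model, cases))
    print("new:", accuracy(new_model, cases))
    print("cost of a 2,000-in / 500-out call:", token_cost_usd(2000, 500))
```

The design point is that the cases and the grader belong to you, not the vendor. Rerun the same cases against the old and new model and the upgrade question becomes a number for your task instead of a press release.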

The Pentagon May Be Un-Bundling America’s Second-Biggest EHR Rollout

A new federal solicitation names the component vendors behind MHS GENESIS directly — Oracle Health for the core EHR, Philips for tele-critical care, Amwell for telehealth, Solventum for documentation and revenue cycle, Henry Schein for dental — a signal the Defense Health Agency may be moving away from a single prime integrator. Read it as direction of travel, not a done deal: Leidos remains the prime on the current sole-source extension, Amwell delivers virtual health under that Leidos-led team today, and open competition for sustainment isn’t expected to reach the market until ~2028.

The Medicare GLP-1 Bridge Goes Live July 1 — Look at the Plumbing, Not the Drug

The bridge program gives eligible Part D beneficiaries GLP-1s at $50 a month against a $245 negotiated net price, starting in three weeks. The program details are public, the ePA flow runs through CoverMyMeds’ rails, and Humana sits in the middle as central processor.

CMS stood up a national prior-auth-and-claims rail for one drug class in a matter of months. That rail is a spec — read it like one. The intake, eligibility, and adjudication plumbing around GLP-1s is becoming its own product category.

Virtual Menopause Care Is a Regulatory-Reset Story, Not a Demand Story

A sharp analysis this week argues the boom (Midi, Alloy, Elektra, Stella, Winona) traces to a clinical reset: FDA removed the black-box warning from six HRT products last November, ACOG reaffirmed its guidance, and the care model happens to fit virtual delivery unusually well.

A label change created a category overnight. Worth keeping a running list of regulatory resets in your own specialty — the next one is somebody’s company.

Ultra-short:

OpenAI confidentially filed for an IPO. The S-1 is in, Anthropic reportedly filed first, and your AI vendors are about to have quarterly earnings pressure. Price stability and deprecation schedules just became diligence questions.

Cognition shipped FrontierCode. A benchmark built by open-source maintainers that scores whether AI code is mergeable, not just test-passing — the best frontier model clears ~13% on the hardest subset. The gap between “compiles” and “a maintainer would accept this” is the same gap as “reads images” vs. “missed the pneumo.”

HLTH and ViVE have a new landlord. Hyve, the events group behind both, sold to PE firm Hellman & Friedman for a reported $1.8B. A new owner brings new return targets: assume your cost-per-buyer-conversation at the booth goes up, and re-underwrite accordingly.

One-Click Deploy (DrClaw)

A pre-launch from an MD builder: clinicians can vibe-code on Lovable or Replit but hit a wall at production — no BAA, no compliant deployment path, no engineer. This claims to take a project from sandbox to production with end-to-end HIPAA compliance and an executed BAA, no agency required. It’s pre-launching to a small clinician cohort before general availability.

⚠️ Verify: “End-to-end HIPAA compliance + executed BAA” is a vendor claim. Before any real patient data: confirm who signs the BAA and for which services, where data lives, and what their incident-response obligations are. Get it in writing, not on the landing page.

😤 “A HIPAA wrapper around vibe-coded apps is a liability factory.” It could be — or it’s the missing rail that turns a thousand weekend prototypes into deployable tools. The interesting question is what the compliance layer actually checks about the code it deploys, not just the infrastructure it deploys onto.

😤 “Why not just learn to do compliant infra yourself?” You should understand it either way. But “every clinician-builder must become a cloud-compliance engineer” is exactly the kind of gatekeeping that kept clinical software in vendor hands for twenty years.

Kinetic Systems

A fresh Stanford spinout from Nigam Shah’s lab (Chief Data Scientist at Stanford Health Care) building physician workflow automation — understand, automate, and monetize your workflows. Early, light on public detail, but the pedigree is exactly the clinical-AI-evaluation lineage worth tracking.

⚠️ Verify: no public security or compliance documentation yet — treat any workflow that touches patient data as off-limits until a BAA and architecture details exist. [Also see Epic AI Factory]

🎙️ From the Pods

🎙️ Lifers with Christina Farr — “Dr. David Carmouche, Lumeris: Why AI is primary care’s best chance at survival”

Carmouche — ex-Ochsner, ex-Walmart Health, now Lumeris — says a customer wants a PCP panel of 10,000 patients, and the honest path even to 5,000 runs through autonomous AI. His strongest argument is longitudinal: the slowly progressive anemia, the creeping arthralgias across five years of notes — patterns AI sees trivially and a 15-minute-visit human structurally cannot.

💬 “Going from 1,700 to 10,000 is so massive. So let’s pick an interim point. Let’s pick 5,000.” — Dr. David Carmouche

💡 Builder take: The longitudinal-trend surface is wide open — tools that trend a panel’s quiet drift (weight, hemoglobin, eGFR) and surface it at the visit are buildable today on FHIR data you can already access.
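To make “trend a panel’s quiet drift” concrete, here is a rough sketch, assuming you already have one patient’s results for one lab as a list of FHIR Observation resources parsed into Python dicts. It reads the standard `effectiveDateTime` and `valueQuantity.value` elements and fits a least-squares slope in units per year. The function names are hypothetical, it assumes full dates (not partial ones like `2024-05`) and consistent units, and any cutoff you apply to the slope is a product decision, not clinical guidance.

```python
from datetime import date


def observation_points(observations: list[dict]) -> list[tuple[date, float]]:
    """Extract (date, value) pairs from FHIR Observation dicts, oldest first.

    Observations missing a date or a numeric value are skipped.
    Assumes effectiveDateTime starts with a full YYYY-MM-DD date.
    """
    points = []
    for obs in observations:
        when = obs.get("effectiveDateTime")
        value = obs.get("valueQuantity", {}).get("value")
        if when is None or value is None:
            continue
        points.append((date.fromisoformat(when[:10]), float(value)))
    return sorted(points)


def slope_per_year(points: list[tuple[date, float]]) -> float:
    """Ordinary least-squares slope of value against time, in units per year."""
    if len(points) < 2:
        raise ValueError("need at least two observations")
    first_day = points[0][0]
    xs = [(day - first_day).days / 365.25 for day, _ in points]
    ys = [value for _, value in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all observations share the same date")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


if __name__ == "__main__":
    # Invented hemoglobin values (g/dL) for a synthetic patient.
    hemoglobin = [
        {"effectiveDateTime": "2021-01-01", "valueQuantity": {"value": 14.0}},
        {"effectiveDateTime": "2022-01-01", "valueQuantity": {"value": 13.0}},
        {"effectiveDateTime": "2023-01-01", "valueQuantity": {"value": 12.0}},
    ]
    print(slope_per_year(observation_points(hemoglobin)))
```

Each of those values can sit inside the reference range while the slope tells the real story, which is the pattern Carmouche describes. A straight-line fit is the simplest possible choice; noisy labs may need something sturdier.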

🔇 Speaker Blindspot: False analogy. The autoland argument borrows certainty from aviation, where autoland is certified against decades of FAA test data, simulator hours, and mandatory incident reporting. Primary care AI has none of that validation infrastructure yet (see the X-ray benchmark at the top of this issue). The analogy argues for building the certification regime, not for trusting the autopilot.

🧰 Builder’s Tip

Mindset / Strategy — Schedule the second-user moment.

Your tool isn’t real until someone you didn’t coach uses it without you in the room. Most clinician-builders postpone that moment indefinitely — the demo is always “two weeks away” — because watching someone fumble your interface hurts.

Put it on the calendar this week: one colleague, one synthetic case (Synthea patient or invented vignette — never real PHI), and you sit on your hands. No narrating. Write down every place they hesitate; that list outranks your feature backlog.

You already run this loop clinically — you don’t trust a resident’s airway skills based on their description of one. Same standard for your own product.

💡 BTW: Drex DeFord — whose CISA story leads today’s pods — was a rock-n-roll DJ before joining the Air Force, where he spent 20 years and rose to CTO of Air Force Health’s worldwide operations, then ran IT for Scripps, Seattle Children’s, and Steward. The DJ-to-CIO story is real.

What are you building this week? Email and tell me (kevin@clinicians.build) — I read every one.

— Kevin

clinicians.build
