AI agents out-diagnose ED docs 🤖, Midjourney scans your whole body 🫀, Optimism bias dressed as federalism 🧑‍⚖️

Jun 18, 2026

🔬 The Big Thing

Two Nature papers moved agentic AI from “help me diagnose” to “manage the whole case” — and it beat the doctors

MIRA (Jakob Kather’s lab) ran two agents across 500 real emergency-department cases and beat four board-certified physicians on diagnostic accuracy — 87.8% vs 78.1% — plus therapy selection, guideline alignment, and medication safety.

The same week, Google’s AMIE managed 100 patients across three longitudinal visits and came out non-inferior to 21 primary care physicians, scoring higher on management-plan quality and treatment precision.

These aren’t chatbots answering trivia. MIRA navigated 11 tools and 85,000+ possible actions, ordered labs and cultures, held up against 880 adversarial prompts, and leaked case data in 0 of 933 runs.

But read the methods, not the abstract: text-only inputs, clean curated datasets, no images, no non-verbal cues, MIRA capped at 20 turns, AMIE tested against patient-actors. As Eric Topol put it, “nothing... that truly represents the practice of medicine.”

The thing that beat the doctors never met a diaphoretic, dyspneic patient who said “I’m fine.”

😤 “Simulation results. Wake me when it survives a real shift.” Fair — and also not the point. Two years ago the frontier couldn’t manage a single case end-to-end; now it’s non-inferior across three visits in a controlled trial. The trajectory is the news. The bottleneck didn’t disappear, it moved — from “can a model do this at all” to “can you prove it does this safely on your population.”

😤 “87.8% means it’s confidently wrong one time in eight.” Yes. And that’s the whole product. A system that’s right 88% of the time and can’t tell you which 12% is a liability; a system wrapped in a layer that routes the ambiguous cases to a human is a tool. The accuracy number is the marketing. The calibration is the medicine.

😤 “So AI is replacing doctors now.” It’s replacing the clean version of what doctors do. Nobody’s automated the part where you read the room.

💡 80/20: The model is rented and swappable — MIRA ran on GPT-4o, AMIE on Gemini. What’s yours is the acceptance test: a set of cases from your real population, with a pass/fail grader, that you can re-run against whatever model ships next. Build the test before you fall in love with the model. (Want to try it on synthetic data? See today’s Tip.)

📡 Builder’s Radar

Midjourney just announced a full-body scanner — the image-AI company is now a hardware-and-imaging company

Midjourney unveiled “Midjourney Medical” and the Midjourney Scanner — an ultrasound-based full-body imaging rig the company calls “Ultrasonic CT,” built with Butterfly Network’s ultrasound-on-chip (reportedly ~8,960 transducers in a ring, no radiation, no magnets).

Founder David Holz is self-funding it, talking about ~50,000 scanners and a billion scans a month, and a flagship “Midjourney Spa” in San Francisco.

The radiology isn’t the hard part — the downstream is. Holz admitted the demo images weren’t even AI-processed yet, and a billion low-resolution full-body scans a month is a tidal wave of incidentalomas looking for someone to adjudicate.

😤 “It’s a hot tub with an ultrasound probe.” Yeah, I guess so. But Butterfly-on-chip is real, and an AI-native company treating the scan as raw data for a model — rather than the finished product — is a genuinely different posture than legacy imaging.

💡 80/20: This is super weird. Someone like US more than ER docs, who would have guessed it was Midjourney?

Cardiac-monitoring vendor iRhythm got its patient data stolen — through a phone call

iRhythm disclosed in an SEC filing that an attacker socially engineered their way into third-party business apps and is now ransoming patient PHI. iRhythm has processed heartbeat data from 12M+ patients.

No zero-day, no exploit — somebody talked their way in. The most-quoted attack surface in health tech this year isn’t a model or an API. It’s a help-desk that wants to be helpful.

💡 80/20: If you build anything that touches PHI, your threat model has to include the human who’ll reset a password for a confident voice on the phone. Tabletop the social-engineering path, not just the firewall.

Quick hits

The five revenue streams hiding in every PBM contract. Pharmacist and self-funded-plan auditor Ginny Crisp lays out why single-point PBM reform keeps failing: there are five simultaneous money streams (spread pricing, rebate retention, admin fees, manufacturer-direct payments, owned-pharmacy margin) and banning one just reroutes margin to another. Her line: “a tollbooth where the toll collector also owns the cars, the gas stations, and the road repair contracts.” If you’re building price-transparency or appeals tooling, this is the plumbing you’re swimming in.

The moat is the lab, not the model. On Latent Space, Radical AI’s Joseph Krause argues that in materials science the model is cheap and the experimental data is the moat — so they built a “self-driving lab” that runs 25+ alloys a day — hundreds in months, against academia’s ~3,500 high-entropy compositions in 40 years. Swap “alloy” for “clinical workflow” and it’s the same lesson: whoever owns the hard-won real-world data owns the defensible part.

🎙️ From the Pods

🎙️ NEJM AI Grand Rounds — “OpenAI’s Karan Singal on HealthBench and the Future of Medical AI” [verify exact episode link before publish]

Singal, who leads health AI at OpenAI (and built MedPaLM before that), makes the case that the field’s real progress isn’t model size — it’s that we finally have meaningful evaluations like HealthBench, built with clinicians, that tell you whether a model is actually good at medicine.

🔇 Speaker Blindspot: Appeal to the artifact — a benchmark is “nature exposed to our method of questioning.” HealthBench measures what HealthBench measures; the failure modes that matter most are the ones nobody thought to write a test for. A great eval reduces uncertainty in the region you already suspected, not the region that’ll actually hurt you.

🎙️ Health Tech Nerds Radio — “The tasks AI should take off doctors’ plates — and the ones it shouldn’t” (Hashem Zikry, Counsel Health)

Zikry maps the regulatory patchwork — every state has filed AI legislation, ranging from New York/Colorado’s tight limits on patient-facing AI to Utah’s sandbox where AI is “fully practicing medicine” (still human-reviewed in practice). His prescription: a federal floor, then state experimentation on top.

💡 Builder take: Build for the floor that’s coming, not the gap that exists today. A tool that’s safe under the strictest current state law travels everywhere; one that only works in the Utah sandbox has a market of one state.

🔇 Speaker Blindspot: Optimism bias dressed as federalism — “laboratories of democracy” sounds tidy, but 50 divergent state regimes is a compliance nightmare for any builder operating across state lines, and the framing quietly assumes the floor arrives before the fragmentation calcifies. History says the fragmentation usually wins the race.

🎙️ The 229 Podcast — “Innovating at the Speed of Trust” (Shiv Rao, Abridge)

Rao’s framing: “It’s so easy to create a party trick, and a totally different endeavor to create an enterprise-grade, healthcare-grade product.” The conversation pushes past ambient documentation toward the revenue cycle as the next surface.

💡 Builder take: The demo earns the meeting; the enterprise-grade plumbing earns the contract. If your weekend prototype dazzles, that’s table stakes — the real work is everything between the party trick and production.

🔇 Speaker Blindspot: Motivated reasoning — a cardiologist-founder defining the moat as “really understanding the clinical workflow” is also, conveniently, defining it as a thing that’s easiest to cross if you already have his scale and his cap table. True, and self-serving at the same time.

🧰 Builder’s Tip

Tool Spotlight — turn the Big Thing into something you can actually run: open-source LLM eval with promptfoo.

The MIRA/AMIE papers argue the scarce skill is the acceptance test. You can build a tiny version of one tonight, on synthetic data, zero PHI. Install it: npx promptfoo@latest init, point it at a handful of Synthea synthetic patients, write one narrow clinical task (e.g., “flag any med-list discrepancy”), and define pass/fail assertions plus a few adversarial cases. Run npx promptfoo eval and you get a model-vs-model grid showing exactly where each one breaks.

The payoff: a reusable, model-swappable test harness — the portable artifact a CMIO actually wants to see, and the thing that stays yours when the underlying model gets deprecated next quarter. (Synthetic data only; never point it at real patient data outside a BAA-covered environment.)

📅 Upcoming: Adoption of AI in Clinical Care: Updates from the HHS RFI (ONC/HHS, Jun 25). Full events calendar →

What are you building this week? Email and tell me (kevin@clinicians.build) — I read every one.

— Kevin + AI agents

clinicians.build

Discussion about this post

Ready for more?