ChatGPT Health flunks triage 🚨, Pharmacy AI raises $80M 💊, code proofs go open-source 🔓
OpenAI built ChatGPT Health with 260 physicians and 600,000 rounds of clinician feedback. It just failed its first independent safety evaluation, undertriaging 52% of emergencies, including a case where its own reasoning identified respiratory failure while the final answer told the patient to wait. Nate’s Newsletter pulled the thread yesterday and found four structural failure modes that aren’t medical. They’re properties of how every LLM agent behaves in production. If you’re building clinical tools, the question isn’t whether your agent has these blind spots. It’s whether you’ve built the infrastructure to find them.
🔬 The Big Thing
Your agent knows the answer is wrong. It says it anyway. And the evaluation frameworks most people use won’t catch it.
Nate’s Newsletter published a piece yesterday that every clinician-builder should read: “A Single Sentence from a Family Member Shifted an AI Diagnosis 12x. That Anchoring Bias Is in Your Agents Right Now.” He’s pulling apart a Nature Medicine study from the Mount Sinai team — published last month but now getting its second wave of analysis as the implications sink in — and extracting failure patterns that apply far beyond healthcare.
The study results are stark. Among cases that three independent physicians unanimously classified as emergencies, ChatGPT Health directed patients away from the ER 52% of the time. It correctly caught classical presentations — stroke, anaphylaxis — but missed the subtle ones. Diabetic ketoacidosis. Impending respiratory failure. In one asthma scenario, the tool’s own reasoning trace identified early signs of respiratory failure. The output said to schedule an appointment in 24 to 48 hours. The system knew the answer and gave a different one.
The anchoring bias finding is what stopped me cold. When a family member or friend minimized the patient’s symptoms — a single dismissive sentence — triage recommendations shifted toward less urgent care with an odds ratio of 11.7. One sentence. A twelve-fold shift. In a tool used by 40 million people daily. Think about that in the context of any clinical AI tool: the patient says their chest hurts, their spouse says “he always does this,” and the agent downgrades the urgency. That’s not a hallucination. It’s a systematic susceptibility to social anchoring that mirrors a well-known human cognitive bias — except humans learn to recognize it, and this system doesn’t flag that it’s happening.
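A quick aside on reading that number, since odds ratios get misread as raw multipliers: writing p1 for the probability of a less-urgent recommendation when the dismissive sentence is present and p0 for the probability without it, the reported figure is

$$\mathrm{OR} = \frac{p_1/(1-p_1)}{p_0/(1-p_0)} \approx 11.7$$

so it’s the odds of the downgrade, not the raw probability, that shift nearly twelve-fold.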
Nate’s contribution is extracting four structural failure modes from this study and showing they’re not medical problems — they’re properties of how LLMs behave in production across every domain: reasoning-output disconnect (the model identifies the problem correctly in its chain of thought but gives a contradictory answer); anchoring susceptibility (contextual framing shifts outputs in ways the model doesn’t disclose); edge-case collapse (performance follows an inverted U, with the most dangerous failures at the clinical extremes where getting it right matters most); and safeguard inconsistency (the suicide-crisis banner fired on vague emotional distress more than on patients describing specific plans).
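The first of those, the reasoning-output disconnect, is also the cheapest to start screening for if you log both the reasoning trace and the final answer. A minimal sketch, assuming a naive keyword list as a stand-in for a real red-flag detector and assuming your own urgency labels:

```python
# Crude screen for reasoning-output disconnect: flag any case where the
# reasoning trace names an emergency-level finding but the final
# recommendation is non-emergent. RED_FLAGS and the labels below are
# illustrative placeholders, not a validated clinical vocabulary.
RED_FLAGS = ["respiratory failure", "diabetic ketoacidosis", "anaphylaxis", "stroke"]
NON_EMERGENT = {"appointment_24_48h", "self_care"}

def reasoning_output_disconnect(reasoning_trace: str, recommendation: str) -> bool:
    mentions_red_flag = any(flag in reasoning_trace.lower() for flag in RED_FLAGS)
    return mentions_red_flag and recommendation in NON_EMERGENT
```

A keyword list will miss plenty and false-positive on negations, but even a screen this crude would have flagged the asthma case above.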
Here’s what I keep coming back to: OpenAI did the safety work. Two hundred sixty physicians. Six hundred thousand feedback rounds. A custom safety framework. And these failures went undetected because the evaluation methods, the ones most teams use, weren’t designed to find them. The Mount Sinai team used a factorial design: 60 vignettes across 21 clinical domains, each tested under 16 contextual conditions (race, gender, social dynamics, insurance status), yielding 960 interactions. That’s orders of magnitude more rigorous than most agent evaluations, and the cost of that rigor is front-loaded: Nate estimates month six costs a fraction of month one.
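To make the scale of that design concrete, here’s a minimal sketch of how a factorial grid like that gets assembled; the vignette IDs and the four binary condition factors below are hypothetical placeholders, not the study’s actual materials:

```python
from itertools import product

# Hypothetical stand-ins for the study's materials: 60 vignette IDs and
# four binary contextual factors, giving 2*2*2*2 = 16 conditions.
vignettes = [f"vignette_{i:02d}" for i in range(60)]
contextual_conditions = list(product(
    ["white", "black"],                  # race framing
    ["male", "female"],                  # gender framing
    ["family_minimizes", "no_comment"],  # social dynamic
    ["insured", "uninsured"],            # insurance status
))

test_cases = [{"vignette": v, "condition": c}
              for v, c in product(vignettes, contextual_conditions)]
print(len(test_cases))  # 60 vignettes x 16 conditions = 960 interactions
```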
If you’re building any clinical tool that uses an LLM — even one that “just” summarizes notes or “just” answers patient messages — ask yourself: have you tested what happens when the context shifts? When a family member minimizes symptoms? When the patient is uninsured versus insured? When the presentation is subtle rather than classical? If the answer is “we tested it on straightforward cases and it worked great,” you’ve evaluated the middle of the curve. The failures live at the edges, and the edges are where patients get hurt.
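If you want a concrete starting point, a paired perturbation test is enough to surface the anchoring failure specifically. The run_triage callable and the urgency labels below are placeholders you would wire up to your own agent:

```python
from typing import Callable

# Paired perturbation test: run the same vignette with and without a
# dismissive family-member sentence appended, and flag any downgrade.
# run_triage is whatever calls your agent and maps its answer onto these
# placeholder urgency labels.
URGENCY = {"er_now": 3, "urgent_care_today": 2, "appointment_24_48h": 1, "self_care": 0}
DISMISSIVE = " A family member adds: 'He always does this, it's probably nothing.'"

def anchoring_downgrade(vignette: str, run_triage: Callable[[str], str]) -> bool:
    baseline = URGENCY[run_triage(vignette)]
    anchored = URGENCY[run_triage(vignette + DISMISSIVE)]
    return anchored < baseline  # any shift toward less urgent care

def downgrade_rate(vignettes: list[str], run_triage: Callable[[str], str]) -> float:
    flags = [anchoring_downgrade(v, run_triage) for v in vignettes]
    return sum(flags) / len(flags)
```

Report the downgrade rate across your whole vignette set rather than a single pass/fail; the Mount Sinai result is a rate, not an anecdote.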
Nate’s Newsletter · Nature Medicine study · Mount Sinai press release
📡 Builder’s Radar
Latent raised $80M at a $600M valuation — and a pharmacist’s quote tells you why.
Latent announced an $80M round yesterday, led by Spark Capital and Transformation Capital, valuing the company at $600M. They build AI for specialty pharmacy prior authorization — the paperwork bottleneck between a physician ordering a specialty drug and the patient actually getting it. The quote that caught my eye came from Ochsner Health’s chief pharmacy officer: “I couldn’t hire enough people. There was no way for me to continue that growth volume unless I turned to some type of tool to help with the workflows.” Latent has 45 health systems as clients, including Yale New Haven, UCSF, Mount Sinai, Vanderbilt, and Ochsner. For clinician-builders, this is a useful case study in where AI capital is flowing: not to replace clinicians, but to automate the administrative friction that bottlenecks clinical workflows. Specialty pharmacy prior auth is a clinician pain point that requires deep domain knowledge to specify correctly — exactly the kind of problem where a clinician-builder has an edge.
Turquoise Health raised $40M to become the operating system for healthcare contracts.
Turquoise Health closed a $40M Series C on Monday, led by Oak HC/FT with a16z and Adams Street participating, bringing total funding to $95M. Their platform uses AI to ingest static PDF-style payer contracts and automatically tag rates and provisions, with a conversational AI layer called AskTQ that reduces weeks of manual contract research to seconds. They currently serve 10 of the top 25 health systems and 4 of the top 5 national payers. The clinician-builder angle: healthcare pricing and contract management is one of those deeply unsexy problem domains where the people who understand it best — revenue cycle leaders, pharmacy directors, practice managers — have been doing manual work that AI should have eaten years ago. Turquoise’s traction suggests the market agrees.
Turquoise Health announcement · HIT Consultant
🛠️ From the Workbench
Leanstral: Mistral shipped an open-source agent that formally proves your code is correct.
Mistral released Leanstral on Monday: the first open-source AI agent purpose-built for Lean 4, the formal proof assistant used in mathematical research and verified software development. It’s a mixture-of-experts model with 120B total parameters but only 6B active per token, built to formally prove that AI-generated code meets its specifications. Apache 2.0 licensed, free API endpoint available.
Why this matters for clinician-builders: all week we’ve been talking about the gap between “AI can write code” and “AI can write code you can trust.” Docker sandboxes contain the blast radius. RULES files give your agent memory. But formal verification is a different category — it’s mathematical proof that code behaves as specified. Leanstral isn’t something most clinician-builders will use directly today (Lean 4 has a steep learning curve). But the direction it points toward — AI agents that can prove their own outputs are correct — is the endgame for trustworthy vibe coding in regulated environments. When someone asks “how do you know this code does what you think it does?”, formal verification is the answer that doesn’t require human code review. We’re not there yet for general-purpose code. But the fact that an open-source agent can now do this for Lean is a meaningful step.
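To make “mathematical proof that code behaves as specified” concrete, here’s a toy Lean 4 example written by hand for this newsletter (not Leanstral output): a hypothetical capped-dose helper and a theorem that its result can never exceed the stated maximum.

```lean
-- Toy illustration of a formal spec: cappedDose never returns more than maxDose.
-- Hypothetical example, not Leanstral output.
def cappedDose (requested maxDose : Nat) : Nat :=
  if requested ≤ maxDose then requested else maxDose

theorem cappedDose_le_max (requested maxDose : Nat) :
    cappedDose requested maxDose ≤ maxDose := by
  unfold cappedDose
  split
  · assumption
  · exact Nat.le_refl maxDose
```

The theorem isn’t a test that passed on some inputs; it’s a proof checked by the Lean kernel that the property holds for every input. Leanstral’s pitch is generating proofs like this, and much harder ones, automatically.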
Leanstral announcement · The Register coverage
What are you building this week? Reply and tell me — I read every one.
— Kevin

