OpenAI ships ChatGPT for Clinicians 🩺, GLP-1 Bridge swallows BALANCE 💊, "predictive decoration" enters the lexicon 🎨
OpenAI just made ChatGPT free for every verified US clinician, and shipped the first serious benchmark for real clinician chat tasks.
OpenAI announced ChatGPT for Clinicians: free for any verified US physician, NP, PA, or pharmacist. Reusable "skills" for referral letters, prior auth, and patient instructions. Trusted clinical search with cited peer-reviewed sources. Deep research across medical journals. CME credits auto-tracked from clinical research sessions. An optional HIPAA BAA. Conversations not used for training. The AMA's 2026 survey says 72% of physicians now use AI clinically, up from 48% a year ago.
Alongside it: HealthBench Professional, an open benchmark for real clinician chat tasks across three categories (care consult, writing and documentation, and medical research). 525 rubric-graded tasks. OpenAI reports physicians rated 99.6% of responses safe and accurate across 6,924 pre-release conversations.
The benchmark sets the evaluation bar for the category. The category, as of this week, is no longer "general-purpose LLM with a medical prompt." It's clinician-calibrated chat with a measurable floor.
🎤 Haters
"This is just another model with a clinical coat of paint." The model is a GPT-5.4 variant. What's different isn't the weights; it's the distribution, the benchmark, and the clinician-adjacent primitives (skills, cited search, BAA). Ship enough free clinician AI and the old moat ("we're the one tuned on medicine") evaporates. Whether this is a good thing depends on where you sit.
"A Nature Medicine RCT literally just showed LLMs make patient self-assessment worse." Right: Bean et al. in this week's Doctor Penguin randomized 1,298 UK participants and found LLMs did not improve (and sometimes worsened) symptom triage vs. normal home resources. Two users with near-identical subarachnoid hemorrhage descriptions got opposite advice. But that study is about patients using general LLMs. ChatGPT for Clinicians is built for clinicians, who bring domain priors patients don't have. The category distinction matters, and HealthBench Professional is OpenAI's attempt to measure exactly that distinction.
"The BAA is optional." It is. That is the single most important sentence in the announcement and the one least likely to be highlighted in any compliance conversation. "Optional" means the default tier is not HIPAA-covered. Any clinician pasting a real patient encounter into a default account before provisioning the BAA is out of scope. Treat the BAA as a dependency, not a feature; a minimal gate sketch is below.
[nota bene: onboarding to this is kinda invasive and requires you to upload your driver's license (at least)]
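On the BAA point: the cheapest guardrail for any tooling you build on top of this is a gate that fails closed when the workspace isn't BAA-covered. A minimal sketch, assuming a hypothetical baa_signed provisioning flag and a caller that already knows whether the text contains PHI (both are stand-ins for your org's real compliance plumbing):

# Minimal PHI gate: refuse to call the API unless the workspace is BAA-covered.
# "baa_signed" is a hypothetical config flag you'd set during provisioning;
# swap in your org's real source of truth.
from dataclasses import dataclass

@dataclass
class WorkspaceConfig:
    workspace_id: str
    baa_signed: bool  # True only after the HIPAA BAA is executed for this tier

class PHIGateError(RuntimeError):
    pass

def send_clinical_prompt(cfg: WorkspaceConfig, prompt: str, contains_phi: bool) -> str:
    """Route a prompt to the model only if the compliance gate passes."""
    if contains_phi and not cfg.baa_signed:
        raise PHIGateError(
            f"Workspace {cfg.workspace_id} has no executed BAA; "
            "de-identify the prompt or provision the BAA tier first."
        )
    # ... actual API call would go here ...
    return "ok"

if __name__ == "__main__":
    default_tier = WorkspaceConfig("default", baa_signed=False)
    try:
        send_clinical_prompt(default_tier, "67M with chest pain, hx of ...", contains_phi=True)
    except PHIGateError as e:
        print(e)  # fails closed, which is the point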
💡 Builder's Radar
The Medicare GLP-1 Pilot just died. The "Bridge" got extended through 2027, and the prior-auth attestation is a PCP problem now.
CMS shelved the Part D piece of the BALANCE Model after UnitedHealthcare and Aetna declined to participate; without them, CMMI couldn't reach the 80% beneficiary coverage threshold. Instead, the Medicare GLP-1 Bridge was extended through December 31, 2027. The Bridge operates outside Part D: federal appropriation, not plan risk, and a $50/month copay that does not count toward the $2,100 OOP cap. Wegovy and Zepbound only. Prior auth attestation is required, and attestation means clinical eligibility documentation by the prescribing clinician.
🎤 Haters
"This is a policy story, not a builder story." The attestation workflow is the builder story. A panel of 2,000 primary-care adults likely contains 600-800 Bridge-eligible beneficiaries. If the attestation template isn't in the EHR before July, the clinic either skips the Bridge or files attestations its documentation can't clinically support. Someone has to build that template; a pre-screen sketch follows this list.
"Amazon One Medical already undercut this." The same week, Amazon launched GLP-1 through One Medical at $25/month insured or $149 cash (last week's issue covered the launch). Every minute of PCP scheduling friction on the Bridge is a referral out to Amazon. The retail-vs-PCP question for obesity medicine is now answered by workflow friction, not by clinical quality.
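On the attestation point above: the first build artifact is a panel pre-screen that turns "600-800 eligible" from an estimate into a worklist. A sketch under loud assumptions: the eligibility test here (Medicare coverage plus an active Wegovy or Zepbound order) is a placeholder for whatever CMS's actual Bridge criteria require, and every field name is invented:

# Panel pre-screen for Medicare GLP-1 Bridge attestation work.
# Eligibility logic is deliberately simplified (assumption: Medicare coverage
# plus an active Wegovy/Zepbound order); real criteria come from CMS guidance.
from dataclasses import dataclass

BRIDGE_DRUGS = {"wegovy", "zepbound"}  # the only two products the Bridge covers

@dataclass
class Patient:
    mrn: str
    has_medicare: bool
    active_glp1: str | None  # lowercase drug name, or None

def bridge_worklist(panel: list[Patient]) -> list[str]:
    """Return MRNs that need a clinical-eligibility attestation from the PCP."""
    return [
        p.mrn
        for p in panel
        if p.has_medicare and p.active_glp1 in BRIDGE_DRUGS
    ]

if __name__ == "__main__":
    panel = [
        Patient("1001", True, "wegovy"),
        Patient("1002", True, "semaglutide-other"),  # not a covered product
        Patient("1003", False, "zepbound"),          # not Medicare
    ]
    print(bridge_worklist(panel))  # -> ['1001']
    # Scale check from the piece: on a 2,000-patient panel, expect this
    # worklist to run into the hundreds. That is why it needs a template.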
"Predictive decoration": someone finally named the thing every hospitalist sees on every patient list.
Anvesh Narimiti, a hospitalist and CI fellow, published a LinkedIn carousel this week coining the phrase predictive decoration: any risk score surfaced in the EHR without a paired decision-support path. Epic's 30-day readmission risk column is his canonical example: it pages case management and checks a box, but does not change a single medication, stay length, or follow-up interval for the acute-stay hospitalist.
Jennifer Goldman (CMIO at Memorial Healthcare) steelmanned the counterpoint in the comments: the same score, read by the primary-care team running express-lane follow-ups, does drive behavior. The score isn't decoration when a downstream operation is designed around it; the hospitalist seat just may not be the seat at which to evaluate it.
🎤 Haters
"So the score is useful, just not to the hospitalist? Fine, that's not new." The naming is new, and naming shapes what you build. A score is decoration until a specific seat in the workflow acts on it. The builder question, and the implementation question, is: for whom is this score actionable, and what is the action? If the answer is "case management gets paged," the score isn't supporting the hospitalist; it's supporting case management's operating model. Design around that.
"This is just Graham Walker's 'data slop → model slop → publication slop' at a different layer." It's the downstream half of the same story. Walker's piece was about a single bad Kaggle dataset generating 124 papers and 1,500 citations; Narimiti's is about those models becoming a column on a patient list that doesn't change anything. Upstream: the data doesn't support the model. Downstream: the model doesn't support the decision. The full pipeline is the failure.
💡 80/20: A score that doesn't change a decision you'd make differently is decoration, not analytics. Try: for every AI-generated score, column, or alert you are about to build into a workflow, write the single sentence "If I see X, I will do Y instead of Z." If you can't write it, build the action first and the score second.
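One way to make that rule enforceable rather than aspirational: require the sentence as structured data before a score ships. A sketch; all names here are hypothetical:

# "If I see X, I will do Y instead of Z" as a required artifact.
# A score without a completed ActionSpec is decoration by construction.
from dataclasses import dataclass

@dataclass(frozen=True)
class ActionSpec:
    score_name: str
    acting_seat: str      # who acts on it (hospitalist? PCP? case management?)
    trigger: str          # X: the observation
    action: str           # Y: what changes
    counterfactual: str   # Z: what would have happened otherwise

def register_score(spec: ActionSpec) -> None:
    """Refuse to register a score until every field of the sentence is filled in."""
    for field_name in ("acting_seat", "trigger", "action", "counterfactual"):
        if not getattr(spec, field_name).strip():
            raise ValueError(
                f"{spec.score_name}: missing '{field_name}'. "
                "If you can't write the sentence, build the action first."
            )
    print(f"registered {spec.score_name} for seat '{spec.acting_seat}'")

if __name__ == "__main__":
    register_score(ActionSpec(
        score_name="30d-readmit-risk",
        acting_seat="primary-care follow-up team",
        trigger="risk >= 0.4 at discharge",
        action="book express-lane visit within 7 days",
        counterfactual="standard 30-day follow-up interval",
    ))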
Cloudflare ran 131,246 AI code reviews in one month. The break-glass override rate was 0.6%.
Cloudflare Engineering published the numbers this week: a custom multi-agent code review system on top of OpenCode, seven specialized agents per merge request (security, performance, code quality, etc.), median 3m 39s per review, $1.19 average cost, 85.7% cache hit rate across 120 billion tokens. Most importantly: engineers overrode the AI reviewer on only 0.6% of merge requests.
🎤 Haters
"Code review is not clinical decision support." It's not. But the 0.6% override rate is the closest public datapoint I've seen for "what does mature agentic adoption look like in a regulated production workflow." The structure (multiple specialized agents, per-task cost budget, circuit breakers, overrides logged and measurable) is the structure a serious clinical-AI deployment needs. Healthcare has no equivalent published metric. That's a gap.
"Measuring override rate is the wrong metric for medicine." Partially fair: in medicine the consequence of not overriding can be the harm itself, whereas in code review the CI pipeline catches a lot. But not having an override metric at all is worse than having the wrong one. Build the override telemetry before you ship the agent.
💡 80/20: Override rate is the single most useful trust metric for an agent in production, and almost no health-AI tool publishes one. Try: in the first week after any AI tool you build goes live, instrument override tracking: per-user, per-decision, with a free-text reason. You'll learn more from the first 100 overrides than from any pre-launch benchmark.
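The whole instrument is one append-only log and one rate query. A minimal sketch, with invented field names and an in-memory list standing in for your real event store:

# Override telemetry for an AI tool in production: log every decision,
# log every override with a free-text reason, report the rate per user.
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class OverrideEvent:
    user_id: str
    decision_id: str
    overridden: bool
    reason: str = ""  # required free text whenever overridden is True

@dataclass
class OverrideLog:
    events: list[OverrideEvent] = field(default_factory=list)

    def record(self, ev: OverrideEvent) -> None:
        if ev.overridden and not ev.reason.strip():
            raise ValueError("an override without a reason teaches you nothing")
        self.events.append(ev)

    def rate_by_user(self) -> dict[str, float]:
        seen, hits = defaultdict(int), defaultdict(int)
        for ev in self.events:
            seen[ev.user_id] += 1
            hits[ev.user_id] += ev.overridden
        return {u: hits[u] / seen[u] for u in seen}

if __name__ == "__main__":
    log = OverrideLog()
    log.record(OverrideEvent("dr_a", "rx-001", overridden=False))
    log.record(OverrideEvent("dr_a", "rx-002", overridden=True, reason="contraindicated with CKD"))
    print(log.rate_by_user())  # {'dr_a': 0.5}
    # For scale: Cloudflare's 0.6% over 131,246 reviews is roughly 787 overrides a month.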
🧰 Builder's Tip
Tool Spotlight: anthropics/healthcare, the Claude Code healthcare marketplace that has been sitting there since January.
[Here today because of today's Anthropic healthcare webinar.] Anthropic published an official healthcare plugin marketplace at github.com/anthropics/healthcare at JPM26 on January 9, 2026. It is at v1.0.0 with no commits since. It ships three Agent Skills: fhir-developer@healthcare (FHIR R4, LOINC, SNOMED CT, RxNorm patterns), prior-auth-review@healthcare (NPI / ICD-10 / CMS Coverage / CPT checks plus medical-necessity summarization), and clinical-trial-protocol@healthcare (FDA/NIH-compliant protocol scaffolding), plus four remote MCP servers covering CMS Coverage, the NPI Registry, PubMed, and ICD-10 codes.
Install the marketplace and a skill in two commands inside Claude Code:
/plugin marketplace add anthropics/healthcare
/plugin install fhir-developer@healthcare
You now have a FHIR-aware assistant with live access to CMS Coverage and the NPI Registry, and no API keys to set up. That is a real starter kit for a weekend prototype: synthetic patient data via Synthea, a FHIR-aware skill, and MCP access to coverage and identifier lookup, all free, all local to your Claude Code session.
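If you want something concrete to hand the fhir-developer skill on day one, two synthetic R4 resources are enough to exercise its FHIR and LOINC knowledge. The patient below is invented; the LOINC code (29463-7, body weight) and the R4 field shapes are real:

# Two synthetic FHIR R4 resources to prototype against: a Patient and a
# linked body-weight Observation (LOINC 29463-7). No real data anywhere.
import json

patient = {
    "resourceType": "Patient",
    "id": "example-001",
    "name": [{"use": "official", "family": "Testperson", "given": ["Alex"]}],
    "gender": "female",
    "birthDate": "1961-04-09",
}

weight_obs = {
    "resourceType": "Observation",
    "id": "example-001-weight",
    "status": "final",
    "code": {
        "coding": [{
            "system": "http://loinc.org",
            "code": "29463-7",
            "display": "Body weight",
        }]
    },
    "subject": {"reference": "Patient/example-001"},
    "valueQuantity": {
        "value": 94.3,
        "unit": "kg",
        "system": "http://unitsofmeasure.org",
        "code": "kg",
    },
}

print(json.dumps([patient, weight_obs], indent=2))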
The quiet tell: three months of community silence. The first community PR against anthropics/healthcare (a HIPAA audit skill, a medical-coding abstraction skill, a discharge-summary-extraction skill) lands on the canonical surface with almost zero noise. If you've been looking for a concrete starting point for agentic clinical tooling, this is it.
What are you building this week? Reply and tell me; I read every one.
– Kevin


