OpenAI ships ChatGPT for Clinicians 🩺, GLP-1 Bridge swallows BALANCE 💊, "predictive decoration" enters the lexicon 🎨
OpenAI just made ChatGPT free for every verified US clinician, and shipped the first serious benchmark for real clinician chat tasks.
OpenAI announced ChatGPT for Clinicians: free for any verified US physician, NP, PA, or pharmacist. Reusable "skills" for referral letters, prior auth, and patient instructions. Trusted clinical search with cited peer-reviewed sources. Deep research across medical journals. CME credits auto-tracked from clinical research sessions. An optional HIPAA BAA. Conversations not used for training. The AMA's 2026 survey says 72% of physicians now use AI clinically, up from 48% a year ago.
Alongside it: HealthBench Professional, an open benchmark for real clinician chat tasks across three categories (care consult, writing and documentation, and medical research). 525 rubric-graded tasks. OpenAI reports physicians rated 99.6% of responses safe and accurate across 6,924 pre-release conversations.
The benchmark sets the evaluation bar for the category. The category, as of this week, is no longer "general-purpose LLM with a medical prompt." It's clinician-calibrated chat with a measurable floor.
🎤 Haters
"This is just another model with a clinical coat of paint." The model is a GPT-5.4 variant. What's different isn't the weights; it's the distribution, the benchmark, and the clinician-adjacent primitives (skills, cited search, BAA). Ship enough free clinician AI and the old moat ("we're the one tuned on medicine") evaporates. Whether this is a good thing depends on where you sit.
"A Nature Medicine RCT literally just showed LLMs make patient self-assessment worse." Right: Bean et al. in this week's Doctor Penguin randomized 1,298 UK participants and found LLMs did not improve (and sometimes worsened) symptom triage vs. normal home resources. Two users with near-identical subarachnoid hemorrhage descriptions got opposite advice. But that study is about patients using general LLMs. ChatGPT for Clinicians is built for clinicians, who bring domain priors patients don't have. The category distinction matters, and HealthBench Professional is OpenAI's attempt to measure exactly that distinction.
"The BAA is optional." It is. That is the single most important sentence in the announcement and the one least likely to be highlighted in any compliance conversation. "Optional" means the default tier is not HIPAA-covered. Any clinician pasting a real patient encounter into a default account before provisioning the BAA is out of scope. Treat the BAA as a dependency, not a feature; a minimal gate sketch is below.
[nota bene: onboarding to this is kinda invasive and requires you to upload your driver's license (at least)]
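On the BAA point: the cheapest guardrail for any tooling you build on top of this is a gate that fails closed when the workspace isn't BAA-covered. A minimal sketch, assuming a hypothetical baa_signed provisioning flag and a caller that already knows whether the text contains PHI (both are stand-ins for your org's real compliance plumbing):

# Minimal PHI gate: refuse to call the API unless the workspace is BAA-covered.
# "baa_signed" is a hypothetical config flag you'd set during provisioning;
# swap in your org's real source of truth.
from dataclasses import dataclass

@dataclass
class WorkspaceConfig:
    workspace_id: str
    baa_signed: bool  # True only after the HIPAA BAA is executed for this tier

class PHIGateError(RuntimeError):
    pass

def send_clinical_prompt(cfg: WorkspaceConfig, prompt: str, contains_phi: bool) -> str:
    """Route a prompt to the model only if the compliance gate passes."""
    if contains_phi and not cfg.baa_signed:
        raise PHIGateError(
            f"Workspace {cfg.workspace_id} has no executed BAA; "
            "de-identify the prompt or provision the BAA tier first."
        )
    # ... actual API call would go here ...
    return "ok"

if __name__ == "__main__":
    default_tier = WorkspaceConfig("default", baa_signed=False)
    try:
        send_clinical_prompt(default_tier, "67M with chest pain, hx of ...", contains_phi=True)
    except PHIGateError as e:
        print(e)  # fails closed, which is the point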
💡 Builder's Radar
The Medicare GLP-1 Pilot just died. The "Bridge" got extended through 2027, and the prior-auth attestation is a PCP problem now.
CMS shelved the Part D piece of the BALANCE Model after UnitedHealthcare and Aetna declined to participate; without them, CMMI couldn't reach the 80% beneficiary coverage threshold. Instead, the Medicare GLP-1 Bridge was extended through December 31, 2027. The Bridge operates outside Part D: federal appropriation, not plan risk, and a $50/month copay that does not count toward the $2,100 OOP cap. Wegovy and Zepbound only. Prior auth attestation is required, and attestation means clinical eligibility documentation by the prescribing clinician.
🎤 Haters
"This is a policy story, not a builder story." The attestation workflow is the builder story. A panel of 2,000 primary-care adults likely contains 600-800 Bridge-eligible beneficiaries. If the attestation template isn't in the EHR before July, the clinic either skips the Bridge or files attestations its documentation can't clinically support. Someone has to build that template; a pre-screen sketch follows this list.
"Amazon One Medical already undercut this." The same week, Amazon launched GLP-1 through One Medical at $25/month insured or $149 cash (last week's issue covered the launch). Every minute of PCP scheduling friction on the Bridge is a referral out to Amazon. The retail-vs-PCP question for obesity medicine is now answered by workflow friction, not by clinical quality.
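On the attestation point above: the first build artifact is a panel pre-screen that turns "600-800 eligible" from an estimate into a worklist. A sketch under loud assumptions: the eligibility test here (Medicare coverage plus an active Wegovy or Zepbound order) is a placeholder for whatever CMS's actual Bridge criteria require, and every field name is invented:

# Panel pre-screen for Medicare GLP-1 Bridge attestation work.
# Eligibility logic is deliberately simplified (assumption: Medicare coverage
# plus an active Wegovy/Zepbound order); real criteria come from CMS guidance.
from dataclasses import dataclass

BRIDGE_DRUGS = {"wegovy", "zepbound"}  # the only two products the Bridge covers

@dataclass
class Patient:
    mrn: str
    has_medicare: bool
    active_glp1: str | None  # lowercase drug name, or None

def bridge_worklist(panel: list[Patient]) -> list[str]:
    """Return MRNs that need a clinical-eligibility attestation from the PCP."""
    return [
        p.mrn
        for p in panel
        if p.has_medicare and p.active_glp1 in BRIDGE_DRUGS
    ]

if __name__ == "__main__":
    panel = [
        Patient("1001", True, "wegovy"),
        Patient("1002", True, "semaglutide-other"),  # not a covered product
        Patient("1003", False, "zepbound"),          # not Medicare
    ]
    print(bridge_worklist(panel))  # -> ['1001']
    # Scale check from the piece: on a 2,000-patient panel, expect this
    # worklist to run into the hundreds. That is why it needs a template.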
"Predictive decoration": someone finally named the thing every hospitalist sees on every patient list.
Anvesh Narimiti, a hospitalist and CI fellow, published a LinkedIn carousel this week coining the phrase predictive decoration: any risk score surfaced in the EHR without a paired decision-support path. Epic's 30-day readmission risk column is his canonical example: it pages case management and checks a box, but does not change a single medication, stay length, or follow-up interval for the acute-stay hospitalist.
Jennifer Goldman (CMIO at Memorial Healthcare) steelmanned the counterpoint in the comments: the same score, read by the primary-care team running express-lane follow-ups, does drive behavior. The score isn't decoration when a downstream operation is designed around it; the hospitalist seat just may not be the seat at which to evaluate it.
🎤 Haters
"So the score is useful, just not to the hospitalist? Fine, that's not new." The naming is new, and naming shapes what you build. A score is decoration until a specific seat in the workflow acts on it. The builder question, and the implementation question, is: for whom is this score actionable, and what is the action? If the answer is "case management gets paged," the score isn't supporting the hospitalist; it's supporting case management's operating model. Design around that.
"This is just Graham Walker's 'data slop → model slop → publication slop' at a different layer." It's the downstream half of the same story. Walker's piece was about a single bad Kaggle dataset generating 124 papers and 1,500 citations; Narimiti's is about those models becoming a column on a patient list that doesn't change anything. Upstream: the data doesn't support the model. Downstream: the model doesn't support the decision. The full pipeline is the failure.
💡 80/20: A score that doesn't change a decision you'd make differently is decoration, not analytics. Try: for every AI-generated score, column, or alert you are about to build into a workflow, write the single sentence "If I see X, I will do Y instead of Z." If you can't write it, build the action first and the score second.
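One way to make that rule enforceable rather than aspirational: require the sentence as structured data before a score ships. A sketch; all names here are hypothetical:

# "If I see X, I will do Y instead of Z" as a required artifact.
# A score without a completed ActionSpec is decoration by construction.
from dataclasses import dataclass

@dataclass(frozen=True)
class ActionSpec:
    score_name: str
    acting_seat: str      # who acts on it (hospitalist? PCP? case management?)
    trigger: str          # X: the observation
    action: str           # Y: what changes
    counterfactual: str   # Z: what would have happened otherwise

def register_score(spec: ActionSpec) -> None:
    """Refuse to register a score until every field of the sentence is filled in."""
    for field_name in ("acting_seat", "trigger", "action", "counterfactual"):
        if not getattr(spec, field_name).strip():
            raise ValueError(
                f"{spec.score_name}: missing '{field_name}'. "
                "If you can't write the sentence, build the action first."
            )
    print(f"registered {spec.score_name} for seat '{spec.acting_seat}'")

if __name__ == "__main__":
    register_score(ActionSpec(
        score_name="30d-readmit-risk",
        acting_seat="primary-care follow-up team",
        trigger="risk >= 0.4 at discharge",
        action="book express-lane visit within 7 days",
        counterfactual="standard 30-day follow-up interval",
    ))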
Cloudflare ran 131,246 AI code reviews in one month. The break-glass override rate was 0.6%.
Cloudflare Engineering published the numbers this week: a custom multi-agent code review system on top of OpenCode, seven specialized agents per merge request (security, performance, code quality, etc.), median 3m 39s per review, $1.19 average cost, 85.7% cache hit rate across 120 billion tokens. Most importantly: engineers overrode the AI reviewer on only 0.6% of merge requests.
🎤 Haters
"Code review is not clinical decision support." It's not. But the 0.6% override rate is the closest public datapoint I've seen for "what does mature agentic adoption look like in a regulated production workflow." The structure (multiple specialized agents, per-task cost budget, circuit breakers, overrides logged and measurable) is the structure a serious clinical-AI deployment needs. Healthcare has no equivalent published metric. That's a gap.
"Measuring override rate is the wrong metric for medicine." Partially fair: in medicine the consequence of not overriding can be the harm itself, whereas in code review the CI pipeline catches a lot. But not having an override metric at all is worse than having the wrong one. Build the override telemetry before you ship the agent.
💡 80/20: Override rate is the single most useful trust metric for an agent in production, and almost no health-AI tool publishes one. Try: in the first week after any AI tool you build goes live, instrument override tracking: per-user, per-decision, with a free-text reason. You'll learn more from the first 100 overrides than from any pre-launch benchmark.
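The whole instrument is one append-only log and one rate query. A minimal sketch, with invented field names and an in-memory list standing in for your real event store:

# Override telemetry for an AI tool in production: log every decision,
# log every override with a free-text reason, report the rate per user.
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class OverrideEvent:
    user_id: str
    decision_id: str
    overridden: bool
    reason: str = ""  # required free text whenever overridden is True

@dataclass
class OverrideLog:
    events: list[OverrideEvent] = field(default_factory=list)

    def record(self, ev: OverrideEvent) -> None:
        if ev.overridden and not ev.reason.strip():
            raise ValueError("an override without a reason teaches you nothing")
        self.events.append(ev)

    def rate_by_user(self) -> dict[str, float]:
        seen, hits = defaultdict(int), defaultdict(int)
        for ev in self.events:
            seen[ev.user_id] += 1
            hits[ev.user_id] += ev.overridden
        return {u: hits[u] / seen[u] for u in seen}

if __name__ == "__main__":
    log = OverrideLog()
    log.record(OverrideEvent("dr_a", "rx-001", overridden=False))
    log.record(OverrideEvent("dr_a", "rx-002", overridden=True, reason="contraindicated with CKD"))
    print(log.rate_by_user())  # {'dr_a': 0.5}
    # For scale: Cloudflare's 0.6% over 131,246 reviews is roughly 787 overrides a month.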
🧰 Builder's Tip
Tool Spotlight: anthropics/healthcare, the Claude Code healthcare marketplace that has been sitting there since January.
[Here today because of today's Anthropic healthcare webinar.] Anthropic published an official healthcare plugin marketplace at github.com/anthropics/healthcare at JPM26 on January 9, 2026. It is at v1.0.0 with no commits since. It ships three Agent Skills: fhir-developer@healthcare (FHIR R4, LOINC, SNOMED CT, RxNorm patterns), prior-auth-review@healthcare (NPI / ICD-10 / CMS Coverage / CPT checks plus medical-necessity summarization), and clinical-trial-protocol@healthcare (FDA/NIH-compliant protocol scaffolding), plus four remote MCP servers covering CMS Coverage, the NPI Registry, PubMed, and ICD-10 codes.
Install the marketplace and a skill in two commands inside Claude Code:
/plugin marketplace add anthropics/healthcare
/plugin install fhir-developer@healthcare
You now have a FHIR-aware assistant with live access to CMS Coverage and the NPI Registry, and no API keys to set up. That is a real starter kit for a weekend prototype: synthetic patient data via Synthea, a FHIR-aware skill, and MCP access to coverage and identifier lookup, all free, all local to your Claude Code session.
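If you want something concrete to hand the fhir-developer skill on day one, two synthetic R4 resources are enough to exercise its FHIR and LOINC knowledge. The patient below is invented; the LOINC code (29463-7, body weight) and the R4 field shapes are real:

# Two synthetic FHIR R4 resources to prototype against: a Patient and a
# linked body-weight Observation (LOINC 29463-7). No real data anywhere.
import json

patient = {
    "resourceType": "Patient",
    "id": "example-001",
    "name": [{"use": "official", "family": "Testperson", "given": ["Alex"]}],
    "gender": "female",
    "birthDate": "1961-04-09",
}

weight_obs = {
    "resourceType": "Observation",
    "id": "example-001-weight",
    "status": "final",
    "code": {
        "coding": [{
            "system": "http://loinc.org",
            "code": "29463-7",
            "display": "Body weight",
        }]
    },
    "subject": {"reference": "Patient/example-001"},
    "valueQuantity": {
        "value": 94.3,
        "unit": "kg",
        "system": "http://unitsofmeasure.org",
        "code": "kg",
    },
}

print(json.dumps([patient, weight_obs], indent=2))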
The quiet tell: three months of community silence. The first community PR against anthropics/healthcare (a HIPAA audit skill, a medical-coding abstraction skill, a discharge-summary-extraction skill) lands on the canonical surface with almost zero noise. If you've been looking for a concrete starting point for agentic clinical tooling, this is it.
What are you building this week? Reply and tell me; I read every one.
– Kevin


