TEFCA's referee is an algorithm 🤖, Claude plays doctor 🩺, Real questions beat board exams
Who referees the network? Right now, an algorithm nobody can audit.
This week HHS touted that TEFCA, the national backbone that every FHIR-based product eventually leans on, has moved more than a billion health records, and in the same motion announced new oversight muscle.
That muscle is a five-year contract worth up to $5.62M, awarded to Alliance Global Tech (AGT), a little-known 55-person federal IT contractor, to audit TEFCA participants and refer "civilly or criminally actionable" behavior, including information blocking and fraud, to OCR, HHS-OIG, and the DOJ.
Here's the part that should stop you: AGT's website briefly advertised that it uses AI to flag participants for review, then scrubbed the claim, and the rules that algorithm runs on are undisclosed.
So the same office (ASTP/ONC) that proposed deleting more than half its certification rulebook this past winter is now enforcing what's left through a process no outside party can inspect. As one emergency-physician-turned-Epic-consultant put it, health systems are being asked "to walk a tightrope blindfolded, with steel shoes so they cannot feel the rope."
🤔 "This is just standard federal contracting: every network gets an auditor." Sure. But most auditors publish their methodology. When the reviewer is a model whose "rules of the road" are undisclosed, "we audited you" and "the algorithm flagged you" become the same sentence, and you can't appeal a black box.
❓ If federal oversight is going to run on a model, where's the eval harness for the eval-er? We demand model cards and audit trails from every vendor selling into a hospital. Who's holding the government's compliance algorithm to the same bar it's about to enforce?
🧪 NEW: Try today's interactives (let's see how long we can keep this "factory" up)
The Inverse-Care Explorer: Wearables skew toward the healthy and wealthy. But CDC data shows chronic disease clusters where coverage is thinnest. Explore the gap, state by state.
AI built with real public-health data and hosted on www.clinicians.dev
📡 Builder's Radar
The benchmark you pick decides who wins, so pick the one made of real questions.
A new blinded evaluation (Real-POCQi, arXiv) did something most AI benchmarks don't: it used 620 questions physicians actually typed into a clinical tool, across 30 specialties, and had 149 practicing physicians in 36 states grade the answers, with each judge specialty-matched to the question.
The specialized tool (OpenEvidence) beat three frontier general models on all five axes by 25 to 39 points. But the durable finding isn't the winner: it's that exam-style scores don't predict point-of-care performance, and that LLM-as-judge systematically disagreed with the physician judges.
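If you want to see how far your own LLM judge drifts from human graders, one common yardstick is Cohen's kappa, which corrects raw agreement for chance. The sketch below is illustrative only: the labels are made up, and nothing here is claimed to be the method the Real-POCQi authors used.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters over the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(rater_a)
    # Fraction of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical verdicts on eight answers; NOT data from the paper.
physician = ["pass", "fail", "pass", "pass", "fail", "fail", "pass", "fail"]
llm_judge = ["pass", "pass", "pass", "pass", "fail", "pass", "pass", "fail"]

kappa = cohens_kappa(physician, llm_judge)
print(f"kappa = {kappa:.2f}")
```

In this toy case the two raters agree on 6 of 8 answers, but the judge says "pass" so often that half of that agreement is what chance alone would produce.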
A model that scores 90 on board questions and a model that answers the question your resident actually asked at 2 AM are being measured by two different tests.
That matters this week because a Nature paper made the rounds claiming an autonomous agent out-scored physicians: in a simulation, on retrospective data, with possible train/test overlap, which is why the authors flagged the result as an upper bound. Different benchmark, different reality.
🤔 "OpenEvidence graded well on a benchmark that features OpenEvidence. Shocking." Fair, and worth holding. The reusable win isn't the scoreboard; it's that they released Real-POCQi as a public corpus. Take the dataset, drop the vendor, and you have a real-query test set you can point at anything.
💡 80/20: Real-POCQi is public. Pull it and run your own model-vs-model grid on questions physicians actually asked, not the boards, not a simulation. The benchmark is now yours to own.
[note: a look at the actual data for emergency medicine questions. The Hugging Face data explorer is actually kinda cool for just running some SQL.]
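A model-vs-model grid is mostly bookkeeping: group graded answers by model and axis, then average. Here is a minimal sketch using only the standard library. The row layout, model names, and axis names are placeholders I made up, not the Real-POCQi schema, so map them to whatever columns the public corpus actually has.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical graded rows: (question_id, model, axis, score 0-100).
# Field names, models, and axes are placeholders, not the real schema.
rows = [
    ("q1", "model_a", "accuracy", 80), ("q1", "model_b", "accuracy", 60),
    ("q1", "model_a", "safety", 90),   ("q1", "model_b", "safety", 70),
    ("q2", "model_a", "accuracy", 70), ("q2", "model_b", "accuracy", 90),
    ("q2", "model_a", "safety", 50),   ("q2", "model_b", "safety", 80),
]

def build_grid(rows):
    """Return {model: {axis: mean score}} from per-question graded rows."""
    buckets = defaultdict(lambda: defaultdict(list))
    for _qid, model, axis, score in rows:
        buckets[model][axis].append(score)
    return {m: {a: mean(s) for a, s in axes.items()} for m, axes in buckets.items()}

grid = build_grid(rows)
for model, axes in sorted(grid.items()):
    print(model, {axis: round(score, 1) for axis, score in sorted(axes.items())})
```

Averages hide per-question swings (model_a wins q1 and loses q2 here), so it is worth keeping the raw rows around and looking at the spread before declaring a winner.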
Someone talked Claude into being a doctor. It didn't take much.
Red-teamers at Mindgard got Claude to adopt a primary-care-physician persona: it diagnosed a mole, wrote a doctor's note, generated a specialist referral, fabricated credentials, and (the one that lands) produced a medication tapering protocol.
The safety guardrails aren't a wall; they're a suggestion a persona can talk its way around. And state and federal regulators are still split on whether a general-purpose chatbot doing this is even in scope.
🤔 "Jailbreaks are a party trick, not a clinical risk." Tell that to the patient who screenshots a tapering plan and stops their SSRI cold. The point isn't that Claude is dangerous. It's that "we added guardrails" is a probabilistic claim, not a deterministic one, and your deployment plan has to assume the guardrail fails sometimes.
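To make "probabilistic, not deterministic" concrete: if a guardrail fails with some small probability per conversation, the chance of at least one failure across n conversations is 1 - (1 - p)^n. The sketch below assumes conversations are independent and uses a made-up failure rate; it is not a measured number for Claude or any other model.

```python
def prob_at_least_one_failure(per_conversation_failure_rate, conversations):
    """P(guardrail fails at least once), assuming independent conversations."""
    if not 0.0 <= per_conversation_failure_rate <= 1.0:
        raise ValueError("failure rate must be between 0 and 1")
    if conversations < 0:
        raise ValueError("conversations must be non-negative")
    # Complement of "the guardrail holds in every single conversation".
    return 1.0 - (1.0 - per_conversation_failure_rate) ** conversations

# Hypothetical rate: one bypass per 10,000 conversations.
for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} conversations: {prob_at_least_one_failure(1e-4, n):.3f}")
```

Even a rate that sounds negligible per conversation approaches certainty at deployment scale, which is the argument for designing the downstream checkpoint rather than trusting the guardrail alone.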
The wearables training your risk model belong to the people who need it least.
A heavily cited essay from a surgeon-founder lays out the trap: continuous monitoring genuinely works (telemonitoring cut systolic BP ~5 mmHg across 106,261 patients), but ownership skews wealthy, urban, educated, and healthy. Households earning over $200K have more than double the odds of owning a wearable; the uninsured sit at 0.41x and rural residents at 0.65x.
So the longitudinal data now teaching clinical-AI risk models is drawn from the worried well. "If clinical AI learns from the people who need it least, it will work best for the people who need it least."
It ties straight to CMS's ACCESS Model, which pays only when a share of your panel hits target, a scoring rule that quietly rewards enrolling the patients most likely to succeed.
❓ Every eval-harness conversation we have is about accuracy on a test set. But if the test set itself is the Apple-Watch demographic, a "validated" model can be biased and pass anyway. What does a representativeness check look like as a standard step in a clinical eval, and who ships that as a tool first?
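One possible shape for that check, offered as a sketch rather than a standard: compare each group's share of the eval set to its share of the population the model will serve, and flag anything badly under-represented. The group names, counts, population shares, and the 0.8 threshold below are all assumptions chosen for illustration.

```python
def representation_gaps(eval_counts, population_shares, min_ratio=0.8):
    """Flag groups whose eval-set share is below min_ratio x their population share.

    Returns {group: eval_share / population_share} for flagged groups only.
    """
    total = sum(eval_counts.values())
    if total == 0:
        raise ValueError("eval set is empty")
    flagged = {}
    for group, pop_share in population_shares.items():
        if pop_share <= 0:
            raise ValueError(f"population share for {group!r} must be positive")
        eval_share = eval_counts.get(group, 0) / total
        ratio = eval_share / pop_share
        if ratio < min_ratio:
            flagged[group] = round(ratio, 2)
    return flagged

# Hypothetical, mutually exclusive strata; numbers are illustrative only.
eval_counts = {"urban_insured": 800, "rural": 120, "uninsured": 80}
population_shares = {"urban_insured": 0.60, "rural": 0.25, "uninsured": 0.15}

gaps = representation_gaps(eval_counts, population_shares)
print(gaps)
```

A ratio near 0.5 means the group appears in the test set at about half the rate it appears in the target population, so an aggregate accuracy number says relatively little about how the model performs for that group.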
Software factories are coming for the whole development loop.
Two of the sharpest AI-coding vendors used the same phrase this week. Warp's Zach Lloyd and Cursor's forward-deployed engineering lead both describe a shift from "engineer chats with an agent" to a "software factory": continuous, automated triage → spec → implement → review → verify → ship → monitor, with humans on the high-risk checkpoints.
For a clinician-builder the translation is clean: the factory automates the parts that were never your edge. The one stage that doesn't commoditize, the review-and-verify checkpoint where clinical judgment decides whether the output is safe, is exactly the stage you're uniquely built to own.
🔮 Prediction: The clinician-builders who win the next 18 months won't be the ones who write the most code. They'll be the ones who design the verification stage of the factory (the eval, the guardrail, the human checkpoint), because that's the seat only a clinician can fill.
Ultra-shorts
Telehealth infra keeps eating AI features. OpenLoop acquired voice-AI platform Hey Revia, which handles complex patient phone calls for providers, and is folding it into its self-serve telehealth launchpad. The build-vs-buy line for "AI voice/comms" is moving toward buy.
Sharecare put AI navigation on AWS. Sharecare's new AskMD helps patients parse symptoms, check eligibility, and find care: a reminder that the "front door" land-grab is now an infrastructure deal, not a feature.
Anthropic aimed a model at the lab. Claude Science targets research and pharma. Worth watching for anyone tracking where clinical-adjacent AI tooling shows up next.
Elevance wired $342M back to CMS. The payment follows years of "substantial and persistent noncompliance" on risk-adjustment coding, including submitting corrections on encrypted flash drives instead of through the required electronic systems. The data-integrity plumbing under Medicare Advantage is a real, unglamorous build surface.
Jennifer Baron (Cityblock Health) argued that generic AI wasn't built for the clinical and social realities of dual-eligible patients, and that provider-led tooling is the way to close the "AI generalization gap": the same dataset-bias worry as the wearables story, from the Medicaid seat.
Bhargav Patel, MD, MBA posted a 7-page breakdown of the MIRA autonomous-agent paper for a physician audience, a useful, caveat-forward read on why "beats doctors in a simulation" and "safe to deploy in an ED" are very different claims.
🛠️ From the Workbench
The FHIR-MCP server ecosystem grew up.
Last week the interesting repo was fhirHydrant, a lightweight Node.js FHIR MCP server. This week the point is that it's no longer alone. WSO2 ships an enterprise-grade FHIR MCP server with SMART-on-FHIR auth, an Epic Sandbox demo, and three transport modes; LangCare (Go) exposes 40+ clinical skills and multi-EHR config. The category has moved from "cool demo" to "deployable infrastructure", which matters, because the winter HTI-5 proposal explicitly named MCP as a future interoperability standard alongside FHIR.
⚠️ Verify: These are open-source projects, not compliance products. "SMART-on-FHIR auth" and "enterprise-grade" are engineering descriptions, not a BAA. Stand them up against synthetic FHIR data on localhost; do not point one at real patient data until your own security and legal review says so.
🤔 "MCP in healthcare is a solution looking for a problem." Maybe.
💡 80/20: Clone WSO2's server, connect it to the Epic Sandbox, and ask an agent one real clinical question against synthetic data. You'll learn more about where MCP breaks on FHIR in an afternoon than in a month of reading the spec.
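A cheap way to enforce the "synthetic data only" rule from the Verify note is a guard that refuses any FHIR base URL whose host is not on an explicit allowlist. This is a sketch under my own assumptions: the function name, the default loopback allowlist, and the example URLs are hypothetical and are not part of WSO2's server or any other project named above.

```python
from urllib.parse import urlsplit

# Loopback hosts only by default; this allowlist is an assumption for the sketch.
LOCAL_HOSTS = frozenset({"localhost", "127.0.0.1", "::1"})

def assert_allowed_fhir_base(base_url, allowed_hosts=LOCAL_HOSTS):
    """Return base_url unchanged if its host is allowlisted, else raise ValueError."""
    host = urlsplit(base_url).hostname  # None when the URL has no scheme://host part
    if host not in allowed_hosts:
        raise ValueError(f"refusing FHIR base URL outside the allowlist: {base_url!r}")
    return base_url

# Hypothetical local test server URL.
base = assert_allowed_fhir_base("http://localhost:8080/fhir")
print("ok:", base)

try:
    assert_allowed_fhir_base("https://ehr.example.org/fhir")  # placeholder remote host
except ValueError as err:
    print("blocked:", err)
```

To exercise a vendor sandbox, pass its hostname in `allowed_hosts` deliberately, so that pointing at anything other than a known synthetic-data endpoint takes a code change instead of a typo.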
🎙️ From the Pods
🎙️ Health Tech Nerds Radio: "Discussing WISeR and the Merits of Prior Auths" (Jeremy Friese, Humata Health)
Friese's frame on the CMS WISeR pilot in Oklahoma: automate the submission, automate the yes, and reserve humans for the 5 to 10% of cases that genuinely need adjudication. His hard line, "AI can and should only be used to say yes," plus radical transparency (show the NCD/LCD criteria right in the portal) is the most builder-usable prior-auth philosophy I've heard.
💡 Builder take: The reusable idea is transparency-as-feature: surface the exact clinical criteria a decision is judged against, in the workflow. That's a shippable pattern, not a moonshot.
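Here is roughly what transparency-as-feature could look like in code: a review function that can only approve or escalate, and that always returns the criteria it checked so the UI can show them. This is a sketch of the pattern as described on the podcast, not Humata's implementation; the type names and the criterion text are placeholders, not real NCD/LCD language.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Criterion:
    label: str  # text shown to the clinician in the workflow
    met: bool   # whether the submitted documentation satisfies it

def review_request(criteria):
    """Auto-approve only when every criterion is met; otherwise route to a human.

    There is deliberately no "denied" outcome: the automated path can only say yes.
    """
    unmet = [c.label for c in criteria if not c.met]
    approved = bool(criteria) and not unmet
    return {
        "decision": "approved" if approved else "needs_human_review",
        "criteria_shown": [c.label for c in criteria],  # surfaced in the portal
        "unmet": unmet,
    }

# Placeholder criteria for illustration only.
request = [
    Criterion("Placeholder criterion A documented", True),
    Criterion("Placeholder criterion B documented", False),
]
result = review_request(request)
print(result["decision"], "| unmet:", result["unmet"])
```

Returning `criteria_shown` on approvals as well as escalations leaves a record of why each automated yes was issued, which gives a later audit something to check.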
Speaker Blindspot: Motte-and-bailey. "Our AI only says yes" is the easy-to-defend motte; the industry's AI that denies care is the bailey he steps around ("that's not us") when the Highmark-drops-the-human example comes up. And "every auth but one came through our portal" is survivorship framing: it counts only the happy adopters. The unasked question: if AI only approves and humans only see denials, who ever audits a wrong yes?
💡 BTW: Anupam Jena, senior author on the Real-POCQi eval above, is also the physician-economist behind the Freakonomics, MD podcast, and his PhD adviser was Steven Levitt, the Freakonomics co-author himself. His signature work uses "natural experiments" to find hidden forces in medicine: one of his most-cited findings is that heart-attack mortality actually drops when senior cardiologists are away at their national conference. The book: Random Acts of Medicine (Jena & Worsham).
What are you building this week? Email and tell me (kevin@clinicians.build). I read every one.
- Kevin & AI




