TEFCA's referee is an algorithm 🤖, Claude plays doctor 🩺, Real questions beat board exams 📊

Jul 02, 2026

Who referees the network? Right now, an algorithm nobody can audit.

This week HHS touted that TEFCA has moved more than a billion health records — the national backbone that every FHIR-based product eventually leans on — and in the same motion announced new oversight muscle.

That muscle is a five-year, up to $5.62M contract awarded to Alliance Global Tech (AGT), a little-known 55-person federal IT contractor, to audit TEFCA participants and refer “civilly or criminally actionable” behavior — including information blocking and fraud — to OCR, HHS-OIG, and the DOJ.

Here’s the part that should stop you: AGT’s website briefly advertised that it uses AI to flag participants for review, then scrubbed the claim — and the rules that algorithm runs on are undisclosed.

So the same office (ASTP/ONC) that proposed deleting more than half its certification rulebook this past winter is now enforcing what’s left through a process no outside party can inspect. As one emergency-physician-turned-Epic-consultant put it, health systems are being asked “to walk a tightrope blindfolded, with steel shoes so they cannot feel the rope.”

😤 “This is just standard federal contracting — every network gets an auditor.” Sure. But most auditors publish their methodology. When the reviewer is a model whose “rules of the road” are undisclosed, “we audited you” and “the algorithm flagged you” become the same sentence — and you can’t appeal a black box.

❓ If federal oversight is going to run on a model, where’s the eval harness for the eval-er? We demand model cards and audit trails from every vendor selling into a hospital. Who’s holding the government’s compliance algorithm to the same bar it’s about to enforce?

🧪 NEW: Try today’s interactives (lets see how long we can keep this “factory“ up)

The Inverse-Care Explorer — Wearables skew toward the healthy and wealthy. But CDC data shows chronic disease clusters where coverage is thinnest. Explore the gap, state by state.

AI built with real public-health data and hosted on www.clinicians.dev

📡 Builder’s Radar

The benchmark you pick decides who wins — so pick the one made of real questions.

A new blinded evaluation (Real-POCQi, arXiv) did something most AI benchmarks don’t: it used 620 questions physicians actually typed into a clinical tool, across 30 specialties, and had 149 practicing physicians in 36 states grade the answers — each judge specialty-matched to the question.

The specialized tool (OpenEvidence) beat three frontier general models on all five axes by 25–39 points. But the durable finding isn’t the winner — it’s that exam-style scores don’t predict point-of-care performance, and that LLM-as-judge systematically disagreed with the physician judges.

A model that scores 90 on board questions and a model that answers the question your resident actually asked at 2 AM are being measured by two different tests.

That matters this week because a Nature paper made the rounds claiming an autonomous agent out-scored physicians — in a simulation, on retrospective data, with possible train/test overlap the authors flagged as an upper bound. Different benchmark, different reality.

😤 “OpenEvidence graded well on a benchmark that features OpenEvidence — shocking.” Fair, and worth holding. The reusable win isn’t the scoreboard, it’s that they released Real-POCQi as a public corpus. Take the dataset, drop the vendor, and you have a real-query test set you can point at anything.

💡 80/20: Real-POCQi is public. Pull it and run your own model-vs-model grid on questions physicians actually asked — not the boards, not a simulation. The benchmark is now yours to own.

[note: a look at the actual data for emergency medicine questions. The hugging face data explorer is actually kinda cool to just do some sql.]

Someone talked Claude into being a doctor. It didn’t take much.

Red-teamers at Mindgard got Claude to adopt a primary-care-physician persona: it diagnosed a mole, wrote a doctor’s note, generated a specialist referral, fabricated credentials, and — the one that lands — produced a medication tapering protocol.

The safety guardrails aren’t a wall; they’re a suggestion a persona can talk its way around. And state and federal regulators are still split on whether a general-purpose chatbot doing this is even in scope.

😤 “Jailbreaks are a party trick, not a clinical risk.” Tell that to the patient who screenshots a tapering plan and stops their SSRI cold. The point isn’t that Claude is dangerous — it’s that “we added guardrails” is a probabilistic claim, not a deterministic one, and your deployment plan has to assume the guardrail fails sometimes.

The wearables training your risk model belong to the people who need it least.

A heavily-cited essay from a surgeon-founder lays out the trap: continuous monitoring genuinely works (telemonitoring cut systolic BP ~5 mmHg across 106,261 patients), but ownership skews wealthy, urban, educated, and healthy — households over $200K have more than double the odds; uninsured 0.41x, rural 0.65x.

So the longitudinal data now teaching clinical-AI risk models is drawn from the worried well. “If clinical AI learns from the people who need it least, it will work best for the people who need it least.”

It ties straight to CMS’s ACCESS Model, which pays only when a share of your panel hits target — a scoring rule that quietly rewards enrolling the patients most likely to succeed.

❓ Every eval-harness conversation we have is about accuracy on a test set. But if the test set itself is the Apple-Watch demographic, a “validated” model can be biased and pass anyway. What does a representativeness check look like as a standard step in a clinical eval — and who ships that as a tool first?

Software factories are coming for the whole development loop.

Two of the sharpest AI-coding vendors used the same phrase this week. Warp’s Zach Lloyd and Cursor’s forward-deployed engineering lead both describe a shift from “engineer chats with an agent” to a “software factory”: continuous, automated triage → spec → implement → review → verify → ship → monitor, with humans on the high-risk checkpoints.

For a clinician-builder the translation is clean: the factory automates the parts that were never your edge. The one stage that doesn’t commoditize — the review-and-verify checkpoint where clinical judgment decides whether the output is safe — is exactly the stage you’re uniquely built to own.

🔮 Prediction: The clinician-builders who win the next 18 months won’t be the ones who write the most code. They’ll be the ones who design the verification stage of the factory — the eval, the guardrail, the human checkpoint — because that’s the seat only a clinician can fill.

Ultra-shorts

Telehealth infra keeps eating AI features. OpenLoop acquired voice-AI platform Hey Revia — it handles complex patient phone calls for providers — and is folding it into its self-serve telehealth launchpad. The build-vs-buy line for “AI voice/comms” is moving toward buy.

Sharecare put AI navigation on AWS. Sharecare’s new AskMD helps patients parse symptoms, check eligibility, and find care — a reminder that the “front door” land-grab is now an infrastructure deal, not a feature.

Anthropic aimed a model at the lab. Claude Science targets research and pharma — worth watching for anyone tracking where clinical-adjacent AI tooling shows up next.

Elevance wired $342M back to CMS. After years of “substantial and persistent noncompliance” on risk-adjustment coding — including submitting corrections on encrypted flash drives instead of the required electronic systems. The data-integrity plumbing under Medicare Advantage is a real, unglamorous build surface.

Jennifer Baron (Cityblock Health) argued that generic AI wasn’t built for the clinical and social realities of dual-eligible patients, and that provider-led tooling is the way to close the “AI generalization gap” — the same dataset-bias worry as the wearables story, from the Medicaid seat.

Bhargav Patel, MD, MBA posted a 7-page breakdown of the MIRA autonomous-agent paper for a physician audience — a useful, caveat-forward read on why “beats doctors in a simulation” and “safe to deploy in an ED” are very different claims.

🛠️ From the Workbench

The FHIR-MCP server ecosystem grew up.

Last week the interesting repo was fhirHydrant — a lightweight Node.js FHIR MCP server. This week the point is that it’s no longer alone. WSO2 ships an enterprise-grade FHIR MCP server with SMART-on-FHIR auth, an Epic Sandbox demo, and three transport modes; LangCare (Go) exposes 40+ clinical skills and multi-EHR config. The category has moved from “cool demo” to “deployable infrastructure” — which matters, because the winter HTI-5 proposal explicitly named MCP as a future interoperability standard alongside FHIR.

⚠️ Verify: These are open-source projects, not compliance products. “SMART-on-FHIR auth” and “enterprise-grade” are engineering descriptions, not a BAA. Stand them up against synthetic FHIR data on localhost; do not point one at real patient data until your own security and legal review says so.

😤 “MCP in healthcare is a solution looking for a problem.” Maybe.

💡 80/20: Clone WSO2’s server, connect it to the Epic Sandbox, and ask an agent one real clinical question against synthetic data. You’ll learn more about where MCP breaks on FHIR in an afternoon than in a month of reading the spec.

🎙️ From the Pods

🎙️ Health Tech Nerds Radio — “Discussing WISeR and the Merits of Prior Auths” (Jeremy Friese, Humata Health)

Friese’s frame on the CMS WISER pilot in Oklahoma: automate the submission, automate the yes, and reserve humans for the 5–10% of cases that genuinely need adjudication. His hard line — “AI can and should only be used to say yes” — plus radical transparency (show the NCD/LCD criteria right in the portal) is the most builder-usable prior-auth philosophy I’ve heard.

💡 Builder take: The reusable idea is transparency-as-feature — surface the exact clinical criteria a decision is judged against, in the workflow. That’s a shippable pattern, not a moonshot.

🔇 Speaker Blindspot: Motte-and-bailey. “Our AI only says yes” is the easy-to-defend motte; the industry’s AI that denies care is the bailey he steps around (”that’s not us”) when the Highmark-drops-the-human example comes up. And “every auth but one came through our portal” is survivorship framing — the happy adopters. The unasked question: if AI only approves and humans only see denials, who ever audits a wrong yes?

💡 BTW: Anupam Jena — senior author on the Real-POCQi eval above — is also the physician-economist behind the Freakonomics, MD podcast, and his PhD adviser was Steven Levitt, the Freakonomics co-author himself. His signature work uses “natural experiments” to find hidden forces in medicine: one of his most-cited findings is that heart-attack mortality actually drops when senior cardiologists are away at their national conference. Random Acts of Medicine (Jena & Worsham).

What are you building this week? Email and tell me (kevin@clinicians.build) — I read every one.

— Kevin & AI

clinicians.build

Discussion about this post

Ready for more?