AI benchmarks are rigged 📊, TCET dies after one device 💀, Claude metered
The “AI Beats Doctors” Headline Is Built on a Rigged Rubric
A Science paper from Brodeur et al. claimed AI (o1-preview) “eclipsed most benchmarks of clinical reasoning” — 89% vs 34% for physicians. Allen Lim, MD, just published a rigorous eLetter critique that takes the headline apart.
The Grey Matters rubric rewards comprehensive enumeration — listing every possible diagnosis and test — with no penalty for excess. An AI that outputs everything scores high. A focused physician who orders the three tests that actually matter scores low. That’s not reasoning. That’s a word count contest.
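To see why that matters, here’s a toy scoring sketch. This is my own illustration with hypothetical test names, not the actual Grey Matters rubric: a recall-only score can’t tell focused judgment from a brain dump, while anything that penalizes excess can.

```python
# Toy illustration, NOT the Grey Matters rubric. Answer key and test names are hypothetical.

def recall_only(answer: set[str], key: set[str]) -> float:
    """Credit for every key item mentioned; extra items cost nothing."""
    return len(answer & key) / len(key)

def f1(answer: set[str], key: set[str]) -> float:
    """F1-style score: irrelevant extras drag the score down."""
    if not answer:
        return 0.0
    hits = len(answer & key)
    precision, recall = hits / len(answer), hits / len(key)
    return 2 * precision * recall / (precision + recall) if hits else 0.0

key = {"troponin", "ecg", "d-dimer"}                        # the three tests that actually matter
focused_md = {"troponin", "ecg", "d-dimer"}                 # focused physician orders exactly those
verbose_ai = key | {f"extra_test_{i}" for i in range(20)}   # lists everything under the sun

print(recall_only(verbose_ai, key), recall_only(focused_md, key))  # 1.0 and 1.0: enumeration never loses
print(f1(verbose_ai, key), f1(focused_md, key))                    # ~0.23 vs 1.0: excess finally costs something
```

Under a recall-only rubric, the verbose answer can never lose. Under anything precision-aware, it does.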
In the paper’s strongest head-to-head experiment, AI’s advantage was 17 points at triage. At admission — when full clinical information is available — the gap was no longer statistically significant. Five of six experiments used historical physician comparators, introducing era effects and scorer drift.
😤 “So AI isn’t good at clinical reasoning?” No — AI is genuinely useful for enumeration, which is a real clinical task. The critique is about the measurement, not the tool. When someone tells you AI scored 89% on reasoning, ask: “89% on what rubric, compared to whom, under what conditions?” That’s the question builders need to internalize before deploying clinical AI with the claim that it “outperforms physicians.”
💡 80/20: The next time a vendor tells you their clinical AI “matches physician-level reasoning,” ask for the evaluation rubric. If the benchmark rewards verbosity over focused clinical judgment, it’s measuring the wrong thing. Read Lim’s critique — it’s the most useful framework for evaluating clinical AI benchmarks I’ve seen this month.
TCET Is Dead After Processing One Device in Two Years. RAPID Is Next.
CMS’s Transitional Coverage for Emerging Technologies program was supposed to fast-track Medicare coverage for breakthrough devices. In the two years since its August 2024 launch, it processed exactly one device against a self-imposed cap of five per year.
The design flaw was baked in from the start. TCET was scoped specifically for PMA-class devices — the ones that already have investor support and a path to coverage. The De Novo and 510(k) devices that actually need coverage support were excluded by design. CMS’s coverage group has 35-37 people managing nearly $1 trillion in Medicare spending.
Only 12.3% of FDA Breakthrough Device Designations reach market authorization (128 of 1,041). The bottleneck was never FDA clearance alone — it was the gap between clearance and coverage.
😤 “RAPID will be different.” Maybe. But TCET’s structural problem — a tiny team overwhelmed by the scope of Medicare spending — doesn’t get fixed by renaming the program. Watch for whether RAPID’s scope includes De Novo devices and whether CMS-FDA cooperation becomes real-time rather than sequential.
The AI Coding Agent Wars Just Got a Price Tag
Anthropic introduced metered programmatic usage — every Claude subscription now includes API credits equal to the plan amount. OpenAI responded by offering enterprise customers two free months of Codex for switching within 30 days.
Ramp spending data shows Anthropic at 34.4% of businesses vs OpenAI at 32.3% — the first time Anthropic has led.
For clinician-builders, the meta-story matters more than the pricing details. Coding agent infrastructure is commoditizing fast. Cline open-sourced its rebuilt SDK with CLI, agent teams, and scheduled jobs. LangChain shipped SmithDB for observability. The tooling layer that sits between you and the model is becoming abundant and cheap.
💡 80/20: If you’re paying for AI coding tools, compare your Claude and OpenAI spend against actual output: cost per merged PR or shipped feature, not tokens burned or seats licensed.
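Here’s the kind of back-of-the-envelope comparison I mean. The numbers are made up; swap in your own invoices and whatever output metric you actually trust (merged PRs, shipped features, resolved tickets).

```python
# Hypothetical monthly figures; replace with your own invoices and output counts.
monthly = {
    "claude": {"spend_usd": 200.0, "merged_prs": 18},
    "openai": {"spend_usd": 180.0, "merged_prs": 12},
}

for vendor, m in monthly.items():
    print(f"{vendor}: ${m['spend_usd'] / m['merged_prs']:.2f} per merged PR")
```

If one tool costs twice as much per unit of output, the 34.4% vs 32.3% horse race doesn’t matter for your budget.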
What are you building this week? Reply and tell me — I read every one.
— Kevin


