Citation without retrieval in OpenEvidence? 📑, MIT teaches AI to say 'I don't know' 🤷, Anthropic crosses $1T 🚀
Citation without retrieval in OpenEvidence? An oncologist documented it across five trials.
Allen Lim, MD — practicing oncologist running the Oncology AI Lab Substack — ran a structured five-trial stress test against OpenEvidence, the clinical AI that roughly 40% of US clinicians use daily. The corpus was deliberately the kind of evidence-based oncology a busy physician actually has to engage: TROPION-Breast02, DESTINY-Breast05, OptiTROP-Lung04, SERENA-6, INAVO120. The methodology was Bishal Gyawali’s JCO Oncology Practice primer — protocol detail, supplementary appendix, statistical analysis plan, ICMJE disclosure layer. The verbatim prompts are published.
The finding is not that OpenEvidence is wrong. It is that the citations were rendered as if the model had read the full trial, while the retrieval reached only the abstract or the main published text. On TROPION-Breast02, OpenEvidence built a head-to-head recommendation table off a 250-word abstract and admitted on direct challenge that “the source available in the database is the published abstract only.” On DESTINY-Breast05, it confidently quoted protocol-level dose-modification rules for grade 2 ILD, then conceded the database does not include the trial protocol or the supplementary appendix — adjacent documents (DESTINY-Breast03, the FDA Enhertu label, a JAMA Oncology review) had been retrieved and presented as if from the protocol. The author’s term for this is citation laundering. On OptiTROP-Lung04, Table S3 — sites of progression by treatment arm, CNS progression 26% vs 11% — was unreachable. On SERENA-6, the SAP and the prespecified PFS2 alpha threshold were unreachable. On INAVO120, the ICMJE conflict-of-interest forms (named medical writer, Roche-employed co-authors, sponsor-funded statistics) were unreachable. In three of five cases, the model proceeded as if it had the document anyway.
😤 Haters
“This is one oncologist running prompts on five trials — call it methodology before you call it a finding.” Single-author study with self-published prompts, fair. The thing that makes it more than a blog post is the structure: a primer-anchored framework (Gyawali’s), verbatim prompts for reproducibility, a coherent failure taxonomy (citation without retrieval, citation laundering, abstract-as-protocol), and a primary clinical use case where the difference matters in real care decisions. If your reaction is “I want a bigger n,” good — go run the same five prompts on your favorite clinical AI this weekend. The methodology is sitting there. The reason it lands as a finding is that it’s reproducible by anyone with two hours and an NEJM password.
“The fix is not architectural — it’s just that OpenEvidence’s index is small. License the supplements and this goes away.” Partly true, and worth doing. The deeper read is that adding documents to the index does not fix the harder failure: a model presenting an adjacent document as if it were the requested document. A bigger corpus with the same retrieval layer just means the citation-laundering surface gets larger and harder to detect. The fix has to be at the layer where the model represents what it has. That layer is what MIT just shipped a method for — see Builder’s Radar below.
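To make “the layer where the model represents what it has” concrete, here is a minimal sketch of a provenance-aware citation record. This is a hypothetical illustration, not OpenEvidence’s actual architecture; the SourceMatch categories, field names, and example values are invented for this example.

```python
from dataclasses import dataclass
from enum import Enum

class SourceMatch(Enum):
    REQUESTED_DOCUMENT = "requested_document"  # the protocol / appendix / SAP actually asked for
    ADJACENT_DOCUMENT = "adjacent_document"    # a label, review, or sibling trial on the topic
    ABSTRACT_ONLY = "abstract_only"            # only the short published abstract is in the index

@dataclass
class RetrievedCitation:
    trial: str
    document_requested: str
    document_retrieved: str
    match: SourceMatch

def render_citation(c: RetrievedCitation) -> str:
    """Only claim the requested document when that is what was actually retrieved."""
    if c.match is SourceMatch.REQUESTED_DOCUMENT:
        return f"{c.trial}: cited from {c.document_retrieved}."
    return (
        f"{c.trial}: the {c.document_requested} is not in the index; "
        f"closest available source is {c.document_retrieved} ({c.match.value})."
    )

# Example: the DESTINY-Breast05 failure mode, surfaced instead of laundered.
print(render_citation(RetrievedCitation(
    trial="DESTINY-Breast05",
    document_requested="trial protocol",
    document_retrieved="FDA Enhertu label",
    match=SourceMatch.ADJACENT_DOCUMENT,
)))
```

The only point of the sketch is that the requested-versus-adjacent distinction has to live in the retrieval layer’s own data model; a bigger index with no equivalent of that match field still launders.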
“OpenEvidence is a tool, not an oracle — clinicians know to verify.” This is the line every vendor uses and the line clinicians want to believe. A tool used 18 million times in a single month inside the actual clinical workflow has crossed the threshold where verification by every user on every consultation is not a real social model. The right read of “tool, not an oracle” is that the tool has to know its limits, not that we trust ten thousand busy clinicians to pull the supplementary appendix every time the model confidently cites a protocol it never read.
MIT shipped a way to teach AI to say “I don’t know.” 90% reduction in calibration error.
MIT CSAIL published Reinforcement Learning with Calibration Rewards (RLCR) on April 22 — the paper itself, “Beyond Binary Rewards”, is on arXiv. The mechanism is small: standard RL rewards a correct answer, penalizes a wrong one, and treats every right answer the same whether the model was confident or guessing. RLCR adds a single term to the reward — a Brier score, the squared gap between the model’s stated confidence and the actual outcome — so the model now gets penalized for confident-and-wrong and for under-confident-and-right. On a 7B model, calibration error fell by up to 90% across multiple benchmarks; accuracy was maintained or improved on both training tasks and unseen ones. When asked things it did not know, the trained model returned a confidence score around 0.02 and asked for more context instead of hallucinating.
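For builders who want the mechanism in one screen, here is a minimal sketch of a calibration-aware reward in the spirit of RLCR. The exact shaping and weighting in the paper may differ; the function and the example numbers below are illustrative, not the paper’s implementation.

```python
def rlcr_reward(answer_correct: bool, stated_confidence: float) -> float:
    """Correctness reward plus a calibration term (a negative Brier score).

    Confident-and-wrong is penalized hardest; under-confident-and-right also
    loses reward relative to a well-calibrated answer.
    """
    correctness = 1.0 if answer_correct else 0.0
    brier = (stated_confidence - correctness) ** 2  # squared gap, in [0, 1]
    return correctness - brier

# Confident and right:    1.0 - (0.95 - 1.0)^2 ~= 0.998
# Confident and wrong:    0.0 - (0.95 - 0.0)^2 ~= -0.903
# Honest "I don't know":  0.0 - (0.02 - 0.0)^2 ~= -0.0004
print(rlcr_reward(True, 0.95), rlcr_reward(False, 0.95), rlcr_reward(False, 0.02))
```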
😤 Haters
“Calibration in a benchmark is not calibration in the wild.” Reasonable, and the canonical critique. The reason this paper still matters for clinical AI is that the failure mode the OpenEvidence stress test documented is exactly the one calibration training targets: confident-and-wrong on retrieval-laundered citations. A clinical AI fine-tuned with a calibration term against a held-out corpus of “supplements were not in the index” prompts would, in principle, return 0.02 instead of a recommendation table. Whether real-world calibration generalizes from benchmarks is the right open question. Whether this training signal addresses the failure mode the field has is much less open.
“This is a 7B model and a single paper — clinical AI vendors will not adopt it for two years.” The two-year line is probably right for a vendor’s main model and probably wrong for the eval and post-training layer. Calibration as a metric on an internal eval harness is something a serious clinical AI team can stand up in a sprint. The teams that get to the next ONC or FDA conversation with calibration numbers in the deck will be in a different procurement-shortlist class than the teams quoting accuracy alone.
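To put “stand up in a sprint” in perspective, here is a hedged sketch of one standard calibration metric, expected calibration error, computed over eval records of (stated confidence, was the answer correct). The binning scheme and record format are illustrative, not any vendor’s actual harness.

```python
from collections import defaultdict

def expected_calibration_error(records, n_bins: int = 10) -> float:
    """ECE over (stated_confidence, was_correct) pairs.

    Bin predictions by stated confidence, then weight each bin's
    |average confidence - observed accuracy| by its share of the records.
    0.0 means perfectly calibrated; confident-and-wrong inflates the score.
    """
    bins = defaultdict(list)
    for confidence, correct in records:
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, 1.0 if correct else 0.0))

    total = len(records)
    ece = 0.0
    for members in bins.values():
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(hit for _, hit in members) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece

# Toy check: one confident miss among confident hits shows up immediately.
print(expected_calibration_error([(0.95, True), (0.95, True), (0.95, False), (0.10, False)]))
```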
💡 80/20: Accuracy is what gets you in the demo. Calibration is what keeps you on the procurement shortlist.
Anthropic crossed $1T and shipped Claude inside Adobe, Blender, and six more creative tools. The workflow-control thesis just acquired a second case study.
On April 28, Anthropic crossed $1T in valuation, overtaking OpenAI on paper. Same day, Anthropic shipped Claude for Creative Work — native MCP-style connectors for Adobe Creative Cloud, Blender, Autodesk Fusion, Ableton, Splice, SketchUp, Resolume, and Canva. Claude inside Blender can debug 3D scenes; Claude inside Photoshop can batch-edit layered files; Claude inside Splice can search a royalty-free library and generate stems. Adobe shipped its own Adobe for Creativity Claude connector the same morning, granting access to 50+ pro tools across Photoshop, Premiere, Firefly, and InDesign. The Neuron’s framing is the right one for healthcare: the race for “smartest model” is over; the race is now for “deepest workflow,” and whichever lab gets buried inside the tools you already use wins.
😤 Haters
“Connectors are dressing — Adobe still owns the file format.” Mostly right, and also exactly the point. The connector is not how Anthropic wins Adobe. It’s how Adobe stays Adobe in an agent-mediated world. The clinical-AI parallel is precise: Epic does not need the best ambient scribe to dominate ambient scribing. Epic needs to be the connector that whichever scribe a hospital uses has to plug into. Christina Farr said exactly this on a podcast the same week — bet against Epic’s first-party AI scribe; do not bet against Epic’s good-enough integrated scribe two years from now. The workflow-control thesis is now showing up in two different industries inside the same news cycle.
💡 80/20: “Whichever [AI] lab gets buried inside the tools you already use wins” is the most useful sentence anyone has written about agent strategy this year.
Christina Farr asked the question every healthcare-AI deck skips.
On the Digital Health Inside Out podcast this week, Christina Farr — formerly CNBC, OMERS, Manatt, now Scrub Capital and Lifers — went straight at the labor-displacement argument that lives quietly under every healthcare-AI pitch deck: “Every time I hear about it in almost a gleeful way of somebody’s margins are about to improve, it makes me feel physically ill. Where is all of our empathy? Where did it go?” The interview also delivered the cleanest sentence anyone has said out loud about Epic and AI scribes — “I would bet against Epic coming up with a solution to compete in an extremely nimble way. But I wouldn’t count against them building a solution that’s good enough eventually.” The exception she names is Heidi Health — bootstrapped, focused on Australia / Canada / France / UK first, 2.4M+ consultations a week across 110 languages and 190 countries.
😤 Haters
“Empathy is not a strategy — somebody is going to take cost out of US healthcare and you cannot moralize the curve away.” Half right. The labor-displacement curve is real, and pretending otherwise is its own bad faith. The other half is that which labor gets displaced and what fills the freed time are choices, not laws of physics. A scribe that quietly redirects three minutes per patient encounter into eye contact and shared decision-making is the same technical thing as a scribe that quietly redirects three minutes per encounter into a sixteenth patient. The vendor that names which one they’re optimizing for has done the work. The vendor that hides behind “AI lifts all boats” is selling the second one with the language of the first.
“Heidi is small — quoting them as the Epic counter-thesis is generous.” Small in the US, very big in the markets they prioritized. The point is not that Heidi is going to beat Epic. The point is that the market structure — bootstrapped, internationally distributed, language-pluralistic, narrow-and-deep — is a viable shape for a clinician-builder competing in a category that looks foreclosed from a US-only vantage. The lesson is the geography of the bet, not the company.
What are you building this week? Reply and tell me — I read every one.
— Kevin


