Fake citations hit 1-in-277 📚, Claude agents learn to dream 💭, MDCalc's lesson in building
One in 277 Biomedical Papers Now Contains a Fabricated Citation
A reference-integrity audit published in The Lancet analyzed 2.5 million biomedical papers and 125.6 million structured references spanning January 2023 to February 2026. The finding: fabricated citations in PubMed-indexed papers have increased 12-fold in two years. In 2023, the rate was 1 in 2,828 papers. By early 2026, it’s 1 in 277. The inflection point tracks exactly with widespread LLM adoption. Previous studies estimate 30–69% of LLM-generated references in biomedical contexts are fabricated — plausible-sounding but fictitious.
This matters beyond academic integrity. Every clinical decision support tool, every RAG pipeline pulling from PubMed, every evidence synthesis agent — they all assume the literature is real. A fabricated citation doesn’t just waste a researcher’s time. It poisons the training data for the next generation of clinical AI. And the harder problem, as STAT noted, isn’t the wholly fabricated references — it’s the ones that are inaccurate, biased, or incomplete but pass surface-level checks.
😤 Haters
“This is an academic publishing problem, not a clinical AI problem.” It’s both. If your CDS tool retrieves evidence from PubMed and a fabricated citation makes it into the retrieval set, the tool is confidently citing a paper that doesn’t exist. The evidence layer is shared infrastructure. Corruption upstream flows downstream.
“LLMs aren’t the only cause — paper mills existed before ChatGPT.” True. The audit can’t definitively attribute the spike to AI. But the 12x acceleration in two years is not explained by paper mills alone. The rate was stable for years and inflected sharply in 2024. Something changed, and the timing is not subtle.
“Just validate your references — problem solved.” Reference validation at scale is genuinely hard. CrossRef DOI lookups catch some fabrications, but LLM-hallucinated references often have plausible DOIs, correct-looking journal names, and authors who actually publish in the field. Automated detection tools exist but haven’t been deployed at journal-intake scale. The gap between the problem’s growth rate and the detection tooling is widening.
💡 80/20: If you’re building anything that retrieves or cites medical literature, add a reference-validation layer before it reaches the user. CrossRef API + PubMed E-utilities can verify that a cited paper actually exists. It adds seconds to the pipeline and catches the most egregious fabrications. Try: run your last 50 retrieved references through a DOI-existence check and see what comes back.
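A minimal sketch of that DOI-existence check against the CrossRef REST API, assuming your retrieved references are already reduced to a list of DOI strings; the contact address in the User-Agent header and the sample DOIs are placeholders, not anything from the audit:

```python
# Sketch: flag cited DOIs that have no CrossRef record.
# CrossRef returns 200 for a known DOI and 404 for an unknown one.
import time
import requests

CROSSREF_WORKS = "https://api.crossref.org/works/"

def doi_exists(doi: str) -> bool:
    """Return True if CrossRef has a record for this DOI."""
    resp = requests.get(
        CROSSREF_WORKS + doi,
        headers={"User-Agent": "ref-check/0.1 (mailto:you@example.org)"},  # placeholder contact, per CrossRef etiquette
        timeout=10,
    )
    return resp.status_code == 200

def audit(dois: list[str]) -> list[str]:
    """Return the DOIs that did not resolve -- candidates for fabrication."""
    missing = []
    for doi in dois:
        if not doi_exists(doi):
            missing.append(doi)
        time.sleep(0.5)  # be polite to the public API
    return missing

if __name__ == "__main__":
    sample = ["10.1056/NEJMoa2001017", "10.0000/not.a.real.doi"]  # illustrative only
    print(audit(sample))
```

A DOI that resolves only rules out the wholly fabricated case; as noted above, the harder failures are real papers cited inaccurately, so treat this as a floor, not a full validation layer. PubMed E-utilities (searching by title) can cover references that lack a DOI.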
Claude Agents Can Now Dream — And That Changes the Feedback Loop
Anthropic launched three features for Managed Agents: Dreaming (a scheduled process that reviews past sessions, extracts patterns, and curates memories for self-improvement), Outcomes (define success criteria and let a separate grader evaluate the agent’s work), and Multiagent Orchestration (spawn specialist subagents that work in parallel). In internal testing, Outcomes improved task success by up to 10 percentage points, with the largest gains on harder problems.
😤 Haters
“Self-improving agents in healthcare is a regulatory nightmare.” Today, yes. But the pattern — structured review of past performance to improve future performance — is literally what clinical M&M conferences do. The question isn’t whether AI should learn from mistakes. It’s whether the learning loop has the right guardrails: audit trails, human oversight of what the agent “learned,” and the ability to roll back bad lessons.
“This is enterprise tooling, not clinician-builder relevant.” Disagree. Dreaming is available in the API. If you’re building a clinical agent that runs repeatedly — a daily lab reviewer, a discharge summary checker, a prior auth bot — dreaming means it gets better at your specific workflow without you manually tuning prompts every week.
💡 80/20: If you have a Claude-based agent that runs the same task repeatedly, Outcomes alone is worth testing. Define what “good” looks like for your task (completeness, accuracy, format), let the grader evaluate, and measure whether the agent converges. Try: write a 5-point rubric for your agent’s output and enable Outcomes for a week.
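If you want to prototype the pattern before wiring up Outcomes itself, here's a rough sketch of a separate-grader loop using the standard Anthropic Messages API. To be clear: this is a DIY stand-in, not the Outcomes feature's actual interface, and the rubric items (written for the lab-reviewer / discharge-summary examples above) and model name are illustrative:

```python
# Sketch: grade an agent's output against a 5-point rubric with a separate model call.
import anthropic

RUBRIC = """Score the draft 0-2 on each criterion (0 = fails, 2 = fully meets):
1. Completeness: every abnormal lab value is addressed.
2. Accuracy: no values or reference ranges are misquoted.
3. Format: follows the team's summary template.
4. Safety: critical results are flagged for clinician review.
5. Brevity: no redundant or copied-forward text.
Return one line per criterion: <number>. <score> - <one-sentence reason>."""

def grade(agent_output: str, model: str = "claude-sonnet-4-5") -> str:
    """Ask a separate model call to grade the agent's output against the rubric."""
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    reply = client.messages.create(
        model=model,  # placeholder model name
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"{RUBRIC}\n\nDraft to grade:\n{agent_output}",
        }],
    )
    return reply.content[0].text

if __name__ == "__main__":
    print(grade("Example discharge-summary draft goes here."))
```

Scoring a week of saved outputs this way gives you a rough baseline to compare against once the Outcomes grader is doing the same job, and forces you to write down what "good" means before any automation touches it.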
MDCalc Ships Its Biggest Frontend Update in 20 Years
Graham Walker, MD — ER doc, MDCalc co-founder — announced the redesign this week as the platform turns 20. New header, cleaner navigation, modernized UI. MDCalc remains one of the purest examples of clinician-built software at scale — created by an EM resident, used by millions, still led by a practicing physician. No VC pivot to enterprise SaaS. No rebrand as an “AI-powered clinical decision platform.” Just a tool that works, maintained by the person who built it.
😤 Haters
“A frontend update isn’t news.” For most companies, no. But MDCalc is a 20-year-old clinician-built tool that’s stayed clinician-built. In an era where every clinical tool gets acquired and rebranded, that longevity is the story. The update signals the platform is investing in its future, not just coasting.
“MDCalc is just calculators — the future is AI-powered CDS.” MDCalc’s calculators are evidence-based, transparent, and trusted because clinicians can see exactly what goes in and what comes out. That transparency is the feature most AI CDS tools lack. There’s a reason clinicians still use Wells criteria instead of asking ChatGPT for a PE probability.
💡 80/20: MDCalc’s staying power is a design lesson: build something specific, keep it simple, maintain it for decades. If you’re starting a clinical tool, the MDCalc model — solve one thing well, earn trust through transparency, grow laterally — beats the “platform play” pitch deck every time.
Anthropic-SpaceX Compute Deal: 220K GPUs, 300+ MW
Anthropic partnered with SpaceX for access to the Colossus 1 data center in Memphis — 300+ megawatts, 220,000+ NVIDIA GPUs coming online this month. Claude Code rate limits doubled immediately across all paid plans. Peak-hours throttling removed for Pro and Max. This joins the 5 GW Amazon, 5 GW Google/Broadcom, and $50B Fluidstack deals. Anthropic grew 80x last quarter.
Quebec Drops Epic for Province-Wide EMR
Quebec abandoned Epic as the platform for its planned centralized provincial EMR. One of the highest-profile international rejections of Epic’s expansion strategy. For US health systems, the question isn’t whether Epic is a good EHR — it’s whether the single-vendor, high-switching-cost model translates when the buyer has different procurement norms and data sovereignty requirements.
What are you building this week? Reply and tell me — I read every one.
— Kevin


