Kaggle seeds 124 clinical papers 🧪, Ambience goes platform 🏗️, Hippocratic puts voice on the floor 🎙️
124 Clinical Prediction Papers Were Built on Kaggle Datasets That Nobody Actually Checked
Graham Walker, MD, posted a thread on Saturday that should rearrange how you read the clinical ML literature. Australian researchers investigated two Kaggle datasets widely used in “medical AI” work: a stroke prediction dataset and a diabetes dataset that together have more than 400,000 downloads. Neither has verifiable provenance. The stroke dataset’s uploader explicitly notes it should not be used for research. The diabetes dataset has 7 percent duplicate rows and only 18 distinct HbA1c values across a file claiming 100,000 patients: a distribution that is clinically impossible unless the “data” was generated, not collected.
Those two files have seeded at least 124 published clinical prediction papers, roughly 1,500 citations, and one medical device patent co-held by USC and Caltech. Walker frames it as literature laundering: once a sketchy source is cited inside a review article, it stops being sketchy. It becomes “the literature.” Every gatekeeping layer — the repository, the modeler, the peer reviewer, the review-article author — assumed the previous layer had done the due diligence. None of them had.
Walker and Joseph Habboushe, MD MBA — the MDCalc co-founders — are keynoting the Coalition for Health AI summit on this. They are the right messengers because they already do this job for a living: they are clinicians who read the files.
😤 Haters
“This is a peer review problem. It is not an AI problem.” The reviewer failure is real, but peer review is a six-week human process. Model training is a six-hour script. When a sketchy dataset gets cited five times in good journals, it becomes part of the training corpus for the next generation of clinical LLMs — not because the model chose to trust it, but because the citation graph did. This is how bad provenance becomes bad weights.
“Every field has junk datasets. The signal comes through.” In most fields, junk data produces models that fail in obvious ways. In medicine, a model trained on 100,000 synthetic-looking patients can still return clinically plausible predictions — because the distribution was designed to look like clinical data. The failure mode is invisible until someone deploys it. That is the one field where “it looks right” is not enough.
“OK, but this is someone else’s mess. What am I supposed to do with it?” If you are a clinician-builder, this is the job description. The single most common failure mode in clinical ML is not the model architecture. It is the stuff that happened before the first line of Python got written. The people best-positioned to audit that are the ones who already read labs, charts, and imaging reports skeptically for a living. If the CHAI keynote surfaces one takeaway, let it be this: the profession that catches 18 unique HbA1c values across 100,000 rows in 30 seconds is the one sitting at the point of care.
💡 80/20: Before you train, fine-tune, or even cite a clinical dataset, open the CSV. Histogram the key continuous variables. Check cardinality. If a dataset that claims to represent 100,000 patients has 18 distinct values for HbA1c or 12 distinct ages, it did not come from patients. Try: pick one “medical” Kaggle dataset you’ve seen cited in a paper this year, run df.nunique() and df.describe(), and decide in five minutes whether the distribution is clinically possible. That is the audit that should have existed upstream.
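Here is a minimal version of that five-minute audit, assuming a local copy of the CSV (the filename below is a placeholder for whichever dataset you picked):

```python
import pandas as pd

df = pd.read_csv("kaggle_diabetes.csv")  # placeholder path

# Red flag 1: duplicate rows. The diabetes file above carried ~7 percent.
print(f"duplicate rows: {df.duplicated().mean():.1%}")

# Red flag 2: cardinality. A lab value with 18 distinct levels across
# 100,000 "patients" did not come out of a real analyzer.
print(df.nunique().sort_values())

# Red flag 3: impossible distributions. Scan min/max/quartiles for every
# continuous variable before you cite, train, or fine-tune.
print(df.describe().T)
```

If the duplicate rate is in whole percents, or a continuous lab has double-digit cardinality, stop and chase provenance before the file touches a model.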
📡 Builder’s Radar
Ambience Drew Its Real Map — and Ambient Scribe Is Just One of Five Squares
Ambience Healthcare used its Apex Summit this week to publish a five-domain roadmap: Clinical Workflows, Revenue Cycle/Integrity, Patient Experience, Care Orchestration, and Clinical Research. Point-of-care coding moves from retrospective to real-time, now offered on performance-based contracts where Ambience shares financial risk on coding accuracy. A new patient agent called Kait runs between visits. “Reasoning traces” from clinical encounters get pitched as structured data for computational phenotyping. The company also published enterprise metrics most AI vendors won’t: more than 80 percent clinician utilization, NPS above 60, and a claimed 3:1 operating-margin ROI.
😤 Haters
“Every scribe vendor is announcing a platform. This is a pivot deck.” It is a platform announcement from a company that still sells a scribe — so yes, there is a narrative shift. But performance-based coding contracts are a real commercial commitment. A vendor willing to put its fee at risk on coding accuracy is telling you something different than a vendor who ships a slide deck. The contract model is the part worth watching.
“Reasoning traces as a research substrate sounds like a privacy bomb.” It can be. De-identification at the encounter-narrative level is famously hard — the text itself is the identifier. If Ambience is positioning these traces as research infrastructure, the BAA and data-use terms matter more than the feature announcement. Ask how reasoning traces flow back to the research pipeline before any of this reaches a real chart.
💡 80/20: The scribe is not the product anymore. It is the wedge that gets the contract. Reframe: when evaluating any ambient vendor this year, look past the transcript and ask what the company has committed to ship in the other four domains — and what it will put at risk financially on each.
Hippocratic Put a Voice Agent in the Hallway and Another One on the Phone
Hippocratic AI launched two voice products this week. AI Front Door is a cross-channel patient agent — phone, text, app — that holds longitudinal patient memory across scheduling, billing, and care coordination. Nurse Co-Pilot sits on the inpatient floor, handling admit and discharge education, medication teach-back, and caregiver engagement. Both products were co-developed with Cleveland Clinic, OhioHealth, and Cincinnati Children’s. The pitch: 1 to 4 hours returned per nursing shift.
😤 Haters
“Voice AI in a hospital is a liability magnet. One wrong med-adherence answer and it’s national news.” Not unreasonable. But Hippocratic’s framing — explicit human-clinician checkpoints and EHR documentation of the voice interaction — is the right shape for inpatient deployment. The question is not whether the voice agent is safe in a vacuum; it is whether the checkpoint is enforced in practice when the nurse is busy.
“‘1 to 4 hours per shift’ is a vendor claim, not a study.” Correct. The number is the ceiling, not the median. The useful framing is: does the voice agent reduce the specific burdensome tasks your nurses already flag, or does it move time around without reducing it? A 30-minute time-motion study on two shifts will tell you more than the press release.
💡 80/20: The patient call center and the inpatient education binder are both about to become agent-addressable surfaces. Try: before buying into any voice product, sit at the nursing station for an hour and log which tasks the existing staff actively want handed off. The answer is usually three specific ones, not “everything.”
Keebler Health Raised $16M to Build Risk Adjustment That Was Born After Transformers
Keebler Health closed a $16M Series A led by Flare Capital with Sands Capital participating, bringing its total raised since 2023 to $23M. The pitch: the risk-adjustment category was built on legacy NLP that retrofitted itself onto LLMs. Keebler was built LLM-native from day one. The claimed opening: roughly 80 percent of the clinical data that matters for HCC coding is unstructured (prior notes, scanned reports, specialist letters), and only 59.4 percent of chronic conditions are consistently captured across EHR sources. Point-of-care insights sit inside the existing workflow rather than arriving as a retrospective chart pull. RADV audit readiness is the near-term roadmap as CMS sharpens scrutiny.
😤 Haters
“‘LLM-native’ is a marketing word. The wrappers all look the same from outside.” True if you are skimming the homepage. Less true when you read the architecture. The legacy-NLP vendors have model chains, ontology lookups, and rule engines glued together over a decade. An LLM-native stack collapses most of that into retrieval plus a single reasoning pass (sketched at the end of this item), which is cheaper, faster to iterate, and weirdly easier to audit. The moat argument cuts both ways, though: if it’s easy for Keebler to build, it’s easy for the incumbents to rebuild.
“Risk adjustment is under CMS audit pressure for over-coding, and the whole pitch is ‘find more HCCs.’” This is the sharpest objection. The thing that makes this category interesting also makes it dangerous. If the RADV audit readiness story is a real product — not just a bolt-on feature — Keebler has to show that the HCCs the model surfaces are defensible in a chart-level audit. Otherwise it is just a faster way to book revenue a payer will claw back.
💡 80/20: The ground-up-LLM positioning is going to eat a lot of legacy NLP in clinical revenue cycle this year. Reframe: when your health system’s RCM vendor pitches its “new AI features,” ask whether the underlying data pipeline was rebuilt or whether a transformer was bolted onto the old one. The answer predicts the product’s ceiling.
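To make the “retrieval plus a single reasoning pass” claim concrete, here is a minimal sketch. To be clear, this is not Keebler’s actual stack or API: retrieve and complete are hypothetical stand-ins for any search over a patient’s unstructured record and any chat-completion call.

```python
def suggest_hccs(patient_id: str, retrieve, complete) -> str:
    """Sketch of an LLM-native HCC pass. `retrieve` and `complete` are
    hypothetical stand-ins, injected so the sketch stays vendor-agnostic."""
    # 1. Pull the unstructured evidence the legacy pipelines miss:
    #    prior notes, scanned reports, specialist letters.
    chunks = retrieve(patient_id, query="chronic conditions, diagnoses, meds", k=20)

    # 2. One reasoning pass replaces the old chain (NER -> ontology lookup
    #    -> rule engine). The prompt carries the evidence and demands
    #    citations back to the source text.
    prompt = (
        "You are assisting with risk-adjustment coding. From the excerpts "
        "below, list candidate HCC conditions. For each one, quote the exact "
        "supporting sentence and name its source document so a chart-level "
        "(RADV-style) audit can verify it. If the evidence is ambiguous, say so.\n\n"
        + "\n---\n".join(chunks)
    )
    return complete(prompt)
```

Note where the second objection lives: in the prompt. An HCC suggestion that cannot quote its evidence is not defensible in a chart-level audit, so the citation requirement belongs inside the reasoning pass, not bolted on downstream.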
🧰 Builder’s Tip
Mindset: The Audit Is the Product
Every clinician-builder I know spends the first month of any project wishing they could skip the boring part. Read the CSV. Histogram the continuous variables. Pull 20 charts and compare to what patients actually report. Spend a Saturday reading your vendor’s BAA.
Skip it, and you get the 124-papers problem — clean-looking artifacts built on a foundation nobody checked.
Don’t skip it. The audit is not the overhead. The audit is the defensible part. A dashboard, a model, a clinical workflow — any of these can be cloned in a weekend by someone with a better LLM and a faster GPU. What cannot be cloned is the clinician who read the files, asked the right question, and wrote down why the numbers were wrong.
That is why domain expertise is the scarce input. Not because the coding is hard. Because the skepticism is.
If you want one habit to take into the week: open the data before you open the notebook. Distributions first. Models second. If the histogram looks impossible, the model will be impossible too — it just won’t tell you.
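If you want the habit as muscle memory, here is a minimal sketch (the path is a placeholder for whatever CSV you were about to model):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("dataset.csv")  # placeholder: the file you were about to model

# Distributions first: one histogram per numeric column, before any model code.
df.select_dtypes(include="number").hist(bins=50, figsize=(12, 8))
plt.tight_layout()
plt.show()
```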
What are you building this week? Reply and tell me — I read every one.
— Kevin