OpenAI shipped 1M lines without writing code 🏗️, Kintsugi got FDA clearance and went bankrupt 💀, Patch Wave 🌊
OpenAI Shipped 1M Lines of Code, None Written by Hand
Seven engineers on OpenAI’s Frontier Product Exploration team spent five months building an internal product — an Electron app — with literally zero manually written code. Every line: application logic, tests, CI configs, documentation, observability, tooling. All written by Codex agents. They call the discipline “harness engineering”: designing the constraints, feedback loops, and documentation structures that channel AI agents toward reliable output. The team estimates they shipped at 10x the speed of hand-coding.
The key enabler wasn’t a better model. It was application legibility — making UI state, logs, and metrics queryable by the agent. They replaced massive instruction files with an AGENTS.md table of contents using progressive disclosure. They enforced strict architecture layers (Types → Config → Repo → Service → Runtime → UI) with custom linters that gave remediation instructions directly to the agent. They treated technical debt as compound interest, with background agents continuously scanning for drift.
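What might one of those architecture linters look like at hobby scale? A minimal sketch in Python, assuming code lives in packages named after the layers; the file layout, messages, and names below are illustrative guesses, not OpenAI’s actual tooling:

# layer_lint.py: toy architecture linter. Rule: a module may import only
# from its own layer or lower ones (Types is lowest, UI is highest).
import ast
import sys
from pathlib import Path

LAYERS = ["types", "config", "repo", "service", "runtime", "ui"]  # low -> high
RANK = {name: i for i, name in enumerate(LAYERS)}

def lint(path):
    src = next((p for p in path.parts if p in RANK), None)
    if src is None:
        return []
    errors = []
    for node in ast.walk(ast.parse(path.read_text(), filename=str(path))):
        if isinstance(node, ast.Import):
            mods = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            mods = [node.module]
        else:
            continue
        for mod in mods:
            top = mod.split(".")[0]
            if top in RANK and RANK[top] > RANK[src]:
                # Remediation written for the agent, not just a complaint:
                errors.append(
                    f"{path}:{node.lineno}: layer '{src}' imports from higher "
                    f"layer '{top}'. Fix: move the shared code into '{src}' or "
                    f"below, or depend on an interface defined in a lower layer."
                )
    return errors

if __name__ == "__main__":
    problems = [e for f in Path("src").rglob("*.py") for e in lint(f)]
    print("\n".join(problems))
    sys.exit(1 if problems else 0)

The detail that matters: the error message doubles as the agent’s next instruction.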
😤 Haters
“This only works for internal tools nobody has to maintain.” The whole point of harness engineering is maintainability. The architecture linters, the automated cleanup PRs, the progressive-disclosure docs — that IS the maintenance. The system maintains itself because the maintenance was designed into the harness from day one.
“Real software is messier than an internal experiment.” Sure. And real clinical care is messier than a simulation study. The point isn’t that the experiment is production medicine. The point is that the engineering role shifted from writing code to designing environments. That shift is real, and it’s permanent.
“Clinicians can’t do this — they don’t have Codex and Symphony orchestration.” You don’t need Symphony. You need the mental model. Application legibility. Progressive disclosure. Architecture layers that give the agent guardrails. Claude Code with a well-written AGENTS.md file and a linter that says “you violated the schema, here’s how to fix it” is the same primitive at hobby scale.
💡 80/20: The job didn’t change titles — it changed surfaces. The leverage for clinician-builders isn’t learning to code faster. It’s learning to design environments where agents code well. Try: write an AGENTS.md for your next project before writing any application code. Describe the architecture layers, the naming conventions, the test expectations. Let the agent read that instead of reading your mind.
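For instance, a starter skeleton might look like this (the structure, paths, and section names are illustrative, not a standard):

# AGENTS.md: a table of contents, not a manual. Keep it short; link out for depth.

## Architecture
- Six layers, imports only point downward: Types -> Config -> Repo -> Service -> Runtime -> UI.
- Details live in docs/architecture.md; read it before touching a layer boundary.

## Conventions
- snake_case modules, PascalCase classes; tests mirror source paths under tests/.
- Every public service-layer function gets a docstring and a unit test.

## Tests
- Run pytest -q; a change is not done until the suite is green.

## Deeper docs (read only when the task needs them)
- docs/data-model.md: schema and migration rules
- docs/observability.md: logging and metrics conventions

That last section is the progressive disclosure: the agent pulls detail on demand instead of wading through one giant instruction file.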
Kintsugi Got FDA Clearance, Clinical Validation, and Still Went Bankrupt
Kintsugi spent seven years and ~$30M building a vocal biomarker platform that detected depression and anxiety from 20 seconds of audio — pitch, speech rate, pauses. Clinically validated. FDA cleared. Twenty seconds was all it needed. They shut down in February 2026 and open-sourced everything on Hugging Face. The failure mode: $16M spent on four years of FDA presubmissions against a venture model that demands returns in three to five years. Category creation on a regulatory timeline that doesn’t fit the capital structure.
😤 Haters
“If the tech worked, they should have found a market.” The tech did work. The market wasn’t a tech problem — it was a reimbursement problem. No CPT code for vocal biomarkers. No payer pathway. No EHR integration surface that made it easy to deploy in a 15-minute primary care visit. The clinical evidence was necessary but not sufficient.
“This means digital mental health doesn’t work.” No, this means venture-funded regulatory plays in novel categories have a structural timing mismatch. Kintsugi’s models are now on Hugging Face. Someone with a different capital structure — a health system lab, an academic group, a clinician with a different go-to-market — can pick them up. The technology survived the company.
💡 80/20: Validation ≠ viability. Before spending three years on FDA, ask: does a reimbursement pathway exist for what I’m building? If the answer is “we’ll create the category,” your capital structure needs to match that timeline — and VC probably doesn’t. Reframe: build on surfaces where the payer pathway already exists and the technology is the bottleneck, not the other way around.
NCSC Warns of AI-Driven “Patch Wave” — Decades of Buried Flaws Surfacing at Once
The UK’s National Cyber Security Centre warned organizations to prepare for a “patch wave”: AI is now unearthing decades of buried software flaws faster than anyone can patch them. Anthropic’s Claude Mythos found 2,000+ previously unknown vulnerabilities in seven weeks of testing — including a 27-year-old OpenBSD bug and a 17-year-old FreeBSD remote code execution flaw. A separate AI-discovered Linux flaw (“Copy Fail”) grants full root access on every major distro shipped since 2017, via a 732-byte exploit script. Over 99% of the Mythos-discovered flaws remain unpatched.
😤 Haters
“This is a UK government blog post, not a US healthcare story.” Every hospital has some Linux somewhere, and medical devices routinely run on the affected operating systems. When the patch wave hits, health systems running legacy infrastructure — which is most of them — will face a choice between operational disruption (patching) and operational risk (not patching). That’s a clinician-builder’s problem.
“AI finding vulns is just security researchers working faster.” The scale is the qualitative shift. 2,000+ in seven weeks. The gap between “flaw discovered” and “flaw exploited” is shrinking from weeks to hours. If you’re building tools that touch any infrastructure that hasn’t been patched in the last 90 days — and in healthcare, that’s most infrastructure — you need to account for this threat surface.
Perplexity’s “Agent Skills” Design Document
Perplexity published a detailed research article on how they design, refine, and maintain modular Agent Skills for their frontier products. Key insight: unlike traditional software, Skill development is shaped by real user queries and continuous evaluation across multiple models, not upfront requirements. Each Skill has an inherent “cost” (latency, context, error surface) that must be justified by its quality contribution. The 20-minute read covers hierarchies, efficiency-quality tradeoffs, and iteration patterns.
⚠️ Verify: This is an engineering blog post, not a peer-reviewed methodology. The patterns are worth studying, but Perplexity’s scale and model access differ from what a solo clinician-builder has. Adapt the mental models; don’t cargo-cult the infrastructure.
😤 Haters
“This is irrelevant to healthcare — it’s a search company’s internal process.” The architecture patterns are domain-agnostic. If you’re building any agent that has multiple capabilities (a clinical assistant that can do med-rec AND search literature AND draft notes), the cost-vs-quality tradeoff per Skill is exactly the design problem you face. Perplexity solved it at scale; you can apply the principles at hobby scale.
“I don’t need modular Skills — I just need one good prompt.” And then your one good prompt gets context-stuffed, slows down, and starts hallucinating on edge cases. Modularity isn’t premature optimization. It’s how you keep each capability sharp as the system grows.
💡 80/20: The core principle: every capability you add to an agent has a cost (more context = more latency = more confusion surface). Before adding the next feature to your clinical AI tool, ask: does this Skill earn its cost in quality?
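One way to make the question concrete is a per-Skill ledger with explicit budgets. A minimal sketch in Python, assuming you measure the numbers on your own eval set; the fields and thresholds are made up for illustration, not Perplexity’s method:

# skill_ledger.py: toy cost-vs-quality gate for adding an agent Skill.
from dataclasses import dataclass

@dataclass
class Skill:
    name: str
    prompt_tokens: int        # context its instructions consume per request
    added_latency_ms: float   # average latency it adds per request
    quality_lift: float       # win-rate gain on your eval set, e.g. +0.04

def earns_its_cost(s: Skill, token_budget=2000, latency_budget_ms=500.0,
                   min_lift=0.02):
    # A Skill ships only if it fits both budgets AND measurably moves quality.
    return (s.prompt_tokens <= token_budget
            and s.added_latency_ms <= latency_budget_ms
            and s.quality_lift >= min_lift)

med_rec = Skill("medication-reconciliation", prompt_tokens=1200,
                added_latency_ms=300.0, quality_lift=0.06)
print(earns_its_cost(med_rec))  # True under these made-up budgets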
Tool Spotlight: AutoRound — Quantize Medical Models for Local Use in 10 Minutes
AutoRound is Intel’s open-source quantization toolkit for LLMs and VLMs. It achieves high accuracy at ultra-low bit widths and can quantize a 7B-parameter model in 10 minutes on a single GPU. If you’re running Ollama or LM Studio with medical models (MedGemma, BioMistral, clinical fine-tunes from Hugging Face), AutoRound lets you create custom GPTQ/AWQ quantizations optimized for your specific hardware — often with better quality than the default quantizations available on model hubs.
Why this matters for clinician-builders: The default GGUF quantizations on Hugging Face are one-size-fits-all. AutoRound lets you create a quantization calibrated on medical text (discharge summaries, clinical notes, medication lists) so the model retains more clinical accuracy at smaller sizes. A 4-bit quantization calibrated on clinical text can outperform a generic 4-bit quantization on clinical tasks, sometimes approaching the generic 8-bit version; verify the gap on your own questions rather than taking it on faith.
Try this weekend:
pip install auto-round
auto-round --model your-medical-model --bits 4 --group_size 128 \
  --dataset your-clinical-corpus.txt --output_dir ./quantized
Flag names can vary between auto-round releases; run auto-round --help to confirm before you start. Run on synthetic clinical notes, then compare the output against the default quantization on 10 clinical questions. All local, all synthetic, zero PHI risk.
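For the comparison step, here is a minimal sketch of a harness, assuming both quantizations are loaded into a local Ollama instance; the model names and questions are placeholders:

# compare_quants.py: ask two local quantizations the same synthetic questions.
# Assumes Ollama is running at the default port with both models available.
import json
import urllib.request

MODELS = ["medmodel-default-q4", "medmodel-autoround-q4"]  # placeholder names
QUESTIONS = [
    "List red-flag symptoms in a patient presenting with acute low back pain.",
    "What baseline monitoring does a patient starting lithium need?",
    # ...extend to 10 synthetic questions; no PHI anywhere
]

def ask(model, prompt):
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps({"model": model, "prompt": prompt,
                         "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

for q in QUESTIONS:
    print(f"\n=== {q}")
    for m in MODELS:
        print(f"\n[{m}]\n{ask(m, q)}")

Reading the answer pairs side by side is crude, but it is usually enough to catch a quantization that lost clinical specifics.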
What are you building this week? Reply and tell me — I read every one.
— Kevin


