Scheduled tasks on Claude Code web, agents fail unwatched, AutoBe
💡 Builder's Radar
Claude Code can now run scheduled tasks on Anthropic's infrastructure overnight
Anthropic shipped scheduled tasks for Claude Code on the web: agents that run on Anthropic-managed infrastructure even when your device is off. Example use cases from the docs: reviewing open pull requests every morning, analyzing CI failures overnight, syncing documentation after PRs merge, and running dependency audits weekly. Available to all Claude Code on the web users as of March 30.
🤔 Haters
"Running clinical data through Anthropic's cloud infrastructure for scheduled tasks is a HIPAA problem, not a feature." This is the right concern. ⚠️ Do not point real patient data at scheduled tasks until you've verified BAA availability for Claude Code on the web; this isn't documented clearly yet. The appropriate use cases right now are for your tools and code, not patient records: overnight documentation audits of your codebase, weekly dependency vulnerability scans, automated issue triage summaries.
"Claude Code is a developer tool. I'm a clinician who builds things, not a developer." If you use Claude Code to build your clinical tools, you can now automate the maintenance work, the part that eats into your protected build time. That's the unlock.
💡 80/20: The clinical workflow use cases will come. Today's immediate value is for the solo clinician-builder who does their own maintenance: nightly alerts when a dependency breaks, weekly summaries of open issues in your project. Try: set up a weekly task that reviews your project's README and flags anything that's no longer accurate; clinical tools drift fast, and documentation drifts faster.
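One way to phrase that weekly task, as a sketch. The wording is mine, not from Anthropic's docs, and it assumes your setup allows the task to open issues:

```
Every Monday at 7 AM: read README.md and compare it against the current
codebase. Flag any setup step, feature description, or referenced file
that no longer matches the code. Open an issue titled "README drift"
listing each mismatch and the file it conflicts with.
```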
→ Full write-up
AI agents don't fail the way you think, and clinical workflows can't afford the difference
Nate published a March 30 analysis of how AI Skills (the structured, reusable capability specs that agents invoke) behave differently when no human is watching. His finding: Skills built to work when a human oversees the output fail at dramatically higher rates when agents invoke them autonomously in loops. "Fail 10% of the time when you're watching. Fail 100% of the time when you're not." The mechanism: skills written for human-supervised use cases don't specify their failure modes, because a human can recognize and redirect a bad output. An agent cannot.
🤔 Haters
"This is about Microsoft Office Skills, productivity software. Not the same as clinical AI agents." The mechanism is identical. A skill spec that doesn't define what to return on a missing input, a malformed result, or an edge case outside its training distribution fails the same way whether it's analyzing a quarterly report or checking medication dose thresholds.
"10% failure rate is acceptable for most workflows." Not in clinical workflows. A 10% failure rate on a medication interaction check is one missed interaction per 10 queries. At 30 medication reviews per shift, that's 3 silent errors per physician per day. The tolerance for silent failure in clinical AI is categorically different from productivity tools.
💡 80/20: Every clinical AI skill you build should be specified as if no human will ever see its output. That means explicit failure returns: not just "return the result" but "return {status: 'error', reason: 'missing_dosing_context'} when the input doesn't match the expected structure." Try: write the failure cases before you write the success case. If you can't enumerate what your skill does when things go wrong, it isn't ready to run without oversight.
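Here's what writing the failure cases first looks like in TypeScript. This is a minimal sketch: the result shape, reason codes, and dose threshold are all illustrative placeholders, not a real Skills API.

```typescript
// Illustrative sketch: enumerate failure modes in the type, then write them first.
type SkillResult<T> =
  | { status: "ok"; data: T }
  | { status: "error"; reason: "malformed_input" | "missing_dosing_context" | "out_of_range_dose" };

interface DoseCheckInput {
  medication?: string;
  doseMg?: number;
  egfr?: number; // renal function; required dosing context for many drugs
}

function checkDoseThreshold(input: DoseCheckInput): SkillResult<{ withinThreshold: boolean }> {
  // Failure cases first: an unsupervised agent cannot "notice" a bad output,
  // so the skill itself must name what went wrong.
  if (!input.medication || input.doseMg === undefined) {
    return { status: "error", reason: "malformed_input" };
  }
  if (input.egfr === undefined) {
    return { status: "error", reason: "missing_dosing_context" };
  }
  if (input.doseMg <= 0 || input.doseMg > 10_000) {
    return { status: "error", reason: "out_of_range_dose" };
  }
  // Success case last. The 400 mg threshold is a placeholder, not clinical guidance.
  return { status: "ok", data: { withinThreshold: input.doseMg <= 400 } };
}
```

An agent loop can branch on `status` deterministically; a vague string like "I couldn't find the dose" cannot be branched on, which is exactly how silent failures compound.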
→ Full write-up
🛠️ From the Workbench
AutoBe: an AI agent that actually writes working backends
AutoBe is an open-source agent that takes a natural language conversation and generates a complete backend: data types, API endpoints, function stubs. The interesting part is the harness: it uses type schemas to constrain what the model can output, then runs a compiler to verify the result, then feeds structured error messages back to the model in a feedback loop. The result, per their March 30 write-up, is a jump in function-calling success rates from 6.75% to 99.8% for the tested models. The mechanism (constrain → compile → isolate → classify → feed back) is a general pattern for making AI code generation actually reliable.
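A minimal sketch of that loop in TypeScript. `generateCode` is a hypothetical stand-in for the model call, and this uses the TypeScript compiler's `transpileModule` for brevity, which only surfaces syntactic errors; a full harness like the one described would run the type checker as well.

```typescript
import * as ts from "typescript";

// Hedged sketch of constrain → compile → classify → feed back.
// generateCode is hypothetical: any LLM call that returns source text.
async function generateUntilItCompiles(
  prompt: string,
  generateCode: (p: string) => Promise<string>,
  maxAttempts = 5
): Promise<string> {
  let currentPrompt = prompt;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const source = await generateCode(currentPrompt);
    // Verify the candidate and collect structured diagnostics.
    const result = ts.transpileModule(source, {
      compilerOptions: { strict: true },
      reportDiagnostics: true,
    });
    const errors = (result.diagnostics ?? []).map((d) =>
      ts.flattenDiagnosticMessageText(d.messageText, "\n")
    );
    if (errors.length === 0) return source; // compiler-verified output
    // Feed specific errors back instead of asking the model to "try again".
    currentPrompt =
      `${prompt}\n\nYour previous attempt failed to compile:\n` +
      `${errors.join("\n")}\nFix exactly these errors.`;
  }
  throw new Error(`No compiling candidate after ${maxAttempts} attempts`);
}
```

The point is not the compiler specifically; it's that every check is mechanical and every error message is specific enough for the model to act on.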
⚠️ Verify: "AI-generated backend" does not mean "audited backend." For any clinical tool that touches PHI, AI-generated code is the starting scaffold, not the finished product. Before AutoBe-generated code handles real patient data, it needs a security review, input validation for clinical edge cases, and explicit error handling that goes beyond what the generator will produce by default.
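For the input-validation piece, here's a hedged sketch of what a clinical edge case looks like at the API boundary. The library choice (zod) and the field names and bounds are mine, purely illustrative:

```typescript
import { z } from "zod";

// A generator will happily accept { value: number }. Clinical judgment adds
// plausibility bounds and confounders like hemolysis, which falsely elevates potassium.
const PotassiumResult = z.object({
  patientId: z.string().min(1),
  valueMmolPerL: z.number().min(1.0).max(12.0), // illustrative plausibility bounds
  collectedAt: z.coerce.date(),
  hemolyzed: z.boolean(),
});

export function ingestPotassium(raw: unknown) {
  const parsed = PotassiumResult.safeParse(raw);
  if (!parsed.success) {
    // Machine-readable rejection, not a silent pass-through.
    return { status: "error" as const, issues: parsed.error.issues };
  }
  return { status: "ok" as const, data: parsed.data };
}
```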
🤔 Haters
"AI-generated backends for clinical tools are a compliance nightmare." They are, if left unreviewed. The harness approach improves code quality significantly compared to direct model output, but it doesn't produce SOC 2-ready or HIPAA-audited code. Use it to generate the skeleton and data models fast, then audit before connecting to anything real.
"A 6.75% → 99.8% improvement on a shopping mall benchmark doesn't translate to clinical API reliability." True. The test task was specific. But the harness principle (constrain outputs with type schemas, verify with a compiler, feed structured error messages back) is transferable to any code generation task, including clinical data models.
💡 80/20: AutoBe compresses the time from "I know what my clinical tool needs to do" to "I have a working backend scaffold to audit and extend." That's the legitimate value. Use it as an accelerator for the parts where your engineering knowledge is the bottleneck, not as a substitute for the clinical judgment in the data model design. Try: feed AutoBe a natural language description of your clinical workflow and see what data types it proposes. The gaps in its model will tell you exactly where your domain expertise is irreplaceable.
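To make "the gaps will tell you" concrete, here's the shape of the delta to look for. This is not actual AutoBe output; it's an illustration of a plausibly generic generated model versus the fields only a clinician knows to demand:

```typescript
// What a generator plausibly proposes (illustrative, not real AutoBe output):
interface GeneratedMedicationOrder {
  patientId: string;
  medication: string;
  doseMg: number;
  frequency: string;
}

// What clinical judgment adds on review:
interface ReviewedMedicationOrder extends GeneratedMedicationOrder {
  renalDosing: boolean;         // does this drug need eGFR-adjusted dosing?
  lastEgfr?: number;            // most recent renal function, if known
  interactionsChecked: boolean; // was an interaction check actually run?
  isVerbalOrder: boolean;       // verbal orders carry different verification rules
}
```

Every field in the second interface that the generator missed is a place where your domain expertise is the product.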
🎯 Clinician-Builder Tip of the Day
Before your next build session, write one paragraph describing the specific clinical moment your tool is for. Not the feature list; the moment. "It's 2 AM, a nurse has flagged a potassium of 6.1, the covering resident has three other things going on, and the EHR doesn't surface the patient's last three potassiums or their renal function trend." That paragraph is your system prompt, your design spec, and your evaluation criteria. Everything your AI assistant builds will be better grounded for it. The clinicians who build the best tools aren't the ones who know the most about software; they're the ones who can describe the problem with that kind of specificity.
What are you building this week? Reply and tell me; I read every one.
– Kevin

