Scheduled tasks on Claude Code web, agents fail unwatched, AutoBe
💡 Builder's Radar
Claude Code can now run scheduled tasks on Anthropic's infrastructure overnight
Anthropic shipped scheduled tasks for Claude Code on the web: agents that run on Anthropic-managed infrastructure even when your device is off. Example use cases from the docs: reviewing open pull requests every morning, analyzing CI failures overnight, syncing documentation after PRs merge, and running dependency audits weekly. Available to all Claude Code on the web users as of March 30.
🤔 Haters
"Running clinical data through Anthropic's cloud infrastructure for scheduled tasks is a HIPAA problem, not a feature." This is the right concern. ⚠️ Do not point real patient data at scheduled tasks until you've verified BAA availability for Claude Code on the web; this isn't documented clearly yet. The appropriate use cases right now are for your tools and code, not patient records: overnight documentation audits of your codebase, weekly dependency vulnerability scans, automated issue triage summaries.
"Claude Code is a developer tool. I'm a clinician who builds things, not a developer." If you use Claude Code to build your clinical tools, you can now automate the maintenance work, the part that eats into your protected build time. That's the unlock.
💡 80/20: The clinical workflow use cases will come. Today's immediate value is for the solo clinician-builder who does their own maintenance: nightly alerts when a dependency breaks, weekly summaries of open issues in your project. Try: set up a weekly task that reviews your project's README and flags anything that's no longer accurate; clinical tools drift fast, and documentation drifts faster.
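One way to phrase that weekly task, as a sketch. The wording is mine, not from Anthropic's docs, and it assumes your setup allows the task to open issues:

```
Every Monday at 7 AM: read README.md and compare it against the current
codebase. Flag any setup step, feature description, or referenced file
that no longer matches the code. Open an issue titled "README drift"
listing each mismatch and the file it conflicts with.
```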
→ Full write-up
AI agents don't fail the way you think, and clinical workflows can't afford the difference
Nate published a March 30 analysis of how AI Skills (the structured, reusable capability specs that agents invoke) behave differently when no human is watching. His finding: Skills built to work when a human oversees the output fail at dramatically higher rates when agents invoke them autonomously in loops. "Fail 10% of the time when you're watching. Fail 100% of the time when you're not." The mechanism: skills written for human-supervised use cases don't specify their failure modes, because a human can recognize and redirect a bad output. An agent cannot.
🤔 Haters
"This is about Microsoft Office Skills, productivity software. Not the same as clinical AI agents." The mechanism is identical. A skill spec that doesn't define what to return on a missing input, a malformed result, or an edge case outside its training distribution fails the same way whether it's analyzing a quarterly report or checking medication dose thresholds.
"10% failure rate is acceptable for most workflows." Not in clinical workflows. A 10% failure rate on a medication interaction check is one missed interaction per 10 queries. At 30 medication reviews per shift, that's 3 silent errors per physician per day. The tolerance for silent failure in clinical AI is categorically different from productivity tools.
💡 80/20: Every clinical AI skill you build should be specified as if no human will ever see its output. That means explicit failure returns: not just "return the result" but "return {status: 'error', reason: 'missing_dosing_context'} when the input doesn't match the expected structure." Try: write the failure cases before you write the success case. If you can't enumerate what your skill does when things go wrong, it isn't ready to run without oversight.
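Here's what writing the failure cases first looks like in TypeScript. This is a minimal sketch: the result shape, reason codes, and dose threshold are all illustrative placeholders, not a real Skills API.

```typescript
// Illustrative sketch: enumerate failure modes in the type, then write them first.
type SkillResult<T> =
  | { status: "ok"; data: T }
  | { status: "error"; reason: "malformed_input" | "missing_dosing_context" | "out_of_range_dose" };

interface DoseCheckInput {
  medication?: string;
  doseMg?: number;
  egfr?: number; // renal function; required dosing context for many drugs
}

function checkDoseThreshold(input: DoseCheckInput): SkillResult<{ withinThreshold: boolean }> {
  // Failure cases first: an unsupervised agent cannot "notice" a bad output,
  // so the skill itself must name what went wrong.
  if (!input.medication || input.doseMg === undefined) {
    return { status: "error", reason: "malformed_input" };
  }
  if (input.egfr === undefined) {
    return { status: "error", reason: "missing_dosing_context" };
  }
  if (input.doseMg <= 0 || input.doseMg > 10_000) {
    return { status: "error", reason: "out_of_range_dose" };
  }
  // Success case last. The 400 mg threshold is a placeholder, not clinical guidance.
  return { status: "ok", data: { withinThreshold: input.doseMg <= 400 } };
}
```

An agent loop can branch on `status` deterministically; a vague string like "I couldn't find the dose" cannot be branched on, which is exactly how silent failures compound.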
→ Full write-up
🛠️ From the Workbench
AutoBe: an AI agent that actually writes working backends
AutoBe is an open-source agent that takes a natural language conversation and generates a complete backend: data types, API endpoints, function stubs. The interesting part is the harness: it uses type schemas to constrain what the model can output, then runs a compiler to verify the result, then feeds structured error messages back to the model in a feedback loop. The result, per their March 30 write-up, is a jump in function-calling success rates from 6.75% to 99.8% for the tested models. The mechanism (constrain → compile → isolate → classify → feed back) is a general pattern for making AI code generation actually reliable.
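A minimal sketch of that loop in TypeScript. `generateCode` is a hypothetical stand-in for the model call, and this uses the TypeScript compiler's `transpileModule` for brevity, which only surfaces syntactic errors; a full harness like the one described would run the type checker as well.

```typescript
import * as ts from "typescript";

// Hedged sketch of constrain → compile → classify → feed back.
// generateCode is hypothetical: any LLM call that returns source text.
async function generateUntilItCompiles(
  prompt: string,
  generateCode: (p: string) => Promise<string>,
  maxAttempts = 5
): Promise<string> {
  let currentPrompt = prompt;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const source = await generateCode(currentPrompt);
    // Verify the candidate and collect structured diagnostics.
    const result = ts.transpileModule(source, {
      compilerOptions: { strict: true },
      reportDiagnostics: true,
    });
    const errors = (result.diagnostics ?? []).map((d) =>
      ts.flattenDiagnosticMessageText(d.messageText, "\n")
    );
    if (errors.length === 0) return source; // compiler-verified output
    // Feed specific errors back instead of asking the model to "try again".
    currentPrompt =
      `${prompt}\n\nYour previous attempt failed to compile:\n` +
      `${errors.join("\n")}\nFix exactly these errors.`;
  }
  throw new Error(`No compiling candidate after ${maxAttempts} attempts`);
}
```

The point is not the compiler specifically; it's that every check is mechanical and every error message is specific enough for the model to act on.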
⚠️ Verify: "AI-generated backend" does not mean "audited backend." For any clinical tool that touches PHI, AI-generated code is the starting scaffold, not the finished product. Before AutoBe-generated code handles real patient data, it needs a security review, input validation for clinical edge cases, and explicit error handling that goes beyond what the generator will produce by default.
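For the input-validation piece, here's a hedged sketch of what a clinical edge case looks like at the API boundary. The library choice (zod) and the field names and bounds are mine, purely illustrative:

```typescript
import { z } from "zod";

// A generator will happily accept { value: number }. Clinical judgment adds
// plausibility bounds and confounders like hemolysis, which falsely elevates potassium.
const PotassiumResult = z.object({
  patientId: z.string().min(1),
  valueMmolPerL: z.number().min(1.0).max(12.0), // illustrative plausibility bounds
  collectedAt: z.coerce.date(),
  hemolyzed: z.boolean(),
});

export function ingestPotassium(raw: unknown) {
  const parsed = PotassiumResult.safeParse(raw);
  if (!parsed.success) {
    // Machine-readable rejection, not a silent pass-through.
    return { status: "error" as const, issues: parsed.error.issues };
  }
  return { status: "ok" as const, data: parsed.data };
}
```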
🤔 Haters
"AI-generated backends for clinical tools are a compliance nightmare." They are, if left unreviewed. The harness approach improves code quality significantly compared to direct model output, but it doesn't produce SOC 2-ready or HIPAA-audited code. Use it to generate the skeleton and data models fast, then audit before connecting to anything real.
"A 6.75% → 99.8% improvement on a shopping mall benchmark doesn't translate to clinical API reliability." True. The test task was specific. But the harness principle (constrain outputs with type schemas, verify with a compiler, feed structured error messages back) is transferable to any code generation task, including clinical data models.
💡 80/20: AutoBe compresses the time from "I know what my clinical tool needs to do" to "I have a working backend scaffold to audit and extend." That's the legitimate value. Use it as an accelerator for the parts where your engineering knowledge is the bottleneck, not as a substitute for the clinical judgment in the data model design. Try: feed AutoBe a natural language description of your clinical workflow and see what data types it proposes. The gaps in its model will tell you exactly where your domain expertise is irreplaceable.
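To make "the gaps will tell you" concrete, here's the shape of the delta to look for. This is not actual AutoBe output; it's an illustration of a plausibly generic generated model versus the fields only a clinician knows to demand:

```typescript
// What a generator plausibly proposes (illustrative, not real AutoBe output):
interface GeneratedMedicationOrder {
  patientId: string;
  medication: string;
  doseMg: number;
  frequency: string;
}

// What clinical judgment adds on review:
interface ReviewedMedicationOrder extends GeneratedMedicationOrder {
  renalDosing: boolean;         // does this drug need eGFR-adjusted dosing?
  lastEgfr?: number;            // most recent renal function, if known
  interactionsChecked: boolean; // was an interaction check actually run?
  isVerbalOrder: boolean;       // verbal orders carry different verification rules
}
```

Every field in the second interface that the generator missed is a place where your domain expertise is the product.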
🎯 Clinician-Builder Tip of the Day
Before your next build session, write one paragraph describing the specific clinical moment your tool is for. Not the feature list; the moment. "It's 2 AM, a nurse has flagged a potassium of 6.1, the covering resident has three other things going on, and the EHR doesn't surface the patient's last three potassiums or their renal function trend." That paragraph is your system prompt, your design spec, and your evaluation criteria. Everything your AI assistant builds will be better grounded for it. The clinicians who build the best tools aren't the ones who know the most about software; they're the ones who can describe the problem with that kind of specificity.
What are you building this week? Reply and tell me; I read every one.
– Kevin

