Hands-Off Workflow
Building Trust Through Guardrails
TL;DR: 12 commits, 6,500 lines of code, 4 hours 6 minutes — merged cleanly.
I let an AI write 6,500 lines of code while I made coffee.
Until now, I've been working with AI exactly how you probably expect: one change at a time, approving every command, reviewing every edit, signing off on every deploy.
After weeks of tightening specs, CI gates, and rollback paths, I felt confident enough to try something different: running an autonomous development branch. Like handing the keys to a teenager — but with a dashcam, a speed limiter, and a curfew.
The experiment: with all the guardrails in place, could an agent safely handle a fairly complex cleanup task end-to-end?
This was honestly too weird for me. To avoid feeling useless I started drafting this blog post.
The Setup
The rules are boring by design:
- Spec first: objectives, non-goals, touchpoints, tests, rollback plan
- Agent executes the spec, not vibes
- Checks at every boundary: build → type-check → lint → unit → integration → smoke
- Observability before merge
- Feature flags + reversible rollout
- No production writes, no secrets, no direct main merges
Control becomes structural instead of personal. The system enforces correctness, not me hovering over every keystroke.
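To make "checks at every boundary" concrete, here's a minimal TypeScript (Node) sketch of a gate runner. The npm script names and the `runGates` helper are illustrative, not the branch's actual tooling; the point is that the run halts at the first failing gate.

```typescript
import { execSync } from "node:child_process";

// Illustrative gate sequence; each entry maps to an npm script in this sketch.
const gates = ["build", "type-check", "lint", "test:unit", "test:integration", "test:smoke"];

function runGates(): void {
  for (const gate of gates) {
    try {
      // Stop the whole run at the first failing boundary.
      execSync(`npm run ${gate}`, { stdio: "inherit" });
    } catch {
      console.error(`Gate failed: ${gate}. Stopping before the next boundary.`);
      process.exit(1);
    }
  }
  console.log("All gates passed.");
}

runGates();
```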
What Shipped (feat/dashboard-hardening)
This branch hardened the analytics loop:
- End-to-end time filters: 1h / 24h / 7d / 21d
- SLA-based pipeline states: received → queued → analyzing → summarized → delivered
- Auto-clear "Analyzing" banners after 15 minutes with recovery guidance (see the sketch after this list)
- Short TTLs + event-based cache invalidation for fresher metrics
- Full tracing from webhook → queue → analysis → dashboard
- Rolled out behind flags with error tracking via Sentry; clean-main policy enforced
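To illustrate the banner behavior, here's a minimal TypeScript sketch of auto-clearing a stuck "Analyzing" state after 15 minutes. The state names mirror the pipeline above; the `PipelineEvent` shape and `bannerFor` helper are hypothetical, not the code the agent generated.

```typescript
type PipelineState =
  | "received"
  | "queued"
  | "analyzing"
  | "summarized"
  | "delivered";

interface PipelineEvent {
  state: PipelineState;
  enteredAt: Date; // when the pipeline entered this state
}

const ANALYZING_SLA_MS = 15 * 60 * 1000; // 15-minute SLA for the "analyzing" state

// Decide what the dashboard banner should show for a given event.
function bannerFor(event: PipelineEvent, now: Date = new Date()): string | null {
  if (event.state !== "analyzing") return null;

  const elapsed = now.getTime() - event.enteredAt.getTime();
  if (elapsed <= ANALYZING_SLA_MS) {
    return "Analyzing…";
  }
  // Past the SLA: clear the spinner and show recovery guidance instead.
  return "Analysis is taking longer than expected. Data will refresh automatically; check the pipeline status page if this persists.";
}
```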
Run stats
| Metric | Result |
|:--|:--|
| Commits | 12 |
| Duration | ≈ 4 hours 6 minutes |
| Diff | 35 files changed (+6,503 / −46) |
| Status | Merged (clean after 1 TypeScript fix) |
| Quality gates | ✅ Tests passed · ✅ Sentry clean · ✅ Reversible rollout |
Good engineering with tight rails, executed faster than a human loop could manage.
The Workflow, Codified
```yaml
workflow:
  phases: [dashboard-hardening, tier1-data-protection, permissions-phase-3-5]
  gates:
    - build
    - type-check
    - lint
    - unit
    - integration
    - smoke
    - staging-sentry-24h-clean
  database:
    environment: develop
    migrations: forward-only
    cutovers: feature-flags + dual-write (when needed)
  policies:
    - no-prod-db-writes
    - no-secret-changes
    - no-merge-to-main
    - observability: sentry spans + structured logs
```

The agent is constrained, and that's what makes it reliable.
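The "feature-flags + dual-write" cutover is worth a quick illustration. Below is a hedged TypeScript sketch of the pattern: writes go to both stores while a flag decides which one serves reads, so flipping the flag off reverts the cutover instantly. The store and flag names are stand-ins, not the project's modules.

```typescript
interface MetricsStore {
  write(key: string, value: number): Promise<void>;
  read(key: string): Promise<number | null>;
}

// In-memory stand-ins for the real stores in this sketch.
const makeStore = (): MetricsStore => {
  const data = new Map<string, number>();
  return {
    async write(key, value) { data.set(key, value); },
    async read(key) { return data.get(key) ?? null; },
  };
};

const legacyStore = makeStore();
const newStore = makeStore();

// Hypothetical flag check; in practice this would come from the flag service.
const isEnabled = (flag: string): boolean => flag === "dashboard-new-metrics-store";

// Dual-write: both stores stay in sync for the whole rollout window.
async function writeMetric(key: string, value: number): Promise<void> {
  await Promise.all([legacyStore.write(key, value), newStore.write(key, value)]);
}

// Flag-gated read: turning the flag off instantly reverts to the legacy path.
async function readMetric(key: string): Promise<number | null> {
  return isEnabled("dashboard-new-metrics-store")
    ? newStore.read(key)
    : legacyStore.read(key);
}
```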
Inputs Define Outcomes
The workflow only works because clarity comes first.
Before a single line is generated, the agent receives a structured spec that defines:
- Scope: what's in, what's out
- Data boundaries: what can be touched, logged, or written
- Expected tests: what "done" means in measurable form
- Rollback plan: exactly how to reverse the change if any gate fails
The spec is a contract. Once execution begins, every decision traces back to that input. Process engineering replaces prompt engineering.
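To give that contract a shape, here's a minimal TypeScript sketch of what such a spec could look like. The field names simply mirror the sections listed above; they are not the exact schema used on this branch.

```typescript
// Illustrative spec schema mirroring the sections above.
interface FeatureSpec {
  objectives: string[];       // what the change must achieve
  nonGoals: string[];         // explicitly out of scope
  touchpoints: string[];      // files, services, and tables the agent may modify
  dataBoundaries: {
    readable: string[];       // data the agent may read or log
    writable: string[];       // data the agent may write (never production)
  };
  expectedTests: string[];    // measurable definition of "done"
  rollbackPlan: string;       // exact steps to reverse the change if a gate fails
}

const dashboardHardeningSpec: FeatureSpec = {
  objectives: ["End-to-end time filters", "SLA-based pipeline states"],
  nonGoals: ["Structural refactor of the analytics pipeline"],
  touchpoints: ["dashboard", "webhook handler", "analysis queue"],
  dataBoundaries: {
    readable: ["analytics events", "pipeline status"],
    writable: ["develop database only"],
  },
  expectedTests: ["unit", "integration", "smoke"],
  rollbackPlan: "Disable the feature flag and revert the branch.",
};
```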
Where the Decisions Came From
The prompt didn't decide any of this.
When the branch hit an ambiguous trade-off — for example, whether to refactor now or later — it fell back to a predefined governance layer: a twelve-person synthetic leadership panel.
Each persona represents a distinct perspective: CTO for technical depth, CRO for revenue impact, SRE for reliability, plus Product, Security, Compliance, Ops, Data, Finance, Design, QA, Customer Success, and Strategy. They vote through weighted heuristics embedded in the workflow. (Read more about how this works)
The agent doesn't "think" — it consults that composite leadership process to resolve conflicts exactly as a real cross-functional team would. That structure turned ambiguous choices into governed outcomes.
The result was consistency.
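The "weighted heuristics" can be sketched as a simple scoring function. This is my illustration of the mechanism, with hypothetical personas, weights, and scores; the real workflow encodes its own.

```typescript
// Each persona votes on an option and carries a weight in the final decision.
interface PersonaVote {
  persona: string;   // e.g. "CTO", "SRE", "Security"
  weight: number;    // relative influence in the composite decision
  score: number;     // how strongly this persona favors the option (0 to 1)
}

// Weighted average across the panel; the option with the highest score wins.
function panelScore(votes: PersonaVote[]): number {
  const totalWeight = votes.reduce((sum, v) => sum + v.weight, 0);
  return votes.reduce((sum, v) => sum + v.weight * v.score, 0) / totalWeight;
}

// Example: "refactor now" vs. "harden first, refactor later".
const refactorNow = panelScore([
  { persona: "CTO", weight: 3, score: 0.7 },
  { persona: "SRE", weight: 2, score: 0.3 },
  { persona: "CRO", weight: 2, score: 0.2 },
]);
const hardenFirst = panelScore([
  { persona: "CTO", weight: 3, score: 0.6 },
  { persona: "SRE", weight: 2, score: 0.9 },
  { persona: "CRO", weight: 2, score: 0.8 },
]);

console.log(hardenFirst > refactorNow ? "Harden first" : "Refactor now");
```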
Here's how the system routes inputs through governance to measurable outcomes:
```mermaid
flowchart TD
    subgraph Input["Human Input / Specification"]
        A1["Feature Spec<br>(objectives, non-goals, tests, rollback)"]
        A2["Prompt Execution Request"]
    end
    subgraph Governance["Synthetic Leadership Process"]
        B1["12-Persona Panel"]
        B2["Weighted Decision Heuristics"]
        B3["Fallback Rules & Trade-off Framework"]
        A2 --> B1
        B1 --> B2
        B2 --> B3
    end
    subgraph Agent["Autonomous Dev Branch"]
        C1["Plan Parsing & Context Setup"]
        C2["Code Generation<br>+ Inline Testing"]
        C3["Instrumentation<br>(Sentry spans, structured logs)"]
        B3 --> C1
        C1 --> C2
        C2 --> C3
    end
    subgraph CI["CI / CD Guardrails"]
        D1["Build / Type-check / Lint"]
        D2["Unit / Integration / Smoke Tests"]
        D3["Staging Validation<br>(24-hour Sentry clean)"]
        D4["Feature Flags & Reversible Rollout"]
        C3 --> D1 --> D2 --> D3 --> D4
    end
    subgraph Outcome["System Outcomes"]
        E1["Merged PR<br>(12 commits, 6,500 LOC, 4h 6m)"]
        E2["Traceable Decisions<br>via Synthetic Panel Logs"]
        D4 --> E1
        B3 --> E2
    end

    style Governance fill:#f5f7ff,stroke:#3366ff,stroke-width:1.5px
    style Agent fill:#f8f9fa,stroke:#999,stroke-width:1px
    style CI fill:#f5fff7,stroke:#3ba55c,stroke-width:1.5px
    style Input fill:#fffdf5,stroke:#d0a500,stroke-width:1px
    style Outcome fill:#f9f9ff,stroke:#666,stroke-width:1px
```

Key Decisions That Built Trust
- Correctness before refactor: Option A+ (hardening + security gates); defer structural changes
- DB-side filtering: single `from/to` contract; honest data
- SLA banners: clear expectations over false real-time
- Observability first: spans + structured logs before polish
- Idempotency keyed by Gmail `messageId`: zero churn (see the sketch after this list)
- Unified query contract: fewer bugs, simpler tests
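To make the idempotency decision concrete, here's a minimal TypeScript sketch of `messageId`-keyed deduplication. The `InboundEmail` shape and `ingestEmail` handler are illustrative names, assuming a simple persistent set of processed IDs.

```typescript
// Illustrative payload shape for an inbound Gmail webhook event.
interface InboundEmail {
  messageId: string; // Gmail's stable identifier for the message
  subject: string;
  receivedAt: Date;
}

// Stand-in for a persistent store of already-processed message IDs.
const processed = new Set<string>();

// Idempotent handler: re-deliveries of the same messageId are no-ops,
// so retries and duplicate webhooks cause zero churn downstream.
async function ingestEmail(email: InboundEmail): Promise<"processed" | "skipped"> {
  if (processed.has(email.messageId)) {
    return "skipped";
  }
  processed.add(email.messageId);
  // ... enqueue for analysis, emit a span, update pipeline state ...
  return "processed";
}
```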
Those weren't random design choices. They came from the synthetic leadership process.
Confidence was the goal. Speed followed.
What Made It Work
Credibility comes from constraints.
- Real specs instead of prompts
- CI gates instead of vibes
- Governance models instead of gut feeling
- Feature flags with reversibility
- Measurable outcomes with audit paths (see the sketch after this list)
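As one last illustration, here's a generic TypeScript sketch of the span-plus-structured-log pattern behind those audit paths. It uses a homemade `withSpan` helper rather than any specific SDK call, so read it as the shape of the idea, not the project's actual instrumentation.

```typescript
// Minimal structured log entry: machine-parseable, so it can be audited later.
interface LogEntry {
  timestamp: string;
  span: string;
  event: string;
  [field: string]: unknown;
}

function log(entry: LogEntry): void {
  console.log(JSON.stringify(entry)); // one JSON object per line
}

// Wrap a unit of work in a named span and record start, success, and failure.
async function withSpan<T>(span: string, work: () => Promise<T>): Promise<T> {
  const startedAt = Date.now();
  log({ timestamp: new Date().toISOString(), span, event: "start" });
  try {
    const result = await work();
    log({ timestamp: new Date().toISOString(), span, event: "ok", durationMs: Date.now() - startedAt });
    return result;
  } catch (err) {
    log({ timestamp: new Date().toISOString(), span, event: "error", durationMs: Date.now() - startedAt, message: String(err) });
    throw err;
  }
}

// Usage: every hop from webhook to dashboard gets its own traced span.
withSpan("webhook.receive", async () => {
  /* ...parse and enqueue the event... */
});
```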
The agent moved fast because the system made speed safe.
We'll publish hard numbers after the 24-hour staging run: runtime, LOC by area, tests added, coverage delta, and Sentry trace stats.
Takeaway
Humans set direction and define correctness. Agents execute under constraint.
That's how autonomy compounds — with process, not hope.