Vibe-Coded Software

TL;DR: AI-generated code demos beautifully and collapses under pressure. I took my test pass rate from 79% → 93.5% and found 11 production bugs hiding behind "working" features. What looks 80% done is actually 20% complete.


I'm building an AI assistant that helps executives manage email and tasks by learning their priorities—not just automating everything.

Three weeks in, I had 619 tests and 79% were passing. The interface looked polished. Features worked in demos. It felt 80% done.

I was wrong. It was 20% done.

What vibe coding actually is

"Vibe coding" is when AI generates code that looks and feels production-ready but hasn't been hardened against reality. It runs. It demos well. It collapses under load.

I call it Potemkin code—impressive facade, hollow behind.

How I learned this

I had 619 tests. 79% were passing. Most would call that "pretty good."

I decided to fix the failing tests. Not to push the pass rate to 100%, just to see what they were actually telling me.

The tests weren't flaky. They were correct. My code was wrong.

Over three days, the pass rate went 79% → 88% → 93.5%. Each wave of fixes revealed real production bugs.

11 production bugs total. All hiding behind a 79% pass rate that felt "good enough."

The lesson: what looks 80% done is actually 20% complete. The tests were showing me exactly where the vibes ended and reality began.

How to spot it

Vibe-coded software has telltale patterns:

Hollow architecture dressed in nice UI. The interface is polished. The buttons work. But the underlying system can't handle edge cases, errors, or scale.

Code that "runs" but doesn't handle reality. It works for the happy path. Network failures? Malformed input? Concurrent users? "We'll add that later."

Test suites that pass... mostly. 70-80% pass rates with "flaky tests" that get ignored. Those aren't flaky tests. They're X-rays showing where your code is broken.

Massive effort to harden for production. The last 20% of the work takes 80% of the time. That's when you discover the foundation was vibes, not engineering.

Why it happens

Because demos feel like shipping.

You show the feature. People are impressed. Dopamine hits. You move on.

But demos aren't systems. Demos prove the idea. Systems survive contact with reality.

The unglamorous work—error handling, monitoring, load testing, edge cases—doesn't demo well. So it gets skipped or deferred.

And failing tests? Those get dismissed as "flaky" because fixing them means admitting the code isn't as done as it looks.

The cost

Vibe-coded software creates technical debt that compounds:

User trust erodes fast. One data loss incident. One unexplained failure. Trust takes months to build and seconds to lose.

Operational burden explodes. Every incident requires manual intervention. On-call becomes a nightmare. Production becomes a minefield you tiptoe through instead of confidently operating.

Future features slow down. The foundation can't support new work. Every addition risks collapse. Velocity drops as the codebase ossifies around its hidden bugs.

The "mostly working" trap is expensive. Those 11 bugs I found? Each one could have caused a production outage. Finding them before users did saved weeks of debugging and incident response.

The alternative

Shipping reliable software means resisting the dopamine rush of demo days and finishing the unglamorous 80% that actually matters.

It means:

Treating test failures as bug reports, not noise. If a test fails consistently, your code is wrong, not the test. Don't skip it. Fix it and learn what it's telling you; there's a short sketch of what that looks like right after this list.

Testing under load before calling it done. Not "we'll load test later." Now. Before launch.

Handling errors gracefully instead of assuming success. Network failures, malformed input, race conditions—these aren't edge cases. They're Tuesday.

Instrumenting for observability so you can debug production. Monitoring isn't phase 2. It ships with the feature or the feature doesn't ship.

Writing runbooks for when (not if) things break. If you can't explain how to debug it, you're not done building it.
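
To make the first point concrete, here's the shape of a "flaky" test that's really a bug report. This is a minimal, illustrative sketch, not code from my repo: the scoring function and the missing-subject scenario are invented for the example.

```python
# test_priority.py -- run with pytest. Illustrative only: the scoring function
# and the missing-subject scenario are invented for this sketch.


def priority_score(subject: str | None, body: str | None) -> int:
    # The vibe-coded version assumed subject was always a string:
    #   return len(subject) + len(body)
    # The fix treats missing fields as empty instead of assuming success.
    return len(subject or "") + len(body or "")


# The tempting move is one line: @pytest.mark.skip(reason="flaky").
# Keeping the test and fixing the code is what turns it back into a bug report.
def test_score_handles_missing_subject():
    assert priority_score(None, "Quick question about the board deck") > 0
```

The skip decorator is the trap: one line that turns a bug report back into a vibe.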

This work doesn't generate Twitter engagement. It doesn't impress at demo day.

But it's the difference between software that ships and software that endures.

What I'm building in instead

I'm applying this to the AI assistant:

Error states first, not last. Every Gmail sync includes: network timeout handling, OAuth refresh logic, rate limit backoff. Not "we'll add that later."
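
Here's roughly the shape that takes. A sketch, not my production code: the endpoint constant, the refresh_access_token callback, and the retry limits are stand-ins for whatever your OAuth layer and config actually provide.

```python
# Sketch of a hardened sync call using requests. All names are illustrative.
import time
import requests

GMAIL_MESSAGES_URL = "https://gmail.googleapis.com/gmail/v1/users/me/messages"


def fetch_messages(access_token, refresh_access_token, max_retries=5):
    delay = 1.0  # seconds; doubled on every backoff
    for _ in range(max_retries):
        try:
            resp = requests.get(
                GMAIL_MESSAGES_URL,
                headers={"Authorization": f"Bearer {access_token}"},
                timeout=10,  # network timeout handling: never hang forever
            )
        except requests.exceptions.RequestException:
            time.sleep(delay)  # transient network failure: back off and retry
            delay *= 2
            continue

        if resp.status_code == 401:  # expired token: refresh and retry
            access_token = refresh_access_token()
            continue
        if resp.status_code == 429:  # rate limited: exponential backoff
            time.sleep(delay)
            delay *= 2
            continue

        resp.raise_for_status()  # anything else unexpected should fail loudly
        return resp.json()

    raise RuntimeError("Gmail sync failed after retries; surface it, don't drop mail silently")
```

The exact structure matters less than the fact that timeouts, token refresh, and backoff exist before the feature is called done.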

Monitoring ships with features. Email analysis includes: processing time tracking, model confidence logging, failure rate alerts. If I can't see it breaking, it's not done.
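
Concretely, something like this sketch, with illustrative names; in a real deployment the failure counter would feed an alerting system rather than an in-memory dict.

```python
# Illustrative instrumentation: analyze_email() and its confidence field are
# stand-ins for whatever the real model layer returns.
import logging
import time
from functools import wraps

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("email_analysis")
FAILURES = {"count": 0}  # placeholder for a real metric that drives alerts


def instrumented(fn):
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        try:
            result = fn(*args, **kwargs)
        except Exception:
            FAILURES["count"] += 1
            log.exception("analysis failed (total failures: %d)", FAILURES["count"])
            raise  # fail loudly; never swallow the error
        elapsed = time.monotonic() - start
        # processing time tracking + model confidence logging in one place
        log.info("analysis took %.2fs, confidence=%.2f", elapsed, result.get("confidence", 0.0))
        return result
    return wrapper


@instrumented
def analyze_email(text: str) -> dict:
    # Placeholder for the real model call.
    return {"priority": "high", "confidence": 0.92}
```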

Tests tell the truth. 93.5% pass rate, but those 40 failing tests? They're showing me where personalization logic breaks with edge case data. I'm fixing them, not ignoring them.

The standard: it launches when executives can rely on it, not when it demos impressively.

What I'm watching for

I'm three weeks into this build. The vibe-coding trap is real—AI can generate hundreds of lines per day, but without discipline, you're just generating impressive garbage faster.

The metric I'm tracking isn't lines of code or features shipped. It's "can this run in production for a week without me babysitting it?"

Not there yet. But that's the bar.


Related: How to Actually Use AI Coding Tools