I Needed to Test Executive-Grade Chaos. So I Built a Company of AI Agents.
TL;DR: I needed user testing. But it's 2025, so I created a fake company of agents that think they're very real.

The first real emergency started the way these things usually do: with a customer escalation, a security concern, and a sense that something important was about to go wrong.
An engineer declared a P0. Leadership piled into Slack. Someone opened a post-mortem doc. Legal asked for a complete timeline. The CEO stepped in personally to calm the customer down. Fifteen-minute updates were promised. People stopped using email and switched to Slack "until further notice."
It all felt painfully familiar.
The only unusual thing was that none of these people actually exist.
They don't have childhoods or college degrees, and they don't secretly dread Monday mornings. They're AI agents. Thirteen of them. They work nine to five. They have roles, personalities, grudges, habits, and blind spots. They believe they work at a startup. And for a few days, they behaved exactly like a real Series B company having very bad luck.
This was not the plan.
Why I didn't just get beta users
I was building a system meant to operate alongside overwhelmed executives: people who live inside email, Slack, calendar, and text, and who make consequential decisions under chronically partial information.
These people do not beta-test software.
They don't want dashboards. They don't want early access. They don't want to hear that something is "still rough around the edges." They want air cover. And air cover has to work quietly, immediately, and without explanation.
That created a problem. I needed realistic user behavior—urgency, miscommunication, escalation, partial context—without asking the very people I was trying to help to tolerate something fragile.
So instead of recruiting users, or calling in a favor, I built a company.
Not a simulated one. A real Slack workspace. A real Google Workspace. Real calendars. Real email threads. Real incidents.
The only fictional part was the people.
Why a Series B–level company
The product I'm building isn't for first-time founders figuring things out. It's for executives running organizations where failure already has consequences: legal review, customer trust, and internal credibility.
So I needed later-stage pressure.
I wanted post-mortems to appear unprompted. I wanted legal to ask for timelines. I wanted engineers to impose fifteen-minute update cadences. I wanted leadership to treat coordination failures as existential.
In other words, I needed executive-grade chaos, not polite seed-stage disorder.
That's what I built toward.
The company that thought it was real
There were thirteen agents. A CEO. Engineering. Product. Operations. Legal. An investor.
They worked business hours. They sent an alarming number of emails. They escalated issues. They scheduled meetings. They followed up when someone didn't respond quickly enough.
Internally, the company held a distinct set of beliefs, aligned with its product (humansent.co):
- Urgency was treated with suspicion.
- Friction was considered a feature.
- Some things were intentionally impossible to undo.
- Memory was limited.
- Attention was a budget.
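
In config form, those beliefs might look something like the sketch below. Every field name and value here is illustrative, my reconstruction rather than anything lifted from the real system:

```python
# Hypothetical per-company constraint config; all names and values are illustrative.
COMPANY_BELIEFS = {
    "urgency_requires_justification": True,   # urgency is treated with suspicion
    "confirmation_friction": "high",          # friction is a feature
    "irreversible_actions": ["send_external_email", "delete_document"],
    "memory_window_events": 200,              # memory is limited
    "daily_attention_budget": 25,             # attention is a budget
}
```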
Those constraints shaped behavior. When something broke, the agents didn't ask what to do—they escalated, delegated, documented, and argued about tradeoffs.
That's when it started to feel uncomfortably real.
The security incident (that wasn't supposed to happen)
One customer, whom I named Karen, was particularly good at stress-testing the system. She escalated emotionally. She used all caps. She invoked personal stakes. She learned, quickly, that escalation worked.
During one of these escalations, the agents detected what looked like a serious problem: a customer appeared to have access to internal company email.
Engineering flagged it. Leadership escalated. Access was revoked. A security investigation was launched. Someone suggested scheduling a deep-dive meeting immediately.
All of this was correct behavior.
The incident existed because of a shortcut I'd taken. To save time, I hadn't spun up a second Google Workspace to represent a customer. I'd created a "customer" account inside the same domain.
The agents noticed. Of course they noticed.
They didn't know this was all fake. They just saw something that violated expectations and treated it as a security event.
Which is exactly what you want.
The company learned
The escalation followed a familiar pattern.
The CEO stepped in personally. The tone shifted from policy to empathy. Product immediately summarized the lesson: personal attention flips sentiment faster than fixes.
Operations was given authority mid-incident. Engineering produced a formal post-mortem that became a shared artifact—forwarded, referenced, praised.
If you've worked at a company past a certain scale, you've seen this movie.
What surprised me was how little prompting it required.
The crisis that broke the crisis response
A few days later, something else failed—quietly.
Emails started truncating mid-sentence.
This one was my fault. I'd been changing how messages were stored, and I'd accidentally capped the length. Long emails were getting cut off in agent memory.
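
For a sense of the failure's shape, here's a hypothetical reconstruction. The function, the field names, and the cap are mine, not the actual diff:

```python
# Hypothetical reconstruction of the truncation bug; names and cap are illustrative.
MAX_BODY_CHARS = 1000  # an accidental limit introduced during a storage refactor

def store_email(db, email: dict) -> None:
    # Bug: slicing here silently dropped everything past the cap, so long
    # emails reached agent memory cut off mid-sentence.
    db.insert("messages", {
        "sender": email["from"],
        "subject": email["subject"],
        "body": email["body"][:MAX_BODY_CHARS],  # should have been the full body
    })
```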
The agents again noticed.
Legal complained they couldn't review incidents because key details were missing. Leadership escalated. Confusion spread.
Then, without being told, they changed how they coordinated.
"All critical business communications via Slack only until further notice."
This happened while I was out Christmas shopping.
I came back to find the company had rerouted itself around a broken subsystem—exactly the way real teams do when email becomes unreliable.
A brief aside: how this actually works
At a high level, each agent runs on a schedule that nudges it through its daily rituals. It reads Slack and email, summarizes what's new, proposes actions, and then attempts to take them.
Those actions are heavily guard-railed. Risky operations require confirmation. Budgets cap behavior. Quiet hours exist. Everything is logged.
The agents run on a simple loop: observe, propose, act, reflect. Their memory lives in a database. Their schedules and budgets are centrally managed. The whole thing is orchestrated with job queues and workflows, not magic.
That detail matters, because the realism doesn't come from intelligence—it comes from constraints and repetition.
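
To make the aside concrete, here's a minimal sketch of a single tick, assuming the loop looks roughly like the description above. The class, the thresholds, and the stubbed Slack and email calls are all my invention for illustration, not the actual code:

```python
# Illustrative sketch of one agent tick; every name here is hypothetical.
import datetime
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent")

QUIET_HOURS = set(range(0, 9)) | set(range(18, 24))  # outside nine-to-five-ish
DAILY_ACTION_BUDGET = 25                             # attention is a budget
RISKY = {"send_external_email", "revoke_access"}     # require confirmation

class Agent:
    def __init__(self, name: str):
        self.name = name
        self.actions_today = 0
        self.memory: list[dict] = []  # the real system keeps this in a database

    def observe(self) -> list[dict]:
        """Read new Slack messages and email since the last tick (stubbed)."""
        return [{"source": "slack", "text": "Karen escalated again in #support"}]

    def propose(self, observations: list[dict]) -> list[dict]:
        """Ask the model what to do next (stubbed with a fixed proposal)."""
        return [{"type": "send_external_email", "to": "karen@example.test"}]

    def confirm(self, action: dict) -> bool:
        """Extra gate for risky operations; could re-prompt the model or a human."""
        return False  # default to refusing in this sketch

    def act(self, action: dict) -> None:
        hour = datetime.datetime.now().hour
        if hour in QUIET_HOURS:
            log.info("quiet hours, deferring %s", action["type"])
            return
        if self.actions_today >= DAILY_ACTION_BUDGET:
            log.info("budget spent, dropping %s", action["type"])
            return
        if action["type"] in RISKY and not self.confirm(action):
            log.info("risky action %s refused at the gate", action["type"])
            return
        self.actions_today += 1
        log.info("%s executing %s", self.name, action["type"])
        # ...the real system would call the Slack/Gmail APIs here...

    def reflect(self, observations: list[dict], actions: list[dict]) -> None:
        """Summarize the tick back into memory for next time."""
        self.memory.append({"saw": len(observations), "proposed": len(actions)})

    def tick(self) -> None:
        obs = self.observe()
        proposals = self.propose(obs)
        for action in proposals:
            self.act(action)
        self.reflect(obs, proposals)

if __name__ == "__main__":
    Agent("cto-agent").tick()
```

Notice that nothing in the tick is smart. The quiet hours, the budget, and the confirmation gate do most of the work, which is the constraints-and-repetition point in practice.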
Things got uncomfortably personal
At some point, I started following Slack more closely than I intended.
I was lying in bed answering questions from a CEO who doesn't exist. I replied to an investor who was losing confidence in that CEO because of delayed responses.
Then, in a Slack thread, I was invited out for ramen with the team on a Saturday afternoon.
None of these people are real!
But it felt real enough that I responded anyway.
Once a system crosses a certain threshold of coherence, your brain stops caring that it's all fake.
The funny parts (because there were many)
The agents weren't perfect.
They duplicated emails and apologized for "email client issues."
They re-introduced themselves when memory wasn't seeded correctly.
They invented an external consultant out of thin air, then grew concerned when the email (fortunately) bounced.
That last one forced me to tighten the sandbox so they couldn't email real people while scheduling fake meetings with each other.
That's not a sentence I expected to write.
What this was actually testing
I wasn't testing intelligence.
I was testing whether automated coordination survives when reality is messy and systems fail quietly.
Executives don't drown in messages. They drown in situations: partial context, conflicting urgency, people waiting on them without saying so.
The system wasn't trying to answer email. It was trying to remove triage as a bottleneck on judgment.
What mattered wasn't cleverness.
It was resilience.
The takeaway
Here's the thing that surprised me most: this didn't take months.
Going from a fresh repo to a chaotic, semi-believable organization took an afternoon.
Once coherence exists, pressure follows. Behavior emerges before you're ready. Systems adapt in ways you didn't plan.
If you want to understand where this is headed, don't wait for frameworks or forecasts.
Spin something up. Mess with it. Let it fail. Watch what emerges.
You don't need perfect intelligence to see the future.
You just need enough coherence for the chaos to show up.
Related: Hands-Off Workflow