If it needs heroics, it’s broken

You can tell a lot about a company by how it talks about its on-call engineers.

Some companies celebrate the hero who stayed up all night fixing production. They write blog posts about it. They give the person a bonus.

This is a red flag.

Heroics are a symptom

When a system regularly requires heroics to function, you don’t have a team problem. You have a design problem.

Heroes step in when:

The system doesn’t handle edge cases
Alerts fire too late or not at all
Recovery requires manual intervention
Knowledge lives in one person’s head

Every time someone saves the day, the system is telling you something. It’s telling you where it’s fragile. The question is whether you listen.

The operational trap

The problem with heroes is that they work. At least in the short term.

When you have someone who can always fix things, there’s less pressure to fix the underlying issues. The hero becomes load-bearing. The system depends on their continued presence and continued willingness to absorb its failures.

This is unsustainable. People leave. People burn out. People take vacation. A system that only works because of a specific person is not a system.

What good looks like

The best operations teams I’ve seen don’t celebrate heroes. They investigate heroic moments.

When someone has to step in with manual intervention, the question isn’t “who do we thank?” It’s “what do we fix so this doesn’t happen again?”

Good systems:

Fail safely and predictably
Alert early, with enough context to act
Have automated recovery for known failure modes
Spread knowledge across the team
Treat on-call as boring

Boring is the goal. Boring means the system is working as designed.
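
As one illustration of "alert early, with enough context to act," here is a minimal sketch in Python. It assumes a hypothetical disk-usage metric sampled over time and alerts on the projected time until the disk is full instead of on a fixed usage threshold. The names (Sample, hours_until_full, early_alert) and the 24-hour lead time are assumptions made for the example, not any particular monitoring tool's API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sample:
    hour: float      # hours since the first observation
    used_pct: float  # disk usage as a percentage, 0-100


def hours_until_full(samples: List[Sample]) -> Optional[float]:
    """Estimate hours until usage hits 100%, using a least-squares growth rate.

    Returns None when there is too little data or usage is not growing.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(s.hour for s in samples) / n
    mean_u = sum(s.used_pct for s in samples) / n
    var_t = sum((s.hour - mean_t) ** 2 for s in samples)
    if var_t == 0:
        return None
    # Slope of the fitted line, in percentage points per hour.
    slope = sum((s.hour - mean_t) * (s.used_pct - mean_u) for s in samples) / var_t
    if slope <= 0:
        return None
    # Project forward from the most recent sample at the fitted growth rate.
    return (100.0 - samples[-1].used_pct) / slope


def early_alert(host: str, samples: List[Sample], lead_hours: float = 24.0) -> Optional[str]:
    """Return an alert message if the disk is projected to fill within lead_hours."""
    remaining = hours_until_full(samples)
    if remaining is None or remaining > lead_hours:
        return None
    # Put the context in the alert so the responder does not have to dig for it.
    return (
        f"{host}: disk at {samples[-1].used_pct:.0f}%, "
        f"projected full in about {remaining:.0f}h at the current growth rate."
    )
```

The point of the design is that the alert fires while there is still a day to respond calmly, and it says what is wrong and how long the responder has. A real system would need a better forecast than a straight line, but the shape is the same.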

The cultural problem

The hero culture is seductive because it feels good. Being the person who saves the day is satisfying. Getting recognized for it is validating.

But optimizing for individual recognition often means under-investing in systemic improvements. The incentives point in the wrong direction.

The best teams I’ve worked with flip this. They recognize people for making systems more reliable, not for fighting fires. They celebrate when on-call is quiet. They treat a heroic intervention as a bug to be fixed, not a story to be told.

Designing out heroics

If you want to reduce heroics, you have to design them out deliberately:

Build runbooks before you need them
Automate manual recovery steps
Set alerts that fire early enough to act calmly
Distribute knowledge aggressively
Do regular chaos testing to find the gaps
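
To make "build runbooks before you need them" and "automate manual recovery steps" concrete, here is a minimal sketch in Python of a runbook expressed as code. It assumes each known failure mode maps to an ordered list of recovery steps, and that anything unknown or failing is escalated to a person. The function and parameter names (recover, runbooks, page_human) are hypothetical, chosen only for this example.

```python
from typing import Callable, Dict, List

# A recovery step is any callable that returns True on success.
Step = Callable[[], bool]


def recover(
    failure_mode: str,
    runbooks: Dict[str, List[Step]],
    page_human: Callable[[str], None],
) -> bool:
    """Run the automated runbook for a known failure mode.

    Escalates to a human when there is no runbook or a step fails.
    Returns True only if every step succeeded.
    """
    steps = runbooks.get(failure_mode)
    if steps is None:
        # An unknown failure mode is a gap in the runbooks, so say so.
        page_human(f"No runbook for '{failure_mode}'; write one after the incident.")
        return False
    for number, step in enumerate(steps, start=1):
        if not step():
            name = getattr(step, "__name__", repr(step))
            page_human(f"Runbook '{failure_mode}' failed at step {number} ({name}).")
            return False
    return True
```

In this sketch a person is paged only when automation runs out, and the page says exactly where it stopped. Each such page is also a record of a gap: a missing runbook or a step that does not work.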

The goal isn’t to eliminate humans from operations. It’s to make their interventions deliberate and rare, not desperate and frequent.

If your system needs a hero, your system is telling you something. Listen.