The Problem With Heroes In Software Development
Imagine your web application goes down in the middle of the night. It’s 2 AM, but your business is global. You have users in every time zone. They’re angry. They're unable to purchase things on your website or are canceling their subscriptions. Money is being lost every minute your web application is down.
Suddenly, one of your developers is on the case! This developer shows a little anxiety and curses a lot (it is 2 AM after all), but eventually the problems are resolved. The application is running again and money is flowing into your business. Despite this kind of situation happening from time to time, you are comforted knowing that you always have this developer to save the day. This developer is your hero.
No one in this situation should feel comfort though. It is an extremely risky position to be in for a company. It negatively impacts everyone who works at that company. And despite the immense feeling of importance, it hurts the hero developer as well.
Let’s start with the problems for the company. If a live application has known stability issues, they should be addressed. Crisis management almost never addresses the underlying issues. Crisis management is about making it through another day. The “fixes” are temporary. The real problem is still there, waiting to be fixed.
Being comforted by having a hero on call makes those real problems seem less urgent than they really are. There is a strong temptation to put off fixing those problems in favor of working on new revenue-generating activities.
But crisis management is not the same as crisis prevention.
An application that frequently goes down earns a bad reputation among users. This makes retention difficult and will eventually make it harder to gain new users. There is also the possibility that the hero is unavailable. There are many ways this can happen. What if they get sick? Or a relative gets sick? Or they go on vacation? Or they just decide they’re tired and quit?
In all those cases, your application is now down for a much longer period (hours instead of minutes or days instead of hours). You could argue that someone else on the team can take on the mantle of hero. That leads us to the problems for everyone else on the team.
What would you do if there was a crisis at your job and you couldn’t do anything about it?
Those in this situation can get a feeling of helplessness. This can damage a person’s confidence, which often affects job performance. People can ask the hero to teach them to solve problems as well, but why would a company prioritize that education when it won’t prioritize fixing the real problems?
More importantly, people can become reliant on the hero as well. They decide that their silo is their own work and the hero is the one who handles crises. They don’t need to learn how to help out.
This is a dangerous attitude to have for developers. The best way to prevent crises, or at least make them more manageable, is to build software that makes it easy to prevent or manage crises. How can a developer know how to properly account for a crisis if they haven’t been in one?
They can’t.
Even if the hero explains the technical details after a crisis resolves, important details will be missed. The most important one is the emotional state the hero was in after waking up at 2 AM in response to the crisis.
Developers make dozens of tiny decisions every day in their code. Many of those decisions can save precious minutes, if not hours, in a crisis. Someone who just hears about something that will help in a crisis will not truly understand its importance as much as someone who experiences the result of a bad decision while in a crisis.
For example: many developers who have not experienced a crisis will tend to write poor error messages. Their code is littered with messages like “Error occurred.” Where did the error occur? Who did it occur for? Even something as small as “Error occurred for User 123 at url /home” makes a huge difference. But someone who has never had to fix critical issues will not understand how big a difference it makes. They would have never felt the emotional impact of these seemingly small changes in their code.
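As a rough sketch of that difference in Python: the function names, the logger name, and the exact message format below are illustrative assumptions, not taken from any particular codebase. The point is only that the message carries who was affected, where, and what went wrong.

```python
import logging

logger = logging.getLogger("app")  # hypothetical logger name


def describe_error(user_id, url, exc):
    """Build a message that says who was affected, where, and why."""
    return (
        f"Error occurred for User {user_id} at url {url}: "
        f"{type(exc).__name__}: {exc}"
    )


def handle_request(user_id, url, action):
    """Run `action`; on failure, log the context and re-raise."""
    try:
        return action()
    except Exception as exc:
        # The vague version would be: logger.error("Error occurred.")
        # exc_info=True also attaches the traceback to the log record.
        logger.error(describe_error(user_id, url, exc), exc_info=True)
        raise
```

Re-raising after logging keeps the failure visible to whatever sits above this code. At 2 AM, a log line with the user and URL already in it is one less thing to dig for.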
Writing code that handles well in a crisis is an essential skill for developers. When developers rely on a hero to solve crises, they are denying themselves the opportunity to develop the skill to write better code. That will impact the company in the short term and the developers’ careers in the long term.
Lastly, there are the issues for the hero. Having the skills to save the day makes the hero valuable in a way. But going from crisis to crisis means the hero only develops the skill to resolve crises. The hero will not develop their ability to prevent crises. If the company can’t prioritize crisis prevention, the hero won’t have time to practice crisis prevention. This affects the hero’s career because their value is tied to a single company. They have less value to another company, which affects their ability to move on if they find they are unhappy.
And they will be unhappy.
Being the hero comes with a number of quality of life issues. Want to take a vacation? Sure, but always have the on-call phone ready and be prepared to take out a laptop at a moment's notice. Want to build something new and interesting? Sure, but do it in between crises. Want to make dinner plans with friends or go on a date? Sure, but be prepared to cancel. Just in case.
I’ve been both the hero and the developer reliant on the hero to save the day. I can honestly say that it is worse being the hero. The praise and the adrenaline can feel great at first, but it doesn’t last. Eventually, there is only exhaustion, resignation, and anger.
“How could that break again?!”
Resolving crises without crisis prevention has a diminishing return on growing as a developer. Eventually you just end up solving the same kind of crisis over and over. There’s no learning in that. I also ended up in a situation where I needed a crisis even though I hated it. I was so used to resolving them that I didn’t know how to function when there wasn’t a crisis. How does one go about doing work uninterrupted? So strange!
So how can we get rid of the culture of hero developers?
The idea is simple, even if implementing it can be challenging. Treat the notion of having a hero as seriously as you would a crisis. When a crisis happens, resolve it. But also take at least one step in preventing something just like it from happening again. It isn’t a guarantee of prevention, but slow progress is still progress. You will eventually get there.
Also prioritize education for the rest of the team. Involve multiple people in every crisis. Maybe that’s just investigating an issue in parallel with the hero. Maybe that’s pairing up directly with the hero. But involve them. Everyone learns better by doing. It may feel like wasted time since the hero can do things faster, but having multiple people capable of resolving future crises is worth that cost.
Neither of these steps is easy to take. It’s never easy trying to think of the long term when you have an emergency. But for all the reasons stated above, these are important steps to take. They break the vicious cycle in which a single hero solves every crisis because the hero is the only one who has ever solved a crisis. It is worth the cost.