Don’t Play the Blame Game in IT: The Privilege of Making Mistakes
Regardless of the scale of your organization or the hardware you use, glitches happen. Sometimes network access drops, a system crashes, or something else goes wrong. Naturally, when a problem is detected, the responsible staff do everything they can to restore functionality as quickly as possible.
After restoration, an analysis of the incident’s causes is conducted, and measures are developed to prevent similar situations in the future. All of this can be described as the “Incident Management Process.”
Typically, companies use one of the following approaches to incident response:
Startups or Small Firms
In the event of any problems or outages, everyone turns to a specific specialist. That person analyzes the situation and resolves the issues.
Mid-sized Business
In a larger company, IT infrastructure support is handled by several employees or even departments. In these situations, general chat groups (in Slack, Teams, etc.) are often created where incidents are discussed collectively. The employee who discovers a problem in their area of responsibility provides colleagues with detailed information and shares a subjective estimate of the time needed for recovery.
Enterprise (Large Business)
In large organizations, incident resolution is a fully established process. Critical systems are monitored 24/7. In the event of an incident, the Monitoring and Response team immediately organizes an emergency conference (a “war room”) and invites the relevant specialists. After the incident is resolved, the suspected cause is recorded, and the data is sent to a special commission, the incident review board. The board meets periodically to analyze every case thoroughly: root causes are identified, preventative measures are developed, and employees are assigned to implement them.
While this sounds modern and high-tech, the devil is in the details…
Let’s look more closely at that last option involving the process and the commission analyzing incidents, keeping in mind that all process participants are human.
The Incident Review Board
This board consists of respected employees holding relevant positions. They rightfully occupy these roles and possess a broad outlook, knowledge, and experience. However, they cannot know everything, and they may not always have the competence to make a correct judgment. For example, they might have significant experience in server administration, while the problem occurred in the networking environment, specifically within dynamic routing protocols they have only read or heard about.
On the other hand, their role implies making a decision, even with insufficient knowledge in a specific area.
There is also a third aspect: the psychological one. Most people in leadership positions have a subconscious fear of losing face and publicly admitting they don’t understand the issue.
Over time, a standard behavioral algorithm develops for members of the review board. The analysis of any incident must give them answers to two questions: “Who is to blame?” and “What do we do?”
The Engineers
Now let’s look at this process through the eyes of the responsible employee. During the incident, they reacted quickly, eliminated the consequences, and restored their system.
Are they a hero?
Yes, absolutely. But almost immediately after the incident, a colleague or manager asks them to prepare a report describing exactly what happened—naturally, with precise timestamps, all details, and in accordance with the established form.
Even if all causes are known and clear, putting them on paper takes time and effort. But what if it turns out the outage happened because of an action taken by the employee themselves? They are human, too. And they don’t want to feel guilty. So, instead of solving current tasks, they start thinking about how to present the facts to avoid liability.
The larger the organization, the more formalized this process becomes, and the stronger the impression that all these reports and reviews are part of a punishment system for the mere fact that an incident occurred.
Who suffers most from this approach?
Let’s take it as a given that humans make mistakes.
IT specialists responsible for keeping infrastructure running are no exception. Let’s assume a specialist makes a mistake in one out of a thousand operations. A simple typo while configuring equipment might not affect the system at all—for example, if an engineer makes a typo in the description field. But with the same probability, a typo in the configuration logic can lead to the loss of a critical company resource or even the entire infrastructure.
In any case, the more operations a specialist performs, the more mistakes they make. In other words, those who work the most make the most mistakes and consequently suffer the most from this attitude toward incidents.
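To make the arithmetic concrete, here is a minimal sketch. It takes the one-in-a-thousand error rate from the text and additionally assumes that mistakes are independent of each other, which is a simplification. The function names are made up for illustration.

```python
def expected_mistakes(operations: int, error_rate: float = 0.001) -> float:
    """Expected number of mistakes, assuming each operation fails independently."""
    return operations * error_rate


def chance_of_any_mistake(operations: int, error_rate: float = 0.001) -> float:
    """Probability of at least one mistake across all operations."""
    return 1 - (1 - error_rate) ** operations


# Compare a light, a moderate, and a heavy workload (illustrative numbers).
for ops in (100, 1_000, 5_000):
    print(ops, expected_mistakes(ops), round(chance_of_any_mistake(ops), 3))
```

Under these assumptions, the expected number of mistakes grows in direct proportion to the workload, so the busiest engineer is also the one most likely to end up in front of the review board.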
It turns out that the process created to combat incidents actually begins to backfire in some cases:
- It wastes more employee time, distracting them from daily tasks and increasing the risk of future incidents.
- It indirectly punishes the hardest workers.
That doesn’t seem quite right, does it?
Criticizing? Propose a Solution!
First and foremost, with any incident management approach, I advise against looking for scapegoats. All participants in the process without exception—users, specialists, managers, and even business owners—should focus on quickly identifying the root causes and eliminating them, rather than finding someone to blame. This is only possible if everyone is confident that telling the truth won’t get them in trouble.
In other words, during any incident, the question should be: “What was done last, and by whom?” And every specialist should answer honestly and without delay: “I did nothing” or “I did X at location Y.” If the incident is caused by human error, this cuts troubleshooting time drastically. If you look deeper, you could say the specialist has a right—or rather, a Privilege—to Make Mistakes.
Judge for yourself. If they are scolded for every slip-up, glitch, or downtime, the natural reaction will be to not touch the system at all. Consequently, they won’t fully understand its capabilities and limitations. This means more time will be needed to react in the event of a system failure.
An experienced specialist is not afraid to make mistakes. Through them, they gain valuable experience and an understanding of the system’s strengths and weaknesses.
At the same time, an experienced specialist never forgets the responsibility and trust placed in them. After all, their actions directly affect the availability of the systems and services entrusted to them, and thus the business as a whole.
The specialist’s motto should be: “You broke it? You fix it!”
Remember Responsibility
Every specialist should know that a single incident caused by their actions will not result in any punishment. However, everyone will expect their active participation in analyzing the causes.
In the case of a repeated incident, the company should consider changing the IT process that might be leading to human error.
Only upon the third recurrence of the same incident involving the same specialist should disciplinary measures be considered.
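The three steps above could be sketched roughly as follows. This is only an illustration: the `IncidentLedger` class and its messages are hypothetical, and it reads "third recurrence" as the third occurrence of the same incident type for the same specialist, which is an assumption about the intended counting.

```python
from collections import Counter


class IncidentLedger:
    """Hypothetical tracker: counts incidents per (specialist, incident type)."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()

    def record(self, specialist: str, incident_type: str) -> str:
        """Register one incident and return the suggested follow-up."""
        key = (specialist, incident_type)
        self._counts[key] += 1
        occurrences = self._counts[key]
        if occurrences == 1:
            # First time: no punishment, but active help with the analysis.
            return "no punishment; specialist joins the root-cause analysis"
        if occurrences == 2:
            # Repeat: look at the process before looking at the person.
            return "review the IT process that may be leading to human error"
        # Third time or later (assumed reading of "third recurrence").
        return "consider disciplinary measures"


ledger = IncidentLedger()
for _ in range(3):
    print(ledger.record("alice", "wrong VLAN on uplink"))
```

Note that the counter is keyed by both the specialist and the incident type, so a different kind of mistake by the same person starts again from the first step.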
To summarize:
- Don’t look for the guilty party. This ultimately leads to an increase in incidents or the time required to resolve them, and consequently to direct losses for the company.
- Instead, try to minimize the time spent finding the root cause.
- If the system isn’t working, nobody touched anything, and it’s unclear what to do with it, just reboot it.