Monitoring Systems: Benefits and Challenges

Alexey Yurchenko

4 months ago

Monitoring systems are, without a doubt, a useful and critical tool. Everyone recognizes their necessity. However, for most organizations, implementing these systems is fraught with difficulties. It often happens that engineers spend most of their time maintaining the monitoring system itself rather than having it help them solve problems.

Let me share some personal experience that will likely help many people better understand the problem.

Monitoring “As Usual”

So, management concludes that a monitoring system is critically important and assigns the task of implementing it. An engineer, who is already busy with their core duties—like configuring networks—is told to “deploy a monitoring system.” Often, no one knows how to do this because it’s their first experience in this area. Let’s say the company has a hundred remote branches, and they need to monitor the availability of communication links with each of them. In this situation, they start looking for free solutions, studying available options, and picking one.

Usually, they choose the simplest solution, one that checks endpoint availability via ping. After spending significant time installing and configuring it, they expect immediate results. Instead, the system starts throwing false positives. For example, a remote branch has a primary and a backup link, and the system monitors both. The engineer gets an alert about link issues, checks availability, and finds that everything is working fine. The monitoring system cried wolf. Some alerts do point to real problems, but most turn out to be false. As a result, instead of helping, the monitoring system scatters the engineers' attention, and people eventually start ignoring it.
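One way to reduce this kind of noise is to evaluate the branch as a whole instead of alerting on each link separately. The sketch below is only an illustration of that idea, not something the setup described above did: the `BranchState` names and the `branch_state` function are hypothetical, and it assumes the ping results for both links are already available as booleans.

```python
from enum import Enum


class BranchState(Enum):
    OK = "ok"              # both links answer
    DEGRADED = "degraded"  # one link is down, the branch is still reachable
    DOWN = "down"          # both links are down, the branch is cut off


def branch_state(primary_up: bool, backup_up: bool) -> BranchState:
    """Classify a branch from the ping results of its two links (hypothetical helper)."""
    if primary_up and backup_up:
        return BranchState.OK
    if primary_up or backup_up:
        return BranchState.DEGRADED
    return BranchState.DOWN


# Only DOWN would justify waking someone up; DEGRADED can wait for a ticket.
print(branch_state(primary_up=False, backup_up=True).value)
```

With a classification like this, a single failed link becomes a low-priority notice rather than an urgent alert, which is one plausible way to keep engineers from tuning the system out.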

Let’s look at another scenario. In addition to remote sites and links, the central office has critical equipment that needs monitoring. You set up availability checks for devices and their parameters. However, when a real outage occurs in the central office—especially if it affects network equipment—the monitoring system turns out to be useless. How can it transmit information if the infrastructure’s root equipment is down? You will most likely hear about the problem via a phone call or someone yelling, “Everything is broken, go fix it!” After the issue is resolved, the monitoring system will come back online and flood your inbox with thousands of emails about what happened. Very timely, right?

Suppose the company realizes the need for a more advanced monitoring system and chooses a solution with SNMP support. This protocol lets a centralized collector poll devices for data about their status, such as CPU load, memory usage, and network interface status, and lets devices send unsolicited notifications (traps) to that collector. After installation and setup, even if you aren’t a specialist in this field, you discover that every device on the network is constantly reporting some kind of “problem.” For instance, one device has constantly high CPU load, and another is low on memory. For that particular device, this state may be perfectly normal. Yet the monitoring system keeps flagging issues, demanding attention and verification. The result is the same: a reluctance to react to constant false alarms and a desire to receive only reliable information about real problems. Ultimately, the monitoring system gets less attention than it should.
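A common way to tame this is to judge each device against its own normal range rather than against one global threshold. Here is a minimal sketch, assuming CPU load samples in percent have already been collected (for example, by SNMP polling). The function name, the three-sigma rule, the one-point minimum tolerance, and the sample data are all illustrative assumptions, not the behavior of any particular monitoring product.

```python
from __future__ import annotations

from statistics import mean, stdev


def is_anomalous(history: list[float], current: float, sigmas: float = 3.0) -> bool:
    """Return True if `current` deviates strongly from this device's own baseline."""
    if len(history) < 2:
        return False  # not enough data to build a baseline yet
    baseline = mean(history)
    spread = stdev(history)
    # Keep a minimum tolerance so a perfectly flat history does not flag tiny changes.
    tolerance = max(sigmas * spread, 1.0)
    return abs(current - baseline) > tolerance


# A router that normally runs at about 90% CPU (made-up samples).
busy_router = [88.0, 90.0, 91.0, 89.0, 92.0]
print(is_anomalous(busy_router, 90.0))  # usual load for this device
print(is_anomalous(busy_router, 35.0))  # far from its usual load
```

The point of the design is that "90% CPU" is not an alert by itself; a departure from what is normal for that device is.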

And the final scenario. Suppose you got really into the setup, and everything is configured perfectly. Data is being collected correctly, and false positives are minimized. But an outage happens after hours. What do you do? If notifications are set up for external communication channels, you’ll get the alert at home. But is it worth driving into work in the middle of the night? If notifications aren’t set up, you’ll only find out about the problem in the morning, which means lost time and a delayed response.

Monitoring: Successful Use Cases

To be constructive, I’ll share some experience I picked up from a professional in the field of monitoring. He told me about several cases where a monitoring system was used to good effect.

The first case involves monitoring a router in a remote office that acts as a gateway for internal networks and has external communication links. Simply tracking external interfaces doesn’t tell you whether the entire router has failed. Monitoring internal interfaces doesn’t give a full picture either, as they can be disabled accidentally or intentionally. High CPU or memory load might not indicate an outage at all but simply reflect normal device operation. So, what do you do? The solution is event correlation. Every office has a set of static infrastructure services: a print server, file server, DNS server, email server. Modern monitoring systems let you link events: to detect a router failure, you watch for all of these services failing within a short time frame, for example 15 seconds. Combined with link monitoring, this gives you grounds to suspect a router failure and call in an engineer.
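The correlation rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: the service names, the `router_failure_suspected` function, and the input format (a mapping from service name to the time it was first seen down, in seconds) are hypothetical, and real monitoring systems express such rules in their own configuration language.

```python
from __future__ import annotations

# Hypothetical names for the static infrastructure services in one office.
BRANCH_SERVICES = {"print", "file", "dns", "mail"}


def router_failure_suspected(
    failures: dict[str, float],
    link_down: bool,
    window_s: float = 15.0,
) -> bool:
    """Suspect a router failure if every service went down within `window_s` seconds.

    `failures` maps a service name to the timestamp (in seconds) at which
    the service was first seen down.
    """
    if not link_down:
        return False  # link monitoring does not confirm the outage
    if not BRANCH_SERVICES.issubset(failures):
        return False  # at least one service is still responding
    times = [failures[name] for name in BRANCH_SERVICES]
    return max(times) - min(times) <= window_s


# All four services failed within 10 seconds and the link check also fails.
seen_down = {"print": 100.0, "file": 103.0, "dns": 104.5, "mail": 110.0}
print(router_failure_suspected(seen_down, link_down=True))
```

A single service failing, or services failing hours apart, does not trigger the rule, which is what keeps it from firing on an ordinary server restart.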

The second case is one of the clearest examples of the value of monitoring systems, even from a financial perspective. A guy who worked in the banking sector told me about this. As you know, banks have an extensive network of service offices, ATMs, and other remote points connected to the central office via communication links.

These links are usually purchased from a large ISP capable of providing coverage for the necessary territory and are secured by a contract. The contract includes an SLA (Service Level Agreement), guaranteeing 24/7 link uptime. All parties agree, the contract is signed, and connectivity is provided.

However, given the scale of the network, the ISP periodically had issues that led to glitches and disconnects. This is where the monitoring system proved its worth. First, the system automatically generated a ticket for the ISP upon detecting a connection break. This ticket, containing the contract number, failure point info, timestamps, and other parameters, was sent to the duty engineer, who only had to confirm it and email it to the ISP’s tech support.
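To make the ticket step concrete, here is a rough sketch of how such a draft could be assembled. Everything in it is assumed for illustration: the `LinkOutage` fields, the `draft_isp_ticket` function, the text layout, and the sample contract number and location are invented, since the bank's actual ticket format is not known.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class LinkOutage:
    contract_number: str   # contract with the ISP covering this link
    failure_point: str     # branch, ATM, or other remote point that lost its link
    detected_at: datetime  # when monitoring first saw the break


def draft_isp_ticket(outage: LinkOutage) -> str:
    """Build a ticket draft for the duty engineer to confirm and forward."""
    return (
        f"Subject: Link outage, contract {outage.contract_number}\n"
        f"Failure point: {outage.failure_point}\n"
        f"Detected at: {outage.detected_at.isoformat()}\n"
        "Status: awaiting confirmation by duty engineer\n"
    )


# Made-up sample data.
outage = LinkOutage("C-0001", "ATM, Main Street", datetime(2024, 1, 15, 3, 42))
print(draft_isp_ticket(outage))
```

Keeping a human confirmation step, as in the story, means a false alarm costs the engineer a glance rather than an embarrassing email to the provider.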

Second, the monitoring system kept statistics on all tickets and generated a monthly report showing the total downtime of communication channels and the details of each incident. This allowed the company to recover significant money through penalties charged to the ISP for SLA non-compliance. Thus, a win-win situation was created: if there were no failures, the ISP had faithfully fulfilled its obligations; otherwise, the bank received compensation for the breach of contract terms.
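The arithmetic behind such a monthly report is simple. The sketch below assumes each incident is stored as a (start, end) pair and that incidents on the same link do not overlap; the function names and sample incidents are invented, and penalty amounts are left out because they depend entirely on the contract.

```python
from __future__ import annotations

from datetime import datetime, timedelta


def total_downtime(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Sum the durations of (start, end) outage intervals, assumed non-overlapping."""
    return sum((end - start for start, end in incidents), timedelta())


def availability_percent(downtime: timedelta, period: timedelta) -> float:
    """Share of the reporting period during which the link was up, in percent."""
    return 100.0 * (1 - downtime / period)


# Two made-up incidents in a 30-day month: 30 minutes and 90 minutes.
incidents = [
    (datetime(2024, 4, 3, 2, 0), datetime(2024, 4, 3, 2, 30)),
    (datetime(2024, 4, 17, 14, 0), datetime(2024, 4, 17, 15, 30)),
]
downtime = total_downtime(incidents)
print(downtime)  # 2:00:00
print(round(availability_percent(downtime, timedelta(days=30)), 3))
```

Comparing the resulting availability figure with the level promised in the SLA is what turns raw ticket statistics into a claim the ISP can verify.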

Afterword

Implementing a monitoring system is a complex task that carries real responsibility. It requires not only specialized knowledge but also an understanding of the company’s business processes. Therefore, it is inefficient to assign its setup to engineers who are busy with other tasks.

For the system to work effectively, you need a dedicated specialist to handle its configuration, tuning, and upkeep. Furthermore, you need staff who can react promptly to emerging incidents: a duty shift or a dedicated network operations center (NOC). With this comprehensive approach, a monitoring system becomes an indispensable tool for the company, significantly simplifying the work of engineers and increasing business efficiency.