Correlating Events: Predicting and Preventing Recurring Outages

  • Background
  • Challenge
  • Method
  • Benefit

Organizations invest a significant amount of time and money in secure and robust networks, but they do not always have an adequate understanding of how well those networks are performing and operating. Monitoring and event management tools are designed to provide insight into the functional operation of the network and its various components, and to reduce the mean time to resolve and repair any problems identified on the network.

When a Department of Defense (DoD) agency noticed that a sporadic sequence of network events was cumulatively stopping network services from functioning, they looked to SMS for a customized monitoring solution to help detect and mitigate the events that were negatively impacting their entire user community.

The challenge was to customize and configure a monitoring solution that could track the sequences of network events that were stopping critical network services and impacting the user community.

Domain Name System (DNS) administrators had been attempting to upgrade their systems, but were spending a significant amount of time keeping their existing DNS systems up and running due to frequent outages. Prior to each outage, the administrators had noticed a sequence of specific events discovered through observation and logs. When this series of events occurred within a limited time frame, they knew the environment would quickly become unreliable and unpredictable.

Working with the DNS administrators, the SMS event management team began monitoring for the event sequences within a sliding time window. The team devised a solution that correlated the associated alerts and suppressed them into a single, critical alarm, which notified network administrators of a potential adverse situation whenever the event sequence was detected. In addition to the alarm, the solution also automated corrective actions. Since the system would gradually lose services before going down completely, the solution's automated actions (such as restarting specific DNS services as they began to deteriorate) kept the DNS running, allowing the administrators to refocus their efforts within a more stable DNS environment.
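The correlation step described above can be illustrated with a minimal sketch. It is not the SMS implementation, which the case study does not detail. It assumes that events arrive in timestamp order, that the warning sequence is an ordered list of event names, and that a match on one host within the window should collapse into one alarm. The names `Event`, `SequenceCorrelator`, and `on_alarm` are hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Event:
    timestamp: float  # seconds; events are assumed to arrive in time order
    host: str
    name: str         # hypothetical event identifier taken from the logs


class SequenceCorrelator:
    """Collapse an ordered event sequence seen within a time window into one alarm."""

    def __init__(self, sequence, window_seconds, on_alarm=None):
        self.sequence = list(sequence)
        self.window = window_seconds
        self.on_alarm = on_alarm  # optional hook, e.g. a service restart (assumption)
        self.recent = {}          # host -> deque of events still inside the window

    def observe(self, event):
        buf = self.recent.setdefault(event.host, deque())
        buf.append(event)
        # Slide the window: discard events older than `window` seconds.
        while buf and event.timestamp - buf[0].timestamp > self.window:
            buf.popleft()
        # Look for the sequence, in order, among the events left in the window.
        idx = 0
        for seen in buf:
            if seen.name == self.sequence[idx]:
                idx += 1
                if idx == len(self.sequence):
                    buf.clear()  # suppress: the matched events raise only one alarm
                    alarm = f"CRITICAL: {event.host} matched {' -> '.join(self.sequence)}"
                    if self.on_alarm is not None:
                        self.on_alarm(event.host)
                    return alarm
        return None
```

A caller would feed each incoming event to `observe` and act only when it returns an alarm string. The `on_alarm` hook is where an automated corrective action, such as restarting a deteriorating DNS service, could be attached.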

SMS further enhanced the management capability of the alarm function with an automated trouble ticket function that would open an incident and populate it with relevant information associated with the situation. The management system was also configured to execute remote commands on the systems hosting the critical services, proactively preventing a failure.
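As a rough illustration of how an incident might be populated, the sketch below folds repeated events into one record and renders it as ticket text. It uses the event details this case study names for the automated emails (host name, IP, first occurrence, last occurrence, tally, and a summary message). The record layout and the names `AlarmRecord`, `record_event`, and `ticket_body` are assumptions for illustration and do not represent the actual ticketing interface.

```python
from dataclasses import dataclass


@dataclass
class AlarmRecord:
    host_name: str
    ip: str
    summary: str
    first_occurrence: float  # timestamp of the first matching event
    last_occurrence: float   # timestamp of the most recent matching event
    tally: int = 1           # number of times the event has been seen


def record_event(records, host_name, ip, summary, timestamp):
    """Fold a repeated event into a single record instead of creating duplicates."""
    key = (host_name, summary)
    rec = records.get(key)
    if rec is None:
        rec = AlarmRecord(host_name, ip, summary, timestamp, timestamp)
        records[key] = rec
    else:
        rec.last_occurrence = timestamp
        rec.tally += 1
    return rec


def ticket_body(rec):
    """Render the record as plain text for an incident or notification email."""
    return (
        f"Host: {rec.host_name}\n"
        f"IP: {rec.ip}\n"
        f"First occurrence: {rec.first_occurrence}\n"
        f"Last occurrence: {rec.last_occurrence}\n"
        f"Tally: {rec.tally}\n"
        f"Summary: {rec.summary}"
    )
```

Keying records on host and summary is one possible design choice: a recurring event updates the tally and last occurrence on the existing record, so an administrator sees one incident with a count rather than a flood of identical tickets.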

The SMS event management solution helped the customer effectively monitor and accurately track the situation while eliminating the overall service failure on their existing systems. Our alerts and tickets fed into the Remedy ticketing system, which generated automated emails populated with all relevant event details, such as host name, IP, first occurrence, last occurrence, tally, and a summary message. This decreased mean time to repair and significantly increased uptime, while also improving the overall contract service metrics and SLA numbers. The SMS solution gave the system administrators the much-needed time to successfully upgrade and enhance the environment that hosts their critical services.