by Ken Shafton | Jul 27, 2018
Know about issues before they affect your business.
Everyone knows they need monitoring to ensure site uptime and keep their business humming. Yet many sites still suffer from outages first reported by their customers. Having managed data center operations for decades, we have the experience to monitor systems of all kinds.
These are some of the most common mistakes we have seen and how to address them.
MONITORING MISTAKE #1
Relying on Individuals and Human-Driven Processes
A situation we have seen many times flows a bit like this:
- You are in the midst of a crisis - were you lucky enough to get Slashdotted?
- A change is made to your data center equipment - a new volume is added to your NetApp so that it can serve as high-speed storage for your web tier.
- Moving quickly, you forget to add the new volume to your NetApp monitoring.
Post-crisis, everyone is too busy breathing sighs of relief to worry about that new volume. It slowly but surely fills up or starts exhibiting latency due to high I/O load.
No one is alerted, and customers are the first to notice, call in, and complain. Quite possibly, the CTO is the next to call.
Remove human configuration as much as possible - not just because it saves people time, but because it makes monitoring - and hence the services monitored - that much more reliable.
When looking at solution features, consider the following:
- Examines monitored devices continually for modifications, automatically adding new volumes, interfaces, load balancer VIPs, database instances, and any other changes into monitoring, and informing you via email in real time or batched notifications, as you prefer.
- Provides filtering and classification of discovered changes to avoid alert overload.
- Scans your subnets, or even your Amazon EC2 account, and automatically adds new machines or instances to monitoring.
- Creates graphs and intelligent dashboards automatically. A dashboard graph based on the sum of the sessions on ten web servers, used to view the health of your service, should update automatically when you add 4 more servers. Automating this collection and representation ensures continuity of your business overview.
Do not depend on manual monitoring updates to cover adds, moves, and changes. It does not happen.
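As a rough illustration of the automated discovery described above, the sketch below compares what a scan finds on each device with what is currently monitored and reports the difference. The function name, the `Change` record, and the device and volume names are all hypothetical; a real monitoring product would obtain the discovered set through its own discovery mechanism.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    device: str
    component: str
    action: str  # "add" (found but not monitored) or "remove" (monitored but gone)


def diff_monitoring(discovered: dict[str, set[str]],
                    monitored: dict[str, set[str]]) -> list[Change]:
    """Compare discovered components with monitored ones, device by device."""
    changes = []
    for device in sorted(discovered.keys() | monitored.keys()):
        found = discovered.get(device, set())
        watched = monitored.get(device, set())
        # Anything present on the device but absent from monitoring must be added.
        for component in sorted(found - watched):
            changes.append(Change(device, component, "add"))
        # Anything still monitored but no longer present should be reviewed.
        for component in sorted(watched - found):
            changes.append(Change(device, component, "remove"))
    return changes


# Hypothetical data: a volume was added during the crisis and never monitored.
discovered = {"netapp01": {"vol0", "vol1", "vol_web_fast"}}
monitored = {"netapp01": {"vol0", "vol1"}}
changes = diff_monitoring(discovered, monitored)
```

Run on a schedule, a comparison like this would flag the forgotten volume without anyone having to remember it.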
MONITORING MISTAKE #2
Considering an Issue Resolved when Monitoring Cannot Detect Recurrence
Outages occur, even when you follow good monitoring practices. An issue is not resolved, though, without ensuring monitoring detects the root cause or is modified to provide early warning.
For example, a Java application experiencing a service-affecting outage due to a large number of users overloading the system probably exhibited an increase in the number of busy threads. Modify your JMX monitoring to watch for this increase. If you create an alert threshold on this metric, you can receive advance warning next time. Early warning at least provides a window in which to avoid the outage: time to add another system to share the load or activate load-shedding mechanisms. Configuring alerts in response to downtime allows you to be proactive next time. The next time you experience an outage, the root cause should never point to a repeated, preventable event.
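A minimal sketch of such a threshold, assuming your JMX collector already supplies the busy-thread count and the pool maximum as plain numbers. The function name and the 75% and 90% ratios are illustrative assumptions, not recommended values; tune them against what you observed before the outage.

```python
def classify_busy_threads(busy: int, max_threads: int,
                          warn_ratio: float = 0.75,
                          critical_ratio: float = 0.90) -> str:
    """Map thread-pool utilization to an alert level.

    warn_ratio and critical_ratio are hypothetical defaults.
    """
    if max_threads <= 0:
        raise ValueError("max_threads must be positive")
    ratio = busy / max_threads
    if ratio >= critical_ratio:
        return "critical"
    if ratio >= warn_ratio:
        return "warning"
    return "ok"
```

The warning level is the early-warning window: it should fire while there is still time to add capacity or shed load.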
This is a very important principle. Recovery of service is the first step; it does not mean the issue should be closed or dismissed. You need to be satisfied with the warnings your monitoring solution gave before the issue, and content with the alert types and escalations that triggered during the issue. It is possible that the issue is one with no way to warn in advance - catastrophic device failure does occur - but this process of evaluation should be undertaken for every service-impacting event.
MONITORING MISTAKE #3
Alert Overload
Alert overload is one of the most detrimental conditions. Too many alerts, triggered too frequently, result in people tuning out all alerts.
You will run into a situation where a critical production outage alert is placed into scheduled downtime for 8 hours because the admin assumed it was another false alert.
You must prevent this by:
- Adopting sensible escalation policies, distinguishing between warnings and error or critical alerts. There is no need to wake people if NTP is out of sync, but if the primary database volume is seeing 200ms latency and transaction time is 18 seconds for an end user, that is critical. You need to be on it, no matter the time.
- Routing the right alerts to the right people. Do not alert the DBA about network issues, and do not tell the networking group about a hung transaction.
- Tuning your monitoring. Every alert must be real and meaningful. Tune the monitoring to get rid of false positives or alerts triggered on test systems.
- Investigating alerts triggered when everything seems okay. If you find there was no outward issue, adjust thresholds or disable the alert. You can also reach out to LogicMonitor to ask about ramifications of disabling an alert.
- Ensuring alerts are acknowledged, resolved, and cleared. Hundreds of unacknowledged alerts make it too difficult to pick out an immediate issue. Use alert filtering to view only the groups of systems for which you are responsible.
- Analyzing your top alerts by host or by alert type. Investigate to see whether remedying issues in monitoring, systems, or operational processes can reduce the frequency of these alerts.
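Two of the practices above, routing by responsibility and reviewing your noisiest alert sources, can be sketched in a few lines. The routing table, team names, field names, and the rule that only error and critical alerts page someone are assumptions made for illustration; your escalation policy will differ.

```python
from collections import Counter

# Hypothetical routing table: alert category -> responsible team.
ROUTES = {"database": "dba", "network": "netops", "storage": "storage"}
DEFAULT_TEAM = "ops"

# Assumed policy: only these severities justify waking someone up.
PAGE_SEVERITIES = {"error", "critical"}


def route_alert(alert: dict) -> dict:
    """Pick the team and delivery channel for a single alert."""
    team = ROUTES.get(alert["category"], DEFAULT_TEAM)
    channel = "page" if alert["severity"] in PAGE_SEVERITIES else "email"
    return {"team": team, "channel": channel}


def top_alerts(alerts: list, key: str, n: int = 3) -> list:
    """Count alerts by a field such as "host" or "type", most frequent first."""
    return Counter(alert[key] for alert in alerts).most_common(n)


# Hypothetical alert history.
history = [
    {"host": "db1", "type": "volume_latency", "category": "database", "severity": "critical"},
    {"host": "db1", "type": "volume_latency", "category": "database", "severity": "critical"},
    {"host": "web1", "type": "ntp_offset", "category": "time", "severity": "warning"},
]
noisiest_hosts = top_alerts(history, "host", n=1)
```

With these assumptions, the NTP warning goes to the default team by email, while the database latency alerts page the DBA.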
MONITORING MISTAKE #4
Monitoring System Sprawl
You need one monitoring system. Do not deploy a monitoring system for Windows servers, another for Linux, another for MySQL, and another for storage. Even if each system is highly functional and capable, having multiple systems guarantees suboptimal data center performance. Your staff needs one place to update contact information. You do not want up-to-date information in the escalation methods of two systems but not the other two.
You do not want maintenance correctly scheduled in one monitoring system but not in the one used to track other components of the same systems. You will experience incorrectly routed alerts, ultimately resulting in alert overload. A system that notifies people about issues they have no ability to acknowledge leads to "Oh… I turned my pager off…"
A variant of this problem is when your SysAdmins and DBAs automate things by writing cron jobs or stored procedures to check and alert on issues. The first part is great - the checking. However, alerting should happen through your monitoring system. Just have the monitoring system run the script and check the output, call the stored procedure, or read the web page. You do not want yet another place to adjust thresholds, acknowledge alerts, deal with escalations, and so on. Locally run, one-off alerting hacks will not incorporate these monitoring system features.
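One way to picture "have the monitoring system run the script and check the output" is a small wrapper that executes an existing check and turns its result into a status the monitoring system can alert on. This is a sketch, not any product's plugin API: the `run_check` name and the exit-code convention (0 ok, 1 warning, 2 critical) are assumptions, and the command below is a stand-in for a real check script.

```python
import subprocess
import sys

# Assumed convention for check scripts: 0 = ok, 1 = warning, 2 = critical.
STATUS_BY_EXIT_CODE = {0: "ok", 1: "warning", 2: "critical"}


def run_check(command: list, timeout_s: float = 30.0) -> dict:
    """Run a check command and report its status and first-hand output."""
    try:
        result = subprocess.run(command, capture_output=True, text=True,
                                timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return {"status": "unknown", "detail": f"check timed out after {timeout_s}s"}
    except OSError as exc:
        # The script is missing or not executable.
        return {"status": "unknown", "detail": f"could not run check: {exc}"}
    status = STATUS_BY_EXIT_CODE.get(result.returncode, "unknown")
    return {"status": status, "detail": result.stdout.strip()}


# Stand-in for a sysadmin's script: prints a message and exits with code 2.
fake_check = [sys.executable, "-c",
              "import sys; print('disk 91% full'); sys.exit(2)"]
example = run_check(fake_check)
```

The script keeps doing the checking; thresholds, acknowledgement, and escalation stay in the one monitoring system.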
MONITORING MISTAKE #5
Not Monitoring Your Monitoring System
Your monitoring solution can fail. Ignoring this fact only leaves you exposed. Companies invest significant capital to set up monitoring and understand the recurring cost in staff time, but then fail to monitor the system. Who knows when a hard drive or memory failure, an OS or application crash, a network outage at your ISP, or a power failure will occur? Don't let your monitoring system leave you blind to the health of the rest of your infrastructure. The monitoring system encompasses the complete system, including the ability to send alerts. If the outgoing mail and SMS message delivery connection is down, your monitoring system might detect an outage, but it is only apparent to staff watching the console. A system that cannot send alerts is not helping.
False security is worse than having no monitoring system at all. If you do not have a monitoring system, you know you need to execute manual health checks. If you have an unmonitored system that is down, you're not executing health checks and you're unwittingly exposing the business to an undetected outage.
Minimize your risk by configuring a check of your monitoring system from a location outside the reach of the monitoring system. Or, select a monitoring solution that is not only hosted in a separate location but also checks the health of its own monitoring solution from multiple locations.
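A minimal sketch of such an outside check, assuming the monitoring system regularly delivers a heartbeat (ideally through the same mail or SMS path it uses for alerts) to an independent host that records the arrival time. The function name and the five-minute limit are hypothetical.

```python
import time
from typing import Optional


def monitoring_is_stale(last_heartbeat_ts: float,
                        max_age_s: float = 300.0,
                        now: Optional[float] = None) -> bool:
    """Return True if the last heartbeat is older than max_age_s seconds.

    Timestamps are seconds since the epoch; `now` can be injected for testing.
    """
    current = time.time() if now is None else now
    return (current - last_heartbeat_ts) > max_age_s
```

Run this from a machine that shares no power, network, or mail path with the monitoring system, and have it notify staff through its own channel when it returns True.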
Omni Data LLC
West Haven, Connecticut
T: 203-387-6664 | W: www.myomnidata.com