The strength of any great monitoring solution lies not just in its ability to report issues and the current state of servers and applications (a plethora of good monitoring tools can do this), but in its ability to resolve problems automatically when they arise. By creating automated tasks, administrators can simplify day-to-day management of servers and applications, allowing support teams to concentrate on only the most critical issues. This is what distinguishes a GREAT monitoring solution from a good one.
Here at 1E, we love the System Center suite of products. You only need whisper the words ‘Config Manager’ or ‘VMM’ in the ear of a 1E consultant to get their pupils dilated and their heart racing. Be careful with ‘Ops Manager’ though, as this can send one particular 1E consultant into a lather.
Why, I hear you ask? Simply put, Microsoft System Center Operations Manager 2007 (or OpsMgr for short) is the premier enterprise-class monitoring solution, and it just got better with the release of OpsMgr R2. But does OpsMgr meet my criteria for a ‘great’ monitoring solution? You bet!
The true power behind OpsMgr lies within the workflow engine that underpins everything. Each Discovery, Rule and Monitor is a workflow composed of individual modules that are executed sequentially to identify the current status, overall health or relationships between objects. It is this workflow engine that empowers administrators to configure automated tasks. In OpsMgr, these are called Diagnostic and Recovery Tasks, and they are available for all Monitor workflows.
To better understand the difference between a Rule and a Monitor, consider the following:
A Rule collects data from various sources (PerfMon, Event Logs, SNMP etc.) and stores that data in the Operations database and the Data Warehouse database. This data is then made available for reporting purposes. Rules (generally speaking) do not generate alerts.
Similarly, a Monitor also collects data, but for a different purpose. A Monitor uses the data it captures to determine the health state of an object. It changes the state of the object (Healthy, Warning or Critical) according to the information being gathered and the Monitor’s configuration (e.g. a CPU utilisation threshold), and may in turn generate an alert or run Diagnostic and Recovery Tasks.
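To make the distinction concrete, here is a minimal Python sketch. It is purely illustrative and is not how OpsMgr is implemented: the class names, the threshold values and the `collect` method are all hypothetical, chosen only to show that a Rule stores what it sees while a Monitor turns what it sees into a health state.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Rule:
    """Hypothetical rule: stores samples for reporting, never touches health state."""
    stored: List[float] = field(default_factory=list)

    def collect(self, sample: float) -> None:
        self.stored.append(sample)


@dataclass
class Monitor:
    """Hypothetical monitor: maps each sample to Healthy, Warning or Critical."""
    warning_threshold: float
    critical_threshold: float
    state: str = "Healthy"

    def collect(self, sample: float) -> bool:
        """Re-evaluate the health state; return True if the state changed."""
        if sample >= self.critical_threshold:
            new_state = "Critical"
        elif sample >= self.warning_threshold:
            new_state = "Warning"
        else:
            new_state = "Healthy"
        changed = new_state != self.state
        self.state = new_state
        return changed


if __name__ == "__main__":
    rule = Rule()
    monitor = Monitor(warning_threshold=80.0, critical_threshold=95.0)  # assumed values
    for cpu in (42.0, 85.0, 97.0):
        rule.collect(cpu)
        if monitor.collect(cpu):
            print(f"CPU {cpu}%: state changed to {monitor.state}")
```

In this sketch the state change returned by the Monitor is the point at which an alert, or a Diagnostic or Recovery Task, would be triggered; the Rule simply keeps accumulating data.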
For more information, please see this TechNet article (http://technet.microsoft.com/en-us/library/bb977440.aspx).
With this in mind, let’s take a look at an example of what you can do with Diagnostic and Recovery Tasks.
In this example, we will look at the Health Service Heartbeat Failure monitor.
Open the Operations Console and navigate to the Authoring window, expand Management Pack Objects and select Monitors. Change the Scope to Health Service Watcher (Agent) and then expand Entity Health and Availability.
Now select Health Service Heartbeat Failure, right-click it, select Properties, and then navigate to the Diagnostic and Recovery tab.
What we can see here are the Diagnostic Tasks (top) and the Recovery Tasks (bottom). From an execution perspective, if the monitor triggers a state change it will automatically launch all Diagnostic tasks that are configured for the specific Health State. It will then launch any recovery tasks, again based on the Health State, passing any parameters from the completed Diagnostic task.
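The execution order described above can be sketched in a few lines of Python. This is a simplified illustration under my own assumptions, not the OpsMgr workflow engine: the `on_state_change` function, the dictionaries keyed by health state and the sample task functions are all hypothetical, and the diagnostic output is modelled as a plain dictionary handed to each recovery.

```python
from typing import Callable, Dict, List

# Hypothetical task shapes: a diagnostic returns its findings,
# a recovery receives the combined findings and reports what it did.
Diagnostic = Callable[[], Dict[str, str]]
Recovery = Callable[[Dict[str, str]], str]


def on_state_change(
    new_state: str,
    diagnostics: Dict[str, List[Diagnostic]],
    recoveries: Dict[str, List[Recovery]],
) -> List[str]:
    """Run every diagnostic configured for new_state, then every recovery."""
    findings: Dict[str, str] = {}
    for diagnostic in diagnostics.get(new_state, []):
        findings.update(diagnostic())
    # Recoveries run only after all diagnostics have completed.
    return [recovery(findings) for recovery in recoveries.get(new_state, [])]


def ping_computer() -> Dict[str, str]:
    return {"ping": "ok"}  # stand-in result, no real ping is sent


def restart_service(findings: Dict[str, str]) -> str:
    return f"restart requested (ping={findings.get('ping', 'unknown')})"


if __name__ == "__main__":
    results = on_state_change(
        "Critical",
        diagnostics={"Critical": [ping_computer]},
        recoveries={"Critical": [restart_service]},
    )
    print(results)
```

Keying both dictionaries by health state mirrors the idea that tasks are configured per state, so a change to Warning would run nothing in this example.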
Within the confines of our example:
1. The agent-managed server does not reply to the heartbeat, causing the Health Service Heartbeat Failure monitor to trigger a state change from Healthy to Critical, indicating that either:
a. the OpsMgr Health Service is stopped, or
b. the agent-managed server cannot be contacted
2. The Health Service Heartbeat Failure monitor executes the Ping Computer on Heartbeat Failure and Check If Health Service Is Running Diagnostic tasks.
3. Once the Diagnostic tasks complete, the monitor executes any Recovery tasks that are configured to run automatically for the specified health state. In our example there are two that relate to the state of the object; however, as you can see, there are additional Recovery tasks that can be enabled and configured to run. These include restarting the Health Service, enabling and restarting the Health Service, and reinstalling the Health Service.
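As a rough illustration of how the two diagnostics in step 2 help tell cases (a) and (b) apart, consider the sketch below. The function name, the return values and the suggested follow-up are my own assumptions for the sake of the example; OpsMgr does not expose a function like this, and the actual recovery behaviour depends on how the tasks are configured.

```python
from typing import Optional, Tuple


def classify_heartbeat_failure(
    ping_succeeded: bool, service_running: bool
) -> Tuple[str, Optional[str]]:
    """Return (likely cause, recovery worth considering) for a heartbeat failure."""
    if not ping_succeeded:
        # Case (b): nothing can be done on the agent if it cannot be reached.
        return ("agent-managed server cannot be contacted", None)
    if not service_running:
        # Case (a): the server answers, but the Health Service is stopped.
        return ("Health Service is stopped", "restart the Health Service")
    # Both checks passed, so the diagnostics alone do not explain the failure.
    return ("undetermined", None)


if __name__ == "__main__":
    cause, suggestion = classify_heartbeat_failure(
        ping_succeeded=True, service_running=False
    )
    print(cause, "->", suggestion)
```

The ping result is checked first because a server that cannot be reached tells us nothing reliable about the state of its services.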
This has been a brief, high-level overview of the Diagnostic and Recovery task capabilities of OpsMgr. In future blog posts I will cover this topic in greater detail, first examining the underlying XML code and configuration of Recovery and Diagnostic tasks, and then showing how to create Diagnostic and Recovery tasks to automate system management.