With the introduction of vSphere 4.0, the possibilities for using alarms in VMware have greatly improved. This article describes some of the options that are now available. We'll go through some of the default alarms and customize them for our environment to make sure they do what we want. Note that we only use email notification in our environment. SNMP traps are also supported, but we don't use them.
Since an alarm definition consists of four tabs, we'll go through them tab by tab:
General
Naming, description, etc.
Alarm Type & Trigger
Monitor for specific condition
Monitor for specific event
Enabling Alarm
Trigger Configuration
Reporting Configuration
Range
Frequency
Actions
After going through all the options I'll tell you how to configure vCenter for email notification and I'll give you my minimal customizations in the vCenter Alarm Definitions.
By default alarm definitions are configured at the vCenter level. So in your vCenter select the object representing the vCenter server, select the tab Alarms and select the Definitions view:
During this walk through we'll focus on the defined alarm “Cannot connect to storage”. In the general tab we'll leave the default alarm name but we'll modify the description so other system administrators know that I changed it:
Default is:
Default alarm to monitor host connectivity to storage device
We'll change it to:
Customized alarm to monitor host connectivity to storage device - sjoerd, 7 June 2011.
As you can see, it's now obvious that the alarm was changed, and by whom.
The alarm type we use now is for hosts. Note that it's possible to create alarms for:
Note that for an alarm to work, it needs to be triggered. In VMware, triggering can be done in two different ways: by monitoring for specific conditions or state, or by monitoring for specific events.
This is the result:
In the tab “Triggers” there are already three events added that can be configured to trigger the alarm:
Lost Storage Connectivity
Lost Storage Path Redundancy
Degraded Storage Path Redundancy
Each of them has a default status of “unset” and can have extra conditions, so it's possible to activate the trigger only when the event happens on a specific datacenter, datastore, host, etc. The default status is not really helpful: it means the event will never trigger the alarm. We'll set the events like this:
Lost Storage Connectivity : Alert
Lost Storage Path Redundancy : Warning
Degraded Storage Path Redundancy : Warning
These options are chosen according to the amount of trouble each event causes. Lost storage connectivity means end users will not be able to work anymore, while lost or degraded path redundancy can impact performance but still lets end users work. We won't set any conditions, since we want the alarm to work on the entire environment.
This gives this result:
Note: As you can see in the screenshot, the alarm will be triggered if ANY of the specified events occur. Since this is a default alarm that we are only slightly customizing, this option cannot be changed. If you want the alarm to be triggered only when ALL events occur, you'll have to create the alarm manually. Then you'll have the option to customize this.
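To make the ANY versus ALL behaviour concrete, here is a minimal Python sketch of the trigger configuration above. It is only a model for illustration, not a vCenter API: the function name `alarm_status`, the `mode` parameter, and the rule that the most severe matching status wins are all assumptions.

```python
# Hypothetical model of the trigger settings chosen above (not a vCenter API).
EVENT_STATUS = {
    "Lost Storage Connectivity": "Alert",
    "Lost Storage Path Redundancy": "Warning",
    "Degraded Storage Path Redundancy": "Warning",
}

# Ordering used to pick the most severe status (an assumption for this sketch).
SEVERITY = {"Unset": 0, "Warning": 1, "Alert": 2}


def alarm_status(observed_events, event_status=EVENT_STATUS, mode="any"):
    """Return the status the alarm would get for a set of observed events.

    mode="any": one matching event is enough to trigger the alarm.
    mode="all": every configured event must have occurred.
    """
    matched = [name for name in event_status if name in observed_events]
    if not matched:
        return "Unset"
    if mode == "all" and len(matched) != len(event_status):
        return "Unset"
    # Report the most severe status among the matching events.
    return max((event_status[name] for name in matched), key=SEVERITY.__getitem__)


print(alarm_status({"Lost Storage Path Redundancy"}))
print(alarm_status({"Lost Storage Connectivity"}, mode="all"))
```

With the default ANY behaviour a single lost path already raises a warning, while in ALL mode a single event leaves the alarm untouched.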
As you can see below, the reporting configuration is not customizable when monitoring for events. That is logical: an event such as lost storage connectivity either happens or it doesn't, so it cannot fluctuate the way a metric like CPU usage can:
It is, however, interesting to dive a little deeper into these options and at least explain what they are supposed to do:
Using Range and Frequency with Alarms
The Range parameter specifies a tolerance percentage above or below the configured threshold. For example, the built-in alarm for virtual machine CPU usage specifies a warning threshold of 75 percent but specifies a range of 0. This means that the trigger will activate the alarm at exactly 75 percent. However, if the Range parameter were set to 5 percent, then the trigger would not activate the alarm until 80 percent (75 percent threshold + 5 percent tolerance range). This helps prevent alarm states from transitioning because of false changes in a condition by providing a range of tolerance.
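The arithmetic behind the Range parameter can be sketched in a few lines of Python. The function names are hypothetical, and the sketch assumes an "is above" condition in which reaching the trigger point is enough to activate the alarm.

```python
def effective_trigger_point(threshold_pct, range_pct=0):
    """Value at which an 'is above' trigger activates: threshold plus tolerance range."""
    return threshold_pct + range_pct


def is_triggered(value_pct, threshold_pct, range_pct=0):
    """True once the measured value reaches the effective trigger point (assumed inclusive)."""
    return value_pct >= effective_trigger_point(threshold_pct, range_pct)


# Built-in VM CPU usage warning: threshold 75 percent, range 0.
print(effective_trigger_point(75, 0))
# Same threshold with a 5 percent tolerance range.
print(effective_trigger_point(75, 5))
print(is_triggered(78, 75, 5))
```

A value of 78 percent would trigger the alarm with a range of 0, but not with a range of 5.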
The Frequency parameter controls the period of time during which a triggered alarm is not reported again. Using the built-in VM CPU usage alarm as our example, the Frequency parameter is set, by default, to five minutes. This means that a virtual machine whose CPU usage triggers the activation of the alarm won't get reported again – assuming the condition or state is still true – for five minutes.
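The Frequency behaviour amounts to a simple suppression window. The following Python sketch models it with a hypothetical `FrequencyGate` class; it illustrates the idea only and says nothing about how vCenter implements it internally.

```python
from datetime import datetime, timedelta


class FrequencyGate:
    """Suppress repeated reports of a triggered alarm within a time window."""

    def __init__(self, minutes=5):
        self.window = timedelta(minutes=minutes)
        self.last_report = None  # time of the last report that was let through

    def should_report(self, now):
        """Return True if the alarm may be reported at time `now`."""
        if self.last_report is None or now - self.last_report >= self.window:
            self.last_report = now
            return True
        return False


gate = FrequencyGate(minutes=5)
start = datetime(2011, 6, 7, 12, 0)
for offset in (0, 3, 5):
    print(offset, gate.should_report(start + timedelta(minutes=offset)))
```

The report at 3 minutes is suppressed, and the one at 5 minutes goes through again because the window has passed.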
In the Actions tab it's possible to define the specific action that should be taken when the alarm gets triggered. An action can be configured for each of four alarm state changes:
From a green circle to a yellow triangle
From a yellow triangle to a red diamond
From a red diamond to a yellow triangle
From a yellow triangle to a green circle
For every action you can define these options:
empty: there is no interest in the transition
once: the action gets performed only once
repeat: the action gets repeated on the frequency defined (from 1 minute to 2 days, 5 minute default)
Now the question is: after how many minutes is it acceptable to send the notification again? The assumption is that whoever gets the first notification will work on it as quickly as possible, since it is a severe warning or alert. However, some repetition is useful in case somebody accidentally overlooks the email. I decided to set it to 240 minutes.
Also, considering what I've set in the trigger configuration, I only want the Alert to be repeated, not the warnings. All this gives me this result:
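As a rough sketch, the same choices can be written down in Python. This is a hypothetical model, not a vCenter API; in particular, leaving the two transitions back towards green empty is an assumption made here for illustration.

```python
REPEAT_MINUTES = 240  # repeat interval chosen above

# Action setting per state transition: None (empty), "once" or "repeat".
EMAIL_ACTION = {
    ("green", "yellow"): "once",    # warning: notify once
    ("yellow", "red"): "repeat",    # alert: notify and keep repeating
    ("red", "yellow"): None,        # assumption: no mail on the way back down
    ("yellow", "green"): None,      # assumption: no mail when the alarm clears
}


def notification_offsets(setting, duration_minutes, repeat_minutes=REPEAT_MINUTES):
    """Minutes after the transition at which an email would be sent.

    duration_minutes is how long the alarm stays in the new state.
    """
    if setting is None:
        return []
    if setting == "once":
        return [0]
    # "repeat": one mail at the transition, then one per interval while the state lasts.
    return [0] + list(range(repeat_minutes, duration_minutes, repeat_minutes))


# An alert that stays red for 10 hours.
print(notification_offsets(EMAIL_ACTION[("yellow", "red")], 600))
```

For an alert that lasts ten hours, this yields a mail at the transition and two repeats, four and eight hours later.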
Note that there are other actions available as well:
Every Alarm has these actions available:
VM- and host-alarms have more actions:
Power on a virtual machine
Power off a virtual machine
Suspend a virtual machine
Reboot host
Shut down host
Before vCenter is capable of sending email it needs to know some email settings. Go to Administration → vCenter Server Settings → Mail and fill in the correct values:
Note: There is no way in vCenter to test this configuration. The best way to test it is to create a custom alarm on something like VM CPU usage and set it to send an email when usage is above, say, 20%. That alarm will be triggered quickly, so you'll soon see whether the emails arrive.
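If the test alarm never produces a mail, it can help to rule out the mail server itself. The Python sketch below sends a test message through the same SMTP server from another machine, using the standard `smtplib` module. It does not test vCenter's own settings, and the host name and addresses in the usage comment are placeholders, not values from this environment.

```python
import smtplib
from email.message import EmailMessage


def build_test_message(sender, recipient):
    """Build a plain-text test mail."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "vCenter alarm mail test"
    msg.set_content("Test message to verify the SMTP relay used by vCenter.")
    return msg


def send_test_message(smtp_host, msg, port=25):
    """Send the message through the given SMTP server (no authentication assumed)."""
    with smtplib.SMTP(smtp_host, port, timeout=10) as server:
        server.send_message(msg)


# Example usage with placeholder values:
# msg = build_test_message("vcenter@example.com", "admin@example.com")
# send_test_message("smtp.example.com", msg)
```

If this mail arrives but the alarm mail does not, the problem is more likely in the vCenter mail settings or in the alarm definition than in the mail server.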
This is an overview of the default alarms as defined in vCenter 4.1 that need to be customized, either as described above or as listed below:
Host connectivity
Alarm Name: Host connection and power state
Description: Customized alarm to monitor host connection and power state - sjoerd - 8 June 2011
Alarm type: Hosts - monitor conditions or state
Alert: Host connection state is equal to Not responding: Send notification email every 4 hours
Alert: Host connection state is equal to Disconnected: Send notification email every 4 hours
Send email to:
it_warmetal_nl
sjoerd_warmetal_nl
HA operations and errors
Alarm Name: Cluster high availability error
Description: Customized alarm to monitor high availability errors on a cluster - sjoerd, 21 December 2011
Alarm type: Clusters - monitor events
Alert: HA host isolated: Send notification email every 4 hours
Alert: All HA hosts isolated: Send notification email every 4 hours
Alert: HA host failed: Send notification email every 4 hours
Send email to:
it_warmetal_nl
sjoerd_warmetal_nl
Host CPU usage
Alarm Name: Host cpu usage
Description: Customized alarm to monitor host CPU usage - sjoerd, 21 December 2011
Alarm type: Hosts - monitor conditions or state
Alert: Host CPU usage is above 90% for 15 minutes: Send notification email every 4 hours
Warning: Host CPU usage is above 75% for 60 minutes: Send notification email once
Send email to:
it_warmetal_nl
sjoerd_warmetal_nl
Host Memory usage
Alarm Name: Host memory usage
Description: Customized alarm to monitor host memory usage - sjoerd - 8 June 2011
Alarm type: Hosts - monitor conditions or state
Alert: Host memory usage is above 90% for 15 minutes: Send notification email every 4 hours
Warning: Host memory usage is above 75% for 60 minutes: Send notification email once
Send email to:
it_warmetal_nl
sjoerd_warmetal_nl
Storage capacity
Alarm Name: Datastore overallocation
Description: New alarm to replace “datastore usage on disk”. Since we use guaranteed storage only, datastore usage is not of any use. Instead we monitor for accidental overallocation of datastores - sjoerd, 21 December 2011
Alarm type: Datastores - monitor conditions or state
Alert: Datastore Disk Overallocation is above 125%: Send notification email every 8 hours
Warning: Datastore Disk Overallocation is above 110%: Send notification email once
Send email to:
it_warmetal_nl
sjoerd_warmetal_nl
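To illustrate how the two overallocation thresholds relate, here is a small Python sketch. It assumes that the overallocation percentage is the provisioned space divided by the datastore capacity; that formula and the function names are assumptions for illustration, not taken from vCenter.

```python
ALERT_PCT = 125    # alert threshold from the alarm above
WARNING_PCT = 110  # warning threshold from the alarm above


def overallocation_pct(provisioned_gb, capacity_gb):
    """Provisioned space as a percentage of datastore capacity (assumed formula)."""
    return 100 * provisioned_gb / capacity_gb


def overallocation_status(provisioned_gb, capacity_gb):
    """Classify a datastore against the warning and alert thresholds."""
    pct = overallocation_pct(provisioned_gb, capacity_gb)
    if pct > ALERT_PCT:
        return "Alert"
    if pct > WARNING_PCT:
        return "Warning"
    return "Normal"


# A 1000 GB datastore with 1150 GB provisioned to virtual machines.
print(overallocation_status(1150, 1000))
```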
Storage connectivity
Alarm Name: Cannot Connect to storage
Description: Customized alarm to monitor host connectivity to storage device - sjoerd, 7 June 2011.
Alarm type: Hosts - monitor events
Alert: Lost Storage Connectivity: Send notification email every 4 hours
Warning: Lost Storage Path Redundancy: Send notification email once
Warning: Degraded Storage Path Redundancy: Send notification email once
Send email to:
it_warmetal_nl
sjoerd_warmetal_nl
Note that the default alarm “Datastore usage on disk” has been disabled and replaced by “Datastore overallocation”.