Business case: Overall performance of virtual infrastructure is fine. One VM is not performing according to what can be expected based on results of the past or according to specs of the vendor.
Step one, defining an incident and performing standard troubleshooting, has failed. Activities in this step may include, but are not limited to:
Determining what has changed
Carry out vendor best practices regarding a virtual environment and the application
Define reservations CPU/RAM
Assign more resources CPU/RAM
Assign more shares CPU/RAM/DISK
Migrate the VM to different host
Migrate the VM to different datastore
Check limits CPU/RAM
Check traffic shaping policies on port group
Check guest OS swap file (1.5 x configured memory)
Check guest OS file compression (should be turned off)
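The swap file rule of thumb in the list above is simple arithmetic. A minimal sketch, assuming sizes are expressed in MB; the function names are hypothetical and only illustrate the 1.5 x rule:

```python
import math


def recommended_swap_mb(configured_memory_mb: int, factor: float = 1.5) -> int:
    """Return the guest swap file size suggested by the 1.5 x rule of thumb."""
    if configured_memory_mb <= 0:
        raise ValueError("configured memory must be positive")
    # Round up so the result is never smaller than factor x memory.
    return math.ceil(configured_memory_mb * factor)


def swap_is_large_enough(configured_memory_mb: int, current_swap_mb: int,
                         factor: float = 1.5) -> bool:
    """Check the current guest swap file against the rule of thumb."""
    return current_swap_mb >= recommended_swap_mb(configured_memory_mb, factor)
```

For a VM with 4096 MB of configured memory this suggests a swap file of 6144 MB.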
According to ITIL you can create a problem if you can't find the solution or the underlying cause of one or more incidents. Defining a problem should allow you to assign more resources to reach a solution: more resources for the VM, more time for troubleshooting, new hardware and so on. I prefer to start with assigning time so the underlying cause can be found. If you know the cause, you can prevent the problem from occurring in the future, and you know for sure that the solution you eventually come up with deals with the specific underlying cause.
So, assuming the time is assigned, how do you tackle this particular VM? By following these guidelines you're sure to find the bottleneck:
First, find a timeframe where the specific problem does not occur. For example during the day, in the evening or over the weekend.
Second, find a timeframe where the specific problem does occur. Again, during the day, in the evening or over the weekend.
Then start monitoring the VM during the first timeframe, to determine if there are any bottlenecks. To do so, monitor the counters below all at the same time.
Now determine whether anything in this timeframe can be fixed to speed things up even more. Remember, this is the timeframe where everything goes well and you want it to be perfect, otherwise you could draw the wrong conclusion at the end.
After solving the bottlenecks in the previous step, monitor the exact same counters during the second timeframe.
Now comes the hard part: you have to combine all the information gathered and determine the exact underlying cause of the problem. There's no shortcut for that. You need years of experience and knowledge to do it.
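Lining up the two timeframes can at least be made mechanical, even if the interpretation can't. A minimal sketch, assuming you exported the samples of each counter per timeframe into a dictionary; the counter names and the layout are hypothetical, not tied to any particular monitoring tool:

```python
import math
from statistics import mean


def summarize(samples):
    """Average each counter over a timeframe (dict: counter name -> list of samples)."""
    return {name: mean(values) for name, values in samples.items() if values}


def compare_timeframes(good, bad):
    """Rank counters by relative change from the good to the bad timeframe.

    Returns tuples of (counter, good average, bad average, relative change),
    largest absolute change first.
    """
    good_avg = summarize(good)
    bad_avg = summarize(bad)
    changes = []
    for name in good_avg.keys() & bad_avg.keys():
        base = good_avg[name]
        delta = bad_avg[name] - base
        if base == 0:
            # No baseline to divide by: any activity counts as an infinite change.
            ratio = 0.0 if delta == 0 else math.copysign(math.inf, delta)
        else:
            ratio = delta / base
        changes.append((name, base, bad_avg[name], ratio))
    changes.sort(key=lambda row: abs(row[3]), reverse=True)
    return changes
```

The counters at the top of the result are only candidates. Deciding which one is the actual cause is still up to you.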
There might be a catch: the above activities will tell you the root of the problem, but not your solution. They will just tell you your bottleneck. Your solution might for example be found in:
It could also be the case that you need to investigate even further:
Network-related problems can be analyzed using Wireshark
Sometimes you need to assign specific processes to specific CPUs (on VM (CPU affinity) or guest level)
Process monitoring on the guest
FC paths (MPIO)
Windows perfmon – average disk queue length. This contains both active and queued commands.
Linux - top
ESX - esxtop - qstats
Windows pagefile (1.5 times the memory size)
or any other activity
VM Counters
Per CPU
Memory
Consumed
Active
Overhead
Swap in
Swap out
Balloon
Usage
Network usage per NIC
Usage
Data transmit rate
Data receive rate
Disk per datastore
Read Rate
Write Rate
Commands Issued
Command Aborts
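To keep the VM counters above together per monitoring interval, something like the following could be used. This is a sketch only: the field names and units are hypothetical, and the check at the end is an added rule of thumb (on a healthy VM the swap, balloon and command abort counters normally stay at zero), not a replacement for reading the numbers yourself.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VmSample:
    """One monitoring interval for a VM; all field names are hypothetical."""
    cpu_usage_pct: float = 0.0
    mem_consumed_kb: float = 0.0
    mem_active_kb: float = 0.0
    mem_overhead_kb: float = 0.0
    swap_in_kbps: float = 0.0
    swap_out_kbps: float = 0.0
    balloon_kb: float = 0.0
    mem_usage_pct: float = 0.0
    net_usage_kbps: float = 0.0
    net_tx_kbps: float = 0.0
    net_rx_kbps: float = 0.0
    disk_read_kbps: float = 0.0
    disk_write_kbps: float = 0.0
    disk_commands: float = 0.0
    disk_command_aborts: float = 0.0


def warning_signs(sample: VmSample) -> List[str]:
    """Name the counters that are non-zero but should normally be zero."""
    checks = [
        ("swap in", sample.swap_in_kbps),
        ("swap out", sample.swap_out_kbps),
        ("balloon", sample.balloon_kb),
        ("command aborts", sample.disk_command_aborts),
    ]
    return [name for name, value in checks if value > 0]
```

A sample with ballooned memory and aborted disk commands would return both names, which points you at memory pressure on the host and at the storage path.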
Host Counters
-
You can use this data sheet to make a summary of the VM you're about to troubleshoot.