Unexplained High CPU Usage on Hyper-V Host and Guests

I have a client with two identical Hyper-V servers running almost identical VMs. Out of the blue, one of the servers started showing high CPU utilization: the host was bouncing between 35% and 50%, and the guests were at 99%. I turned off the guests and rebooted the server, with no change; the host was still at 35-50%. I made sure any unnecessary hardware was disabled or disconnected, again with no change. Experimenting with one of the guest machines, I noticed that CPU utilization would sometimes show system interrupts at 99%, go away for a bit, and then come back with whatever process was active taking over the 99%. After seeing that, I wanted to look at system interrupts on each host machine and compare them.

In the past I had used KernView on 32-bit machines; however, it does not work on modern 64-bit machines. After some digging around on the internet, it turns out KernRate works on 64-bit machines and is included in the Windows Driver Kit (WDK) 7, available here: http://www.microsoft.com/en-us/download/details.aspx?id=11800. If you choose the default install path, the files can be found in C:\WinDDK\7600.16385.1\Tools\Other\amd64.

I wanted to log the output and have it run for a fixed time for comparison. After looking through the help files, I settled on the command 'kernrate -s 30 -yo filename.txt', which takes a 30-second sample and writes it to a file with the chosen name in the same path. I ran the command on both hosts, the one that was having issues and the one that was not. To save space in this post, I will cut to the interesting parts of the resulting log files.
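If you want to repeat the same sample on more than one host, the command above can be wrapped in a small script. This is only a sketch: it assumes the executable is named kernrate.exe and lives in the default WDK path mentioned earlier, and the helper names (build_kernrate_command, run_kernrate) are my own, not part of the tool.

```python
import subprocess
from pathlib import PureWindowsPath

# Assumption: default WDK install path from this post; adjust for your machine.
KERNRATE_DIR = PureWindowsPath(r"C:\WinDDK\7600.16385.1\Tools\Other\amd64")


def build_kernrate_command(seconds: int, out_file: str) -> list[str]:
    """Build the argument list for the kernrate invocation used in this post."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    # Same switches as the command in the post: -s <seconds> -yo <file>.
    return [str(KERNRATE_DIR / "kernrate.exe"), "-s", str(seconds), "-yo", out_file]


def run_kernrate(seconds: int, out_file: str) -> int:
    """Run kernrate from its own directory and return its exit code."""
    cmd = build_kernrate_command(seconds, out_file)
    completed = subprocess.run(cmd, cwd=str(KERNRATE_DIR), check=False)
    return completed.returncode
```

Calling run_kernrate(30, "host-a.txt") on each host would then produce one log per machine, sampled for the same length of time, which keeps the comparison fair.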

Server specs (both servers are the same):
Dell 320
32 GB RAM
Intel Xeon E5-2420 CPU (6 hyper-threaded cores)
Windows Server 2012 with the Hyper-V role installed

Server with issues:

Results for Kernel Mode:

OutputResults: KernelModuleCount = 147

Percentage in the following table is based on the Total Hits for the Kernel

ProfileTime   276703 hits, 10002 events per hit --------

Module        Hits    msec   %Total   Events/Sec
NTOSKRNL    138197   30074     49 %     45961508
HAL         126880   30074     45 %     42197704
WIN32K        7230   30026      2 %      2408394
NTFS          1030   30055      0 %       342773

Server without issues:

Results for Kernel Mode:

OutputResults: KernelModuleCount = 145
Percentage in the following table is based on the Total Hits for the Kernel

ProfileTime   289145 hits, 10009 events per hit --------

Module        Hits    msec   %Total   Events/Sec
NTOSKRNL    244130   29999     84 %     81452620
HAL          41760   29999     14 %     13932992
WIN32K        1650   29999      0 %       550513
IPMIDRV        625   30000      0 %       208520

So I noticed right away that on the server with issues, 45% of the kernel-mode profile hits landed in the HAL, compared with 14% on the healthy server. HAL is short for Hardware Abstraction Layer, the piece of the operating system that allows other parts of the operating system to interact with the physical hardware of the computer. Modern versions of Windows automatically select the HAL based on the processor type, but I still verified that both servers were using the same one. Again I disabled any unnecessary hardware, turned the guest machines off, and updated drivers, running KernRate between each step, all with very similar results.
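The percentages KernRate prints can be recomputed from the raw hit counts, which is handy when comparing several log files. Below is a minimal sketch that pulls module rows out of the text and works out a module's share of the total kernel hits. The row pattern is my assumption based on the excerpts in this post, not a documented log format, and the function names are hypothetical.

```python
import re

# Matches a module row such as "HAL   126880   30074   45 %   42197704".
# Assumption: rows look like the excerpts in this post.
ROW = re.compile(r"^(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s*%\s+(\d+)\s*$")


def module_hits(report: str) -> dict[str, int]:
    """Return {module name: hits} for every module row found in the text."""
    hits = {}
    for line in report.splitlines():
        match = ROW.match(line.strip())
        if match:
            hits[match.group(1)] = int(match.group(2))
    return hits


def share(report: str, module: str, total_hits: int) -> float:
    """Percentage of the total kernel hits attributed to one module."""
    return 100.0 * module_hits(report)[module] / total_hits


# Rows and ProfileTime totals copied from the two logs in this post.
BAD_HOST = """
NTOSKRNL   138197   30074   49 %   45961508
HAL        126880   30074   45 %   42197704
"""
GOOD_HOST = """
NTOSKRNL   244130   29999   84 %   81452620
HAL         41760   29999   14 %   13932992
"""

print(f"HAL share, server with issues:    {share(BAD_HOST, 'HAL', 276703):.1f}%")
print(f"HAL share, server without issues: {share(GOOD_HOST, 'HAL', 289145):.1f}%")
```

The recomputed values come out slightly higher than the whole-number figures in the logs, which suggests KernRate drops the fractional part rather than rounding.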

After testing many configurations, drivers, and multiple reboots, I was frustrated at the hours lost and the lack of understanding of why this was occurring. I had one last resort before declaring a bad CPU or motherboard and calling Dell for warranty service: I upgraded the BIOS and rebooted. I had left all disabled devices disabled and the guest machines off in order to limit the changes. A few minutes after rebooting, I logged back in and opened Task Manager to a pleasant 10% CPU utilization. I re-enabled all devices and turned the guests back on. Everything seemed nice and fast, including the guest performance. I ran KernRate again to see if there was any difference in the results.

After BIOS update on bad machine:

OutputResults: KernelModuleCount = 144
Percentage in the following table is based on the Total Hits for the Kernel

ProfileTime   341514 hits, 10009 events per hit --------

Module        Hits    msec   %Total   Events/Sec
NTOSKRNL    332831   29999     97 %    111047217
HAL           6673   29999      1 %      2226409
IPMIDRV        835   29999      0 %       278593
NTFS           395   29999      0 %       131789

Wow, that is quite a difference from the previous result: the HAL dropped from 45% of kernel hits to 1%, which is even better than the 14% on the machine that was seemingly working well. I am going to schedule a window of time to do the BIOS update on the second machine and see if it achieves a similar result. As with any update or change, please back up your data and double-check that the BIOS update is for the correct machine, as a BIOS update can go south and leave your machine unable to boot.
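To put a number on the improvement, the HAL hit counts from the three logs can be normalized by how long each sample ran. The sketch below uses only figures from the tables in this post; the HalSample class and reduction_factor helper are names I made up for illustration. The comparison is approximate, since the profile interval differs slightly between runs (10002 versus 10009 events per hit).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HalSample:
    label: str
    hal_hits: int  # "Hits" column for the HAL row
    msec: int      # "msec" column for the HAL row

    @property
    def hits_per_second(self) -> float:
        """HAL profile hits per second of sampling time."""
        return self.hal_hits / (self.msec / 1000.0)


# Figures copied from the three KernRate logs in this post.
SAMPLES = [
    HalSample("server with issues, before BIOS update", 126880, 30074),
    HalSample("server without issues", 41760, 29999),
    HalSample("server with issues, after BIOS update", 6673, 29999),
]


def reduction_factor(before: HalSample, after: HalSample) -> float:
    """How many times lower the HAL hit rate is after a change."""
    return before.hits_per_second / after.hits_per_second


for sample in SAMPLES:
    print(f"{sample.label}: {sample.hits_per_second:.0f} HAL hits/sec")

factor = reduction_factor(SAMPLES[0], SAMPLES[2])
print(f"HAL hit rate dropped by a factor of about {factor:.0f}")
```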
