Geek Babble

I have a client with two identical Hyper-V servers running almost identical VMs. Out of the blue, one of the servers started showing high CPU utilization. The host was bouncing between 35% and 50%, and the guests were at 99%. I turned off the guests and rebooted the server: no change, still 35-50% utilization on the host. I made sure any unnecessary hardware was disabled or disconnected; again, no change. While experimenting with one of the guest machines, I noticed that system interrupts would sometimes sit at 99% CPU, go away for a bit, and then whatever process happened to be active would take over the 99%. After seeing that, I wanted to look at system interrupts on each host machine and compare them.

In the past I had used KernView on 32-bit machines, but it does not work on modern 64-bit machines. After some digging around on the internet, it turns out KernRate works on 64-bit machines and ships with the Windows Driver Kit (WDK) 7, which can be downloaded here: http://www.microsoft.com/en-us/download/details.aspx?id=11800. If you choose the default install path, the files can be found in C:\WinDDK\7600.16385.1\Tools\Other\amd64.

I wanted to log the output and have it run for a fixed time so the results could be compared. After looking through the help files, I settled on the command 'kernrate -s 30 -yo filename.txt', which takes a 30-second sample and writes it to a file with the chosen name in the same path. I ran the command on both hosts, the one without issues and the one with issues. To save space in this post, I will cut to the interesting parts of the resulting log files.
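If you want to repeat the same sample on several hosts, a small wrapper can keep the log names straight. The following is only a sketch: it uses the exact kernrate switches from the command above and the default install path mentioned earlier, while the helper names (build_command, run_sample) and the host-based file name are my own assumptions, not anything kernrate provides.

```python
import os
import socket
import subprocess

# Default WDK 7 install path from this post; adjust if you installed elsewhere.
KERNRATE_DIR = r"C:\WinDDK\7600.16385.1\Tools\Other\amd64"


def build_command(seconds, out_file):
    """Return the kernrate command line used in this post as an argument list."""
    exe = os.path.join(KERNRATE_DIR, "kernrate.exe")
    return [exe, "-s", str(seconds), "-yo", out_file]


def run_sample(seconds=30):
    """Run one sample and return the path of the log file (hypothetical naming)."""
    out_file = "kernrate-{}.txt".format(socket.gethostname())
    # Run from the kernrate folder so the log lands next to the executable.
    subprocess.run(build_command(seconds, out_file), cwd=KERNRATE_DIR, check=True)
    return os.path.join(KERNRATE_DIR, out_file)

# On each Hyper-V host, from an elevated prompt:
#     log_path = run_sample(30)
```

Naming the file after the host makes it harder to mix up the two logs when you copy them to one machine for comparison.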

Server specs (both servers are the same):
Dell 320
32GB ram
Intel E5-2420 CPU (6 hyper-threaded cores)
Server 2012 with Hyper-V role installed

Server with issues:

Results for Kernel Mode:
-----------------------------

OutputResults: KernelModuleCount = 147

Percentage in the following table is based on the Total Hits for the Kernel

ProfileTime 276703 hits, 10002 events per hit --------

Module        Hits    msec   %Total   Events/Sec
NTOSKRNL    138197   30074     49 %     45961508
HAL         126880   30074     45 %     42197704
WIN32K        7230   30026      2 %      2408394
NTFS          1030   30055      0 %       342773

Server without issues:

Results for Kernel Mode:
-----------------------------

OutputResults: KernelModuleCount = 145

Percentage in the following table is based on the Total Hits for the Kernel

ProfileTime 289145 hits, 10009 events per hit --------

Module        Hits    msec   %Total   Events/Sec
NTOSKRNL    244130   29999     84 %     81452620
HAL          41760   29999     14 %     13932992
WIN32K        1650   29999      0 %       550513
IPMIDRV        625   30000      0 %       208520

I noticed right away that on the server with issues, 45% of the kernel-mode profile hits landed in the HAL, compared with 14% on the healthy server. HAL is short for Hardware Abstraction Layer, the piece of the operating system that allows the other parts of the operating system to interact with the physical hardware of the computer. Modern versions of Windows automatically select the HAL based on the processor type, but I still verified that both servers were using the same one. Again I disabled any unnecessary hardware, turned the guest machines off, and updated drivers, running KernRate between each step, all with very similar results.
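The percentages kernrate prints are whole numbers, so it can help to recompute each module's share from the raw hit counts when comparing two logs. Here is a minimal Python sketch that does that for the two excerpts above. It assumes the rows look exactly like the ones pasted in this post; the function and variable names (module_shares, TROUBLED, HEALTHY) are made up for illustration, and a full log file may need a more forgiving parser.

```python
import re

# One module row: name, hits, msec, percent, events/sec (layout as pasted in this post).
ROW = re.compile(r"^(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s*%\s+(\d+)$")
# The line that carries the total hit count for the kernel-mode table.
TOTAL = re.compile(r"^ProfileTime\s+(\d+)\s+hits")

TROUBLED = """ProfileTime 276703 hits, 10002 events per hit
Module Hits msec %Total Events/Sec
NTOSKRNL 138197 30074 49 % 45961508
HAL 126880 30074 45 % 42197704
WIN32K 7230 30026 2 % 2408394
NTFS 1030 30055 0 % 342773"""

HEALTHY = """ProfileTime 289145 hits, 10009 events per hit
Module Hits msec %Total Events/Sec
NTOSKRNL 244130 29999 84 % 81452620
HAL 41760 29999 14 % 13932992
WIN32K 1650 29999 0 % 550513
IPMIDRV 625 30000 0 % 208520"""


def module_shares(text):
    """Return {module: percent of total kernel hits} for one log excerpt."""
    total = None
    hits = {}
    for line in text.splitlines():
        line = line.strip()
        match = TOTAL.match(line)
        if match:
            total = int(match.group(1))
            continue
        match = ROW.match(line)
        if match:
            hits[match.group(1)] = int(match.group(2))
    if not total:
        raise ValueError("no ProfileTime line found")
    return {name: 100.0 * count / total for name, count in hits.items()}


troubled = module_shares(TROUBLED)
healthy = module_shares(HEALTHY)
for name in sorted(set(troubled) | set(healthy)):
    # Modules missing from one excerpt are shown as 0.0 for that host.
    print("{:10s} troubled {:5.1f} %   healthy {:5.1f} %".format(
        name, troubled.get(name, 0.0), healthy.get(name, 0.0)))
```

With the numbers above, the HAL works out to roughly 45.9% of kernel hits on the server with issues versus roughly 14.4% on the healthy one, which matches the truncated values kernrate printed.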

After testing many configurations and drivers, and after multiple reboots, I was frustrated at the hours lost and my lack of understanding of why this was occurring. I had one last resort before declaring a bad CPU or motherboard and calling Dell for warranty service: I upgraded the BIOS and rebooted. I had left all disabled devices disabled and the guest machines off in order to limit the changes. A few minutes after rebooting, I logged back in and opened Task Manager to a pleasant 10% CPU utilization. I re-enabled all devices and turned the guests back on. Everything seemed nice and fast, including the guests. I ran KernRate again to see if there was any difference in the results.

After BIOS update on bad machine:

OutputResults: KernelModuleCount = 144

Percentage in the following table is based on the Total Hits for the Kernel

ProfileTime 341514 hits, 10009 events per hit --------

Module        Hits    msec   %Total   Events/Sec
NTOSKRNL    332831   29999     97 %    111047217
HAL           6673   29999      1 %      2226409
IPMIDRV        835   29999      0 %       278593
NTFS           395   29999      0 %       131789

Wow, that is quite a difference from the previous result: the HAL dropped from 45% to 1%, which is even better than the 14% on the machine that was seemingly working well. I am going to schedule a maintenance window to update the BIOS on the second machine and see if it achieves a similar result. As with any update or change, please back up your data first and double-check that the BIOS update is for the correct machine. A BIOS update can go south and leave you with a machine that no longer boots.

Britt Adams – IT Geek

Unexplained High CPU usage on Hyper-V host and Guests