Friday, February 5, 2010

Win32_Service Memory Leak

During the development of SQL Sentry 5.5, we noticed we were receiving errors from some of our watched development servers. The error was from the WMI subsystem and simply stated “Out of Memory.” After some searching for the cause, we realized that on all of the affected watched servers the wmiprvse.exe process was using around 512MB of memory. Some additional searching turned up the following blog post:

http://blogs.technet.com/askperf/archive/2008/09/16/memory-and-handle-quotas-in-the-wmi-provider-service.aspx

in which Mark Ghazai, a member of the Windows Performance Team, discussed the wmiprvse.exe process and its 512MB cap. In a nutshell, the wmiprvse.exe process is the WMI Provider Service, which acts as a host for WMI providers such as Win32_Service. It has a default cap of 512MB which can be adjusted, but in the case of a memory leak that would just be a band-aid. We needed to get to the root of the problem: why was this process spiking to 512MB to begin with?
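
The quotas Mark describes are themselves exposed through WMI, via the __ProviderHostQuotaConfiguration class in the root namespace, so you can check the cap on your own systems. Something along these lines should show the current values (if memory serves, MemoryPerHost is the 512MB cap, expressed in bytes):

# Inspect the WMI provider host quotas; MemoryPerHost is the per-process memory cap, in bytes
Get-WmiObject -Namespace root -Class __ProviderHostQuotaConfiguration |
    Format-List MemoryPerHost, MemoryAllHosts, HandlesPerHost, ThreadsPerHost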

The first thing we noticed was that this problem only showed up on Windows 7 and Windows Server 2008 R2, so it was specific to Windows 6.1. It also happened only on systems we watched, which makes sense because we use WMI heavily. We could look at the wmiprvse.exe process throughout the day and see that its memory usage was steadily rising. A mitigating factor is that this process will terminate itself after a period of inactivity, but with a monitoring system like SQL Sentry we never wait long enough for that idle period to elapse. The question remained: exactly what were we doing that was causing this process to grow on Windows 7 and 2008 R2?
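
If you want to watch this yourself, a quick sketch using Get-Process will show the working set of each wmiprvse.exe instance; run it a few times over the day on an affected machine and you can see the number climb:

# Report the working set (in MB) of every running wmiprvse.exe instance
Get-Process wmiprvse | Select-Object Id, @{Name='WorkingSetMB'; Expression={[math]::Round($_.WorkingSet64 / 1MB, 1)}}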

The next step was to try to profile the process for a memory leak. A quick search in the Debugging Tools for Windows (WinDbg) help revealed a helpful topic called “Using UMDH to Find a User-Mode Memory Leak.” Seeing as that was exactly what I wanted, I started in earnest.

The first step involves setting up your symbols. In order to analyze a memory leak you have to be able to look at the call stacks, and the only way you can get call stack information from an unmanaged executable is with symbols. Fortunately this is pretty easy since Microsoft provides symbol servers. The following command, taken from the documentation, can be used to set up the symbol path.

set _NT_SYMBOL_PATH=c:\mysymbols;srv*c:\mycache*http://msdl.microsoft.com/download/symbols

The next step was to use GFlags to enable UMDH stack traces as outlined in the WinDbg documentation. We started GFlags and turned on Stack Backtrace (Megs) for the wmiprvse.exe image by clicking the checkbox. After that you have to restart the process, so I just killed wmiprvse.exe. It gets auto-launched the first time a WMI query is executed, so it respawned right away.
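
(If you prefer the command line, something like the following should flip the same user-mode stack trace flag for the image, though we used the GUI:)

gflags /i wmiprvse.exe +ust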

Once the process was running we needed to collect our allocation snapshots. To do so, you use:
umdh -p:<processid> -f:<logfilename>
Each time you run the above command, it generates a snapshot of the current allocations. What we are doing here is taking a peek at all the unmanaged memory allocations from the process and their corresponding call stacks. So I ran that once, waited for the memory used by that process to increase by about 1 megabyte, then ran it again using a different log file name.
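
With made-up values (the PID and log file paths here are just examples; substitute the actual wmiprvse.exe process ID), that pair of runs looks something like this:

umdh -p:2668 -f:c:\umdh\snap1.log
umdh -p:2668 -f:c:\umdh\snap2.log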

The next step is to run these files back through umdh to create a differential file. UMDH will compare the allocations in one file to the allocations in the other and determine what memory allocations made in the earlier file still exist and have not been cleaned up by the time the second file was created. This is done using the following command:

umdh <file1> <file2> > <outfile>
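
Using the example file names from above, that would be something like:

umdh c:\umdh\snap1.log c:\umdh\snap2.log > c:\umdh\leakdiff.txt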

The > before <outfile> is just a redirect showing where you want the output to go. This will generate a new, human-readable file. After the symbol listing at the top of the file come the allocations. Not everything in this list is a problem; something could be in the list simply because it hasn’t been cleaned up yet, but in our case one entry always showed up at the top. Furthermore, the numbers got larger as time went on (I only included the top six lines of the call stack):

+   c28ba ( 185174 - c28ba)   1078 allocs    BackTrace2980620
+     83c (  1078 -   83c)    BackTrace2980620    allocations

    ntdll! ?? ::FNODOBFM::`string'+0001A81B
    msvcrt!malloc+00000070
    cimwin32!operator new+00000009
    cimwin32!CWin32Service::LoadPropertyValuesWin2K+000004A1
    cimwin32!CWin32Service::AddDynamicInstancesNT+00000200
    framedynos!Provider::CreateInstanceEnum+00000034

As you can see, CWin32Service is the leaky class, and I presumed it was the code that supplied the functionality for the Win32_Service WMI provider. The next step was validating this outside our code, so I got on a system that SQL Sentry was not monitoring, to make sure nothing else was interfering with my measurements, and ran the following query in wbemtest:

select * from win32_service
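
(You can run the same WQL from PowerShell if you’d rather not use wbemtest; a rough sketch that runs the query and then checks the wmiprvse.exe working set:)

# Run the same query the Win32_Service provider handles, then check the provider host's memory
Get-WmiObject -Query "select * from win32_service" | Out-Null
Get-Process wmiprvse | Select-Object Id, WorkingSet64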

Each time, the wmiprvse.exe process memory went up, but never down. I then decided to throw a heavier test at it, so I whipped up a little PowerShell loop:

for ($i = 0; $i -le 100; $i++) { Get-WmiObject Win32_Service | Format-Table }

Running that caused wmiprvse.exe to continually increase in memory while it was running, so I had my smoking gun and proceeded to file a bug report with Microsoft.

So, where are we now? After going back and forth with Microsoft on this, they have filed it for the next major release of the OS; in other words, it won’t be fixed in Windows 7 or 2008 R2 in any service pack or hotfix. Apparently the changes are “too invasive.” We are currently working with Microsoft to see if we can escalate this and get it fixed. In the meantime we have other options for querying service status, such as using the Service Control Manager; we’re just making sure that approach doesn’t introduce any issues we haven’t seen before. In 5.5 we’ll be including an App.Config option called useScmForServiceStatus that we can turn on and off for testing, or that you can use to switch to SCM if WMI is causing problems in your environment.
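
For what it’s worth, the SCM route never touches wmiprvse.exe at all. As a rough illustration (this is not the actual SQL Sentry code), PowerShell’s Get-Service goes through the Service Control Manager rather than the Win32_Service provider:

# Get-Service uses the Service Control Manager (via ServiceController), not WMI,
# so querying service status this way never hits the Win32_Service provider
Get-Service | Select-Object Name, Status, DisplayName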

5 comments:

Andy Kelly said...

That's pretty sad that MS didn't catch something like that themselves in all their testing. But glad you got to the bottom of it.

Jason said...

Does this fix it? http://support.microsoft.com/kb/977357

Brooke Philpott said...

That actually looks like a leak in svchost.exe. This particular leak occurs in wmiprvse.exe. Both are hosting processes, but svchost.exe wraps the actual core service, whereas wmiprvse.exe wraps the providers; in my case it was the Win32_Service provider that was leaking. I've been given a patch by Microsoft for private testing, and it fixed that issue (although there seems to be another leak in a different DLL).

Unknown said...

Looks like MS released a new hotfix for this:

http://support.microsoft.com/kb/981314

Brooke Philpott said...

Thanks for the heads up. Microsoft informed me of this as well.