CLR Profiler Is Your Friend
08.04.2008This past week, I have been struggling with a strange problem on a client’s ASP.NET 2.0 website. The server, a 64-bit Windows Server 2003 with 4 Gb RAM, used up almost all available memory within a few seconds of starting the w3wp.exe (the IIS 6.0 worker process). During this excessive consumption, the CPU usage was 100%, and then dropped to 1-2% during normal operations. Then every half hour or so, another wave of 100% CPU cycles and a jump in memory consumption, then after a few minutes more CPU cycles and a steep fall in memory consumption. This would go on, wave after wave, hour after hour.
The problem seemed unrelated to the number of concurrent user sessions (approx. 500 at peak), and most of the time the wildly excessive memory consumption didn’t even seem to cause any problems. Yet some times - a couple of times a week - the server response times would drop dramatically or even grind to a halt, yet no OutOfMemory exceptions were ever thrown.
I was very determined to fix the issue, because our client expects the website, which is currently in its infancy, to scale to almost a million registered users within the next couple of years.
As a short term fix, the server was scaled up to 16 Gb RAM, based on the old “when in trouble, double” advice that everyone always gives without solving the underlying problem. Well, surprise surprise, the problem scaled up as well - now the w3wp.exe could consume 7-8 Gb RAM almost right after IIS was started. In waves, just like before.
Using perfmon, I was able to determine that the 7-8 Gb of RAM was mostly to be found in the regular heap in generations 1 or 2. In other words, it seemed that a lot of small objects were being created (since the memory was not used by the large object heap). These objects were being used for a little while, then moved through the garbage collection system. That explained why there were no OutOfMemory exceptions - the memory was actually ready for garbage collection, there were just too many objects to handle it efficiently, so things got slowed down.
Next on the list was to get a fresh copy of one of the world’s most complicated debugging tools, WinDbg. Along with “sos.dll” (found in your .NET framework folder), it can be used to debug .NET memory usage, among other things. Using ADPlus, I created a memory dump of the w3wp.exe process (all 7 Gb of it), and proceeded to attempt to decipher what it all meant.
I should say right now that if you came here looking for advise on how to use WinDbg, you’ve come to the wrong place. The right place is http://blogs.msdn.com/tess/ There’s really no point going anywhere else - Tess is who you want. Either her or call Microsoft Premier Support, which is what I ended up doing.
Back to the matter at hand. Using WinDbg and with the help of a very nice guy from MPS, I was able to determine that my memory dump showed an abnormal amount of one specific type of object being created - SD.LLBLGen.Pro.ORMSupportClasses.EntityField. More than 7 million of these objects were lying around in memory.
For those who don’t know, LLBLGenPro (www.llblgen.com) is one of the world’s most productive Object-Relations Mapping tools for .NET development. I love it. It has literally saved me hundreds of hours of work on previous projects. It allows you to write very elegant and strongly typed code against any database engine (DB2, Oracle, MySql, MSSQL, etc), and supports lazy loading, which was used on this particular project.
This time, however, it seemed that I got bitten in the ass by LLBLGenPro. Somewhere, something was really malfunctioning, but based on previous experience I was pretty sure that LLBLGenPro was not directly to blame. Rather, the problem was probably caused by bad website code.
What to do? Well, WinDbg allows you to look at any object in the heap you like, using obscure commands such as “!dumpheap -stat” and “!dumpobject <address>”. However, I had 7+ million of these objects, and in all likelyhood some of those were bound to be fine/expected, while others were not. So looking at each individual object was a no-go.
Enter my new best friend, CLR Profiler. It’s a world class tool for debugging .NET applications that almost nobody uses, except those who are into .NET games progamming with XNA (like me). But I really think that every .NET developer should grab a copy of CLR Profiler here:
Don’t be scared by the interface or the docs, which are a little hard to comprehend. Just fire up CLRProfiler.exe and hook it up to the .NET application you want to profile. In my case, I hooked it up to the problematic ASP.NET website on a dev machine.
After starting up the tool, I hit the front page of the website a few times and then took a look at the CLR Profiler’s summary page. It looked like this:
The first thing to notice is the amount of garbage collections that had taken place. I had only hit the website a few times, yet generation 0 had already been collected 156 times, and generation 1 more than 50. Far too much, considering that the machine I tested this on had 1 Gb of RAM and the website is fairly simple.
Remember that garbage collection only takes place when it is needed. It’s not like it is scheduled to run every 2 seconds. When writing XNA games, you strive to do all your allocations at the start of the game, then never allocate another object again while the game is being played. That way, you don’t trigger any garbage collection, which causes “hiccups” because the CPU works hard during GC.
When looking at that summary, you may feel like you are miles away from a solution. Yet, the solution is only about 5 clicks away..!
Clicking on Timeline presented me with a simple (ok, insanely complex) looking overview of the garbage collection that had taken place. It looked like this:
The trick here is to select as much of the timeline as you’d like to investigate. The list on the right side of the window will then be sorted according to which type has been allocated the most. In my case, it showed SD.LLBLGen.Pro.ORMSupportClasses.EntityField, just like what I found with WinDbg.
Now here’s why CLR Profiler is my new best friend. You can simply right-click a type and select “Show Who Allocated”. After a few seconds of work, you get the following display:
How cool is that? It’s a walkthrough of the .NET application, which shows clearly how the data got to be allocated. Scrolling a bit, I found my culprit:
One simple little method, AccountManager.GetAvailableCredits() was causing all this fuss. Nothing left to do except to jump into the code, find the mentioned method and see what it did.
It turned out that to calculate the sum of available credits, LLBLGenPro’s fancy lazy loading technique was being used to read all account lines for the user, inflate each one and sum up the amounts to get a total. It looks really neat in C# code, but is a huge performance killer. LLBLGenPro offers many other ways to solve this problem, or a stored procedure can be used. Just never use lazy loading for aggregating data.
Did any of this make sense? If not, I apologize. The main lesson here is to grab a copy of CLR Profiler, read the docs, view the webcast and start optimizing your code!




Posted by Claus Topholt