Monthly Archives: August 2008

Automatic garbage collection is simultaneously the best and worst part of implementing programs in managed code. On the positive side, garbage collecting makes the writing of the program easier by freeing you from the need to explicitly manage memory and it thus makes the runtime more stable as it prevents most memory leaks (Self-referential lists or cycles aside). On the negative side garbage collection takes up a finite amount of time and necessarily is intrusive, as it resets pointers as it compresses heap space. The amount of time spent in garbage collection is precisely the drawback of automated garbage collection : there are no practical bounds to the amount of time needed nor how often the garbage collector will run. In languages that only support a limited number of options for the garbage collector (e.g., .NET CLR and Java pre-1.4) the problem can be even worse as one scales an application to multiple cores (there is a good document at Sun’s website that discusses the linearity problems of garbage collecting and Ahmdal’s Law).

As a practical matter, the garbage collecting time has rarely entered into the average programmer’s consciousness. The issues with garbage collection time and throughput only reveal themselves in a subset of problems, namely highly transactional, stateful, multi-core systems. It is precisely these kinds of systems that current Service Oriented Architectures represent, making them susceptible to problems with throughput.

I ran into this problem with a C#/.NET based system I developed at RedLasso to handle bandwidth and other resource constraints while loading a processing grid. The issue in this case was handling the state of the system to determine which nodes had access tokens and which nodes had queued requests for those tokens. As a pure processing problem this would not even begin to tax the processor on the reference quad core system I was using, since the task simply required accepting messages and pushing and popping from a FIFO queue then sending messages, at a moderate messaging rate of 100msec average. Early tests showed that on a dual core system, the message rate could be maintained somewhere on the order of 1msec average before the processor started to become a bottleneck, so assuming a linear processing model, a real world rate of 100msec should translate into 1/100 of processor resources, and potentially even ½ of that on a quad core system, if the processing itself can be efficiently threaded. In the this case, I was relying on the WSE framework’s thread pools for incoming messages to take advantage of message parsing, while my main worker thread would be the linear part that handled the synchronized queue management. What I found was that the total CPU usage was nearly 70%, and using the performance monitor I found that upwards of 80% of the time was spent in the garbage collector. Changing the flavor of the .NET garbage collector from server to workstation, both concurrent and not, did not make a significant difference.

To try to isolate the problem I created a standalone program that attempts to mimic the message and queue handling. The program consists of a configurable number of producer threads and a single consumer thread. The producer threads create a random sized string and drop those messages on a thread safe queue. The consumer simply pops the data off of the queue and continues on. The production rate is configurable and causes the producers to sleep a random amount of time, centered on the configured rate; if for example the rate is 50 msec, each producer thread will produce a message then sleep for a random amount of time between 0 and 100 msec. The consumer runs at a the configured rate divided by the number of producers, to try to ensure that on average the consumption rate at least matches the production rate. With this program completed, I ran a test on a dual core machine with a 50msec average duty cycle and 16 producer threads and using the windows performance monitor I captured the percentage of time spent in garbage collection as well as the sizes of the various heaps. I also noted the absolute amount of processor time noted by the task manager. The results were quite astounding. On average, the program was using ~10% of *each* processor and spending 6% of the time garbage collection. Changing to the server version of garbage collection reduced the average percentage time to ~2.3%, but did not have any apparent impact on the overall CPU utilization (

Performance monitor using workstation GC

Performance monitor using workstation GC

). Finally, the workstation version had a peak time of 99 % in the garbage collector. The server version seems to peak closer to the 10% range. This number is important because of latency, which I will discuss in the next installment. Remember that the program is simply moving data into and out of a queue and not actually doing any processing other than to create random strings to simulate data packets. Furthermore, the performance counters showed a large amount of collections happening in the generation 1 and generation 2 heaps. Due to the random nature of the producers, I believe that many objects are getting promoted to gen1/gen2. In addition, since the queue itself is a gen2 object, all accesses to the queue will mark it as written and thus require a gen 2 collection, which is more expensive than a gen0 collection. A trace of my runtime system show further calls to the garbage collector, and has the added issue of a large object heap that changes sizes.
It appears that the pattern I use that has work items on queues and worker threads processing those queues, rather than one thread per message, does not work particularly well with the .NET garbage collector in a highly transactional system. This is highly surprising considering this is exactly the pattern that Microsoft calls “port completion” (and I believe claims to have invented).

Given the production issue that I saw and the results from my tests, I decided to compare the performance of the same scenario in another managed environment. I transliterated the code into java and ran the same tests, using the three different garbage collection options available. In each case, the overall cpu utilization was much lower (closer to 1% than to 10% on the task manager), but I got varying numbers for the amount of time spent in garbage collection :

. I used the visualgc tool that comes with jvmstat, from Sun, to monitor the garbage collection times and heaps, which is similar to the performance monitor tool that comes bundled with Windows XP. The results were interesting in all three cases. The default garbage collector performed the worst, spending roughly 0.33% of the time in garbage collection, while doing 1894 collections over a span of 5 minutes. The parallel garbage collector (-XX:+UseParallelGC) was marginally better, spending ~0.29% of the time in garbage collection, with 888 collections over 10 minutes. Finally the concurrent garbage collector (-XX:+UseConcMarkSweepGC) spent only 0.10% of the time in garbage collection, with 303 collections over ~6.5 minutes. In all cases, the java version ran much “better” than the .NET version; a word of caution is needed however, in that there is no good indication of the peak time spent in garbage collection, again it is a question of latency, particularly in transactional systems.
The results of this test make it clear to me that when dealing with message passing distributed systems, managed environments do not end up scaling as well as would be preferred when running on multiple cores. Since the system I just described is what most SOA systems are turning into, the throughput and latency effects introduced by garbage collectors will need to be taken into account when designing and implementing systems, even to the point of choosing one platform over another, which is ironic considering that managed environments are supposed to ameliorate the risks associated with memory management in a system, and make the choice of platform more independent. I would further conclude that until Microsoft opens up the garbage collector or has better implementations of garbage collecting, runtime systems that rely on low latency will continue to suffer if implemented in a .NET runtime.
All code is available on google code at : http://garbagecollectioncomparison.googlecode.com/svn/trunk/
also note that all tests on .NET were done using release builds, not debug builds.
Please grab a copy of the code and run it for yourself, tweak it, and let me know what you find.