OpenCover Performance Impact (part 2)

I think I now have a handle on why I was getting the results I earlier reported i.e. OpenCover and PartCover were not some magical performance boosters that added Go Faster stripes to your code.

After a heads up by leppie and his investigations of using OpenCover on his IronScheme project I realised that I needed to spend some time on optimizing how I get data from the profiler and aggregate it into the report. In case you are wondering IronScheme test that took just shy of 1 minute to run on my machine took over 60mins when running under the profiler, Ouch!

The problem

First of all I should explain what sort of data OpenCover gathers (and why) and then I can describe what I did to improve performance. OpenCover records each visit to a sequence point and stores these visits into shared memory; I did it this way as I am hoping to be able to use the order of visits for some form of path coverage analysis at a later date. After 8000 visits it informs the host process that there is a block ready for processing. The host takes this block, makes a copy, releases the shared memory back to the profiler and then processes the data. After processing the data the hosts then waits for the next message. It was this latter stage that was the bottleneck as the host was spending too much time aggregating the data that the profiler was already ready with the next 8000 points.

An (interim) solution

I say interim solution as I am not finished with performance improvements yet but decided that what I had implemented so far was okay for release. First I looked at how the results were being aggregated and noticed that a lot of the time was being spent looking up the sequence points so that the visit count could be updated, I switched this to a list and mapped the visit count data to the model at the end of the profiling run. This helped but only by bringing the profiling run down to ~40mins.

I realised that I just had to get the data out of the way quickly and process it at a later date, so I added a processing thread and a ConcurrentQueue. This was an interesting turn of events as the target process now finished in 4 mins but the host took nearly 40 mins to process the data and the memory usage went up to 2.5GB and a backlog of 40K messages. Hmmm....

After some toying, whilst looking for inspiration, I noticed that the marshaling of the structure (2 integers) was where most of the time was spent. I switched this to using BitConvertor, which also meant that I could avoid the memory pinning required by the marshaling. Now the target process still ran in just under 4 mins but the backlog very rarely reached 20 messages and memory usage stayed at a comfortable level (<100MB) I decided this was enough for now and released a version of the profiler.

But what about the earlier results?

Those earlier results though were still are cause for thought. Why should the OpenCover dogfood tests be faster but the ironscheme test be so much slower. Well the IronScheme tests were doing a lot of loops and were running parts of the code many 1000's of times whereas the dogfood tests were unit tests and the code was only being run several times before moving onto the next test fixture and next section of code. I am now thinking that the issue is due to the optimization that is normally performed by the JIT compiler, but is turned off by the profiler i.e. when running the tests (without profiler) the JIT compiler spends time optimizing the code but the time spent is not recovered as the code is not run enough times to get a net gain, compared to when the JIT compiler just compiles the non-optimised modified code that the profiler produces.

So in conclusion you may see some speed improvements if running tests where your code is only visited a few times but if you are doing intensive execution of code then don't be surprised if the performance is degraded.