### Why you should take statistics with a pinch of salt

There are lies, damned lies and statistics. Earlier this week, I was changing the implementation of a frequently called method which used a simple boolean flag to denote status for collection of objects. My aim was to add status indicators at the lowest level of granularity. This would mean the evaluation mechanism would have to iterate over this collection to provide status at a global level. My chief concern with the design was that it may introduce an overhead the transaction -- of which our method in question is a part -- could not sustain. I then proceeded to write a simple test to time the method.

I occasionally use the arithmetic mean approach to timing methods. Its quite simple and goes something like this:

Now this may not be the best approach, but it works for me. I had assumed that if you used a large enough sample, you could get pretty accurate results. Except, this time, while I was demonstrating the result to Bill Hubertus, a disturbing though struck me. Statistical approaches such as this could so easily be misused by people, if they so desired. This was hardly an epiphany though. In an instant I recalled all those J2EE v/s .NET showdowns and imagined crooked old nerds concocting their slimy recipes to skew the results one way or the other. But no, this action hero is not out to rid the earth of its scum. So back to our story...

In fact the arithmetic mean approach itself is awfully flawed. For those that took a photography class in place of statistics, here's a refresher: the

The

And then there is the

So the next time you see a report throwing statistical

If you have any pet tricks for measuring method execution times reliably, please comment. But keep them honest.

I occasionally use the arithmetic mean approach to timing methods. Its quite simple and goes something like this:

long startTime = System.currentTimeMillis();

for (int i=0; i<1000; i++) {

methodUnderTest();

}

System.out.println("Avg. method execution time=" +

((System.currentTimeMillis() - startTime)/1000));

Now this may not be the best approach, but it works for me. I had assumed that if you used a large enough sample, you could get pretty accurate results. Except, this time, while I was demonstrating the result to Bill Hubertus, a disturbing though struck me. Statistical approaches such as this could so easily be misused by people, if they so desired. This was hardly an epiphany though. In an instant I recalled all those J2EE v/s .NET showdowns and imagined crooked old nerds concocting their slimy recipes to skew the results one way or the other. But no, this action hero is not out to rid the earth of its scum. So back to our story...

In fact the arithmetic mean approach itself is awfully flawed. For those that took a photography class in place of statistics, here's a refresher: the

**arithmetic mean**is the standard average, in that you take a sum of all the numbers in a sequence and divide them by the number of items in that sequence. If you have a sequence of {1, 2, 3, 4, 5}, the mean is 3. Looks good, eh? But if you calculated the net worth of say, Bill Gates, Warren Buffet and Ashish Shetty, you'd be misled into believing all three were members of the Forbes 500 club. For the record, it would be money lost for Mr Gates and Mr Buffet if they even bent down to pick up a bag of cash equivalent to my annual paycheck from the pavement. Using the arithmetic mean then skews the results higher than they actually are. Which itself is not a problem for my test above. All it does it show a higher execution time; not quite as evil but misleading nevertheless.The

**median**is another way to look at statistics, perhaps a better one. It is the number that is exactly midway in a sequence of numbers. It is the value below which 50% of the scores fall (and need I add, the value above which 50% of the scores fall. ). When there is an even number of scores, the median is the mean of the two centermost scores.And then there is the

**mode**. It is a measure of frequency of occurence. So the most frequent occuring number in a sequence is the mode. I do not know if this is practical in timing methods.So the next time you see a report throwing statistical

*inswinging yorkers*your way, think of this: 98% of all heroin addicts started out by drinking milk. Go figure! And my personal favorite (I use this to discredit most claims): 47.5% of all statistics are made on the spot.If you have any pet tricks for measuring method execution times reliably, please comment. But keep them honest.

## 3 Comments:

jdk5.0 has a new method that's called something like System.nanoSeconds (I don't remember the exact name) which provides much, much more accurate timing.

HOWEVER, your statistical mean may be the best way to go anyways. Just about a month ago, I measured the time it takes to load a very smile icon files from the hard disk. It took around 300 milliseconds! I was shocked - I mean, that's approachable user-noticeable time for this tiny little file!

After talking to several people about it (thus making a complete fool out of myself when I later had to tell them I was wrong) I was struck by the urge to try loading several different icons, and seeing how long that took (Each icon was in a different file on the disk). Well - the first icon took around 300 milliseconds to load, but all the rest took between 20 and 40 milliseconds.

I'm guessing that all that extra time might have been used loading the large number of class files that were needed to do the disk i/o and doing the turn-the-png-file-into-bitmap-format-in-memory conversion. (fyi I was never timing jvm startup time, so that wasn't an issue.)

So just a warning - the "first run" may not produce "typical" results.

By Anonymous, at October 23, 2004 at 11:47 AM

citing an average without an accompaning standard deviation is like asking for 50. ;)

By Anonymous, at October 24, 2004 at 10:52 AM

When running performance tests, I like to compare versions of the code or different approaches.

One version may require a sample size of 100 iterations, while another might require 1,000,000. I don't like changing the test case to compensate for variations. Instead, I prefer a test case that runs for n seconds. I set up the test so it runs for 1, 2, and 4 seconds and if the results are similar, which they almost always are, the test is a success.

Implementation: estimate how many iterations would be required, then run the test for that many. If it wasn't enough then re-estimate with an additional 0.1 seconds and run it again. Finally, report results in Transactions Per Second.

By Anonymous, at November 2, 2004 at 5:47 AM

## Post a Comment | Home | Inference: my personal blog