Archive for October, 2008

Project Euler Problem 7

Problem 7

By listing the first six prime numbers: 2, 3, 5, 7, 11, and 13, we can see that the 6th prime is 13.

What is the 10001st prime number?

Ah, what a nice, straightforward, unambiguous spec! If only business software specifications were so precise.

Way back in problem 3, I took a bit of a wander off-topic and built a prime generator in .Net using the Sieve of Eratosthenes. Armed with this, problem 7 should be easy, right? The sieve implementation generates an IEnumerable, which is non-indexable (i.e. I can’t just say Primes()[10001]), but I can take the first 10,001 and then ask for the last element, which will be the answer to the problem.

There’s a problem with this, however. The sieve requires an upper bound during initialisation. This means it’s great for solving problems like “generate all the primes less than 10,001″, but not so great at answering questions like “what is the 10,001st prime number?”, since it requires foreknowledge of the upper bound.

To illustrate the problem, I’ll take a wild guess at the upper bound. I’m going to guess that the 10,001st prime number is less than 99,999. What happens?

var sieve = new SieveOfEratosthenes(99999);
var result = sieve.Primes().Take(10001).Last();

This generates an answer of 99,991. If I enter this into the Project Euler website, however, it tells me the answer is wrong. Gah! What went wrong? A simple test reveals the problem:

var sieve = new SieveOfEratosthenes(99999);
var primes = sieve.Primes().Take(10001);
var count = primes.Count();

There’s only 9,592 primes generated! As the docs for Take() state (emphasis mine):

Take<TSource> enumerates source and yields elements until count elements have been yielded or source contains no more elements.

Damn. So, looks like my 99,999 guess was too small – with that as an upper bound, the sieve only finds 9,592 primes, and I need the 10,001st. OK, I’ll bump it up by an order of magnitude:

var sieve = new SieveOfEratosthenes(999999);
var result = sieve.Primes().Take(10001).Last();

This gives me the correct answer. Not exactly a wonderful solution though; the idea of having to guess the upper bound is pretty horrendous, and if this was real code it wouldn’t be particularly maintainable – what if the requirements changed and we had to find the nth prime, which happened to be >99,999? We’d have to guess again. Ugh.

Worse, the sieve algorithm precomputes all the primes up to the specified upper bound, meaning that in the above approach I’ve asked the sieve to generate primes up to 999,999 (all 78,498 of them!) despite only needing 10,001. Not very efficient.

Fortunately, the upper bound can be calculated separately. Where n>8601, as in this case, we can use the following equation:

p(n) < n (loge n + loge loge n - 0.9427)

where p(n) is the nth prime number.

Alternatively, for flexibility in handling n<8601, we can use the less accurate

p(n) < n loge loge n

which works for n>5. We can easily precompute the answers for n<=5, or simply calculate on demand.

The formula can be implemented on the sieve class, with a factory method to help when we want to use it:

public static SieveOfEratosthenes
    CreateSieveWithAtLeastNPrimes(int n)
{
    return new SieveOfEratosthenes((long)
            Math.Ceiling(UpperBoundEstimate(n)));
}
 
private static double UpperBoundEstimate(int n)
{
    return n * Ln(n) + n * (Ln(Ln(n)));
}
 
private static double Ln(double n)
{
    return Math.Log(n, Math.E);
}

This leaves us with an overall solution like so:

var sieve = SieveOfEratosthenes.CreateSieveWithAtLeastNPrimes(10001);
var result = sieve.Primes().Take(10001).Last();

This generates a total 10,018 primes, cutting the wasted effort from almost 70,000 superfluous primes to just 17, and takes around 20ms to execute on my machine. Plenty fast enough, I think.

  • Share/Bookmark

Look Before You Look Before You Leap

Generally, I try to avoid turning this blog into some sort of snark-fest about other programmers or blogs. I’ve disagreed with Jeff Atwood once or twice though, and so by posting this I’m probably straying a little close to the edge…but what the hell.

A couple of days ago Coding Horror carried a fluff piece about how all developers should be marketers too. Predictably, the article soon got posted to proggit where it was ripped on by reddit’s resident Jeff-haters, and even more predictably the comments were a mix of interesting insight and barely-concealed hate.

Apparently some of them got up Jeff’s nose a bit, and today he responded. The core of his rebuttal seems to be that you shouldn’t trust what you read on blogs, and should verify everything yourself. True enough, I guess, if perhaps a bit impractical given the sheer amount of information out there.

Then, however, Jeff goes on to give an example by referencing a compression benchmark he’d read on a blog and providing counter-analysis to show that the benchmark was wrong in claiming Deflate is faster than gzip. In doing so, much knowledge was gained.

Or so we are told.

The comment thread quickly becomes a goldmine of humour. Bugs in Jeff’s benchmarking code (not resetting the stopwatch) meant that the durations were cumulative, not independent, with inevitable distortion of the results. Another commenter pointed out that gzip cannot possibly be faster than Deflate, since the gzip algorithm IS the Deflate algorithm plus some additional computation.

“gzip” is often also used to refer to the gzip file format, which is:

  • a 10-byte header, containing a magic number, a version number and a timestamp
  • optional extra headers, such as the original file name,
  • a body, containing a DEFLATE-compressed payload
  • an 8-byte footer, containing a CRC-32 checksum and the length of the original uncompressed data

http://en.wikipedia.org/wiki/Gzip

With the benchmarking code fixed, we see that Deflate is indeed slightly faster than gzip.

All of which leads to repeated quotations from Jeff about the community being smarter than him, and some drastic toning down of language in post-publication edits to the article. I read a cached version of the RSS feed, which is markedly different to the article currently live on codinghorror.com – “on my box, GZip is twice as fast as Deflate” becomes “on my box, GZip is just as fast as Deflate”, “Deflate is way slower. It’s not even close” becomes “Deflate is nowhere near 40% faster”, etc.

Anyone who’s tackled a major performance problem will likely agree that profiling is a tremendously valuable technique that should always be applied before attempting to optimise (i.e. look before you leap). I think this little episode has highlighted a couple of important things to bear in mind, however:

  1. Profiling isn’t a magic wand – if you use buggy profiling code, you are leading yourself up the garden path.
  2. Profiling is less useful when you can reason (in the mathematical sense) about the code. That involves understanding the algorithms you are dealing with. Gzip is Deflate plus a bit more processing – so unless that extra processing has a negative duration gzip must necessarily take longer. You don’t need a profiler to work that out. Look before you look before you leap.

Anyway, enough hatcheting from me, normal service will be resumed shortly.

  • Share/Bookmark