Archive for the ‘Software Engineering’ Category

Test-specific Shims in Production Code

Monday, April 21st, 2008

We’re currently on a fairly major kick to increase automated test coverage of our software. This doesn’t just mean ‘get the unit test coverage up to scratch’; it also means we are working towards full end-to-end integration testing using, amongst other tools, front-end automation tools such as QTP and Selenium.

Of course, nothing is ever easy when trying to polish away the tarnish of ancient code. One particular problem we face regularly is patching up code that breaks the fragile expectations of some of these automation tools.

Some of our applications - including the one I am working to refactor - contain UI widgets that use a lot of custom painting routines and conceal data pretty well. One widget, for instance, needs to display data with a fast refresh rate and so uses a double-buffered approach to avoid flicker. The data it displays, however, is not stored anywhere; it is discarded as soon as it is rendered. And since the whole widget view is rendered as a bitmap and blitted to screen, there’s no convenient hierarchy of panels, labels, text boxes, or any other standard controls.

This, the Automated QA folks tell me, causes a problem since QTP mainly works by reflecting on properties exposed by controls to get at their data. So, if QTP wants to read some data from a text box, it accesses the Text property of that text box. Simple. But this particular widget doesn’t have the equivalent of a Text property.

This isn’t really an oversight from a purely functional point of view, since no part of the actual application code ever needs to get data from the widget - it’s a display mechanism only, not an interactive widget like a text box. Data is received from a web service, processed a bit, and dumped into the widget. The widget is the last object to do anything with the data - no other part of the app ever needs it again.

Since there are no properties on the widget exposing the data, QTP can’t get at it.

Of course, there are ways to keep QTP happy. We can add a few properties to the widget and keep some data around in member variables, or we can write some extensions for QTP that allow it to access some of the widget’s internals. The second way is probably the ‘right’ way since it keeps test-related code external to the application code - but it’s more time-consuming, and also has a training cost since most developers aren’t going to be familiar with QTP’s API.

This leaves the first option. Traditionally I’ve always been a bit wary of having what is effectively test code (since it only exists for testing purposes) deployed with production code. Furthermore, doesn’t it undermine the tests themselves, since they are dependent on code that never gets executed in production?

On the other hand, in some instances it may be the more pragmatic thing to do. It’s difficult to justify spending a day or two writing a few hundred lines of QTP extension code when the same effect can be achieved by adding a single read-only property. It still doesn’t quite sit right with me though, and I can’t find much in the way of authoritative literature that argues one way or the other.
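For illustration only, here is roughly what the ‘single read-only property’ option could look like. This is a minimal Python sketch rather than our actual widget code: the class name TickerWidget, the render method and the displayed_rows property are all made up for the example, and the real widget paints to a bitmap rather than building a string.

```python
class TickerWidget:
    """Hypothetical display-only widget; every name here is illustrative."""

    def __init__(self):
        # Kept around purely so automation tools have something to read.
        self._last_rendered = ()

    def render(self, rows):
        # Stand-in for the custom painting: build the frame off-screen,
        # then hand it over to be drawn.
        frame = "\n".join(rows)
        # Without this line the data would be gone as soon as it was drawn.
        self._last_rendered = tuple(rows)
        return frame

    @property
    def displayed_rows(self):
        """Read-only view of the most recently rendered data."""
        return self._last_rendered
```

The property has no setter, so a test tool can read what was last drawn but nothing can push data back into the widget through it. That keeps the shim small, though it is still code that only exists for the tests.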

Project Euler Problem 3

Monday, April 7th, 2008

Next up in the list of Project Euler problems is this one:

Problem 3

The prime factors of 13195 are 5, 7, 13 and 29.

What is the largest prime factor of the number 600851475143?

This, obviously, is a factorisation problem. There is a colossal amount of material on the web for dealing with prime factorisation - a simple Google search pulls up lots of information. Prime factorisation (and the difficulty of doing it with sufficiently large numbers) is at the heart of the cryptographic methods we currently use on the internet - every time you buy something from Amazon, you are protected by the fact that evil black-hats can’t find the prime factors of your encryption key fast enough to steal your credit card number (OK, bit of a generalisation, but that’s the gist).

One of the key phrases in the above paragraph is ‘sufficiently large numbers’. For a computer, 600851475143 is not a particularly big number, so this problem can be brute-forced fairly easily. Of course, not all brute force approaches are created equal. The most naive algorithm would be something along the lines of a three-pass sweep - firstly test every single number between 2 and 600851475143 to see if it divides cleanly into 600851475143 (pass 1); then test each factor from pass 1 to see if it is prime (pass 2); and finally take the biggest of the pass 2 numbers to get your answer (pass 3).

This would work, but it sucks.

Fortunately, it’s easy to optimise. Let the prime factors of our number N be f1, f2 … fn. If I start with the lowest prime number and work up from there looking for a factor, I know that the first factor I find will be prime (since if it wasn’t prime, it would have smaller factors of its own, which by definition would also be factors of our target number, and I would have found those first). This number is f1. I can divide the target number by f1 and then factorise the result to find f2. Continuing this process will result in a list of prime factors, and then it’s simply a case of selecting the largest.

I can optimise further by not resetting the factor to the lowest prime number each time - since having found f1 I know that there aren’t any smaller factors, so I don’t have to waste time looking for them. Here’s the implementation in Python:

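def primeFactors(n, factor):
    # Collect the prime factors of n, searching upwards from `factor`.
    factors = []
    # The first divisor found at or above `factor` is guaranteed to be prime.
    while (n % factor != 0):
        factor = factor + 1

    factors.append(factor)

    if n > factor:
        # Integer division (//) keeps n an int under Python 3.
        factors.extend(primeFactors(n // factor, factor))

    return factors

print(max(primeFactors(600851475143, 2)))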

Note that in the recursive call the current factor is retained, so that the code doesn’t repeat itself.

This executes pretty quickly, but it could be better. For a start, since 600851475143 is odd there’s no need to start with the only even prime number (2). Instead, I could just start at 3, and in the while loop skip over even numbers. This would cut the number of tested numbers in half.
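Here is a rough sketch of that odd-only variant. It is my own illustration rather than a tuned solution: the function name largest_prime_factor_odd is made up, I’ve written it as a loop instead of recursion, and it deals with 2 up front so that it also works for even inputs.

```python
def largest_prime_factor_odd(n):
    """Largest prime factor of n (n >= 2), testing 2 once and then odd numbers only."""
    largest = None
    # Strip out every factor of 2 first so the main loop can skip even candidates.
    while n % 2 == 0:
        largest = 2
        n //= 2
    factor = 3
    while n > 1:
        if n % factor == 0:
            # Same argument as before: the first divisor found is prime.
            largest = factor
            n //= factor
        else:
            factor += 2  # step over the even numbers
    return largest

print(largest_prime_factor_odd(13195))  # the worked example from the problem text: 29
```

The candidate count is halved, but the search is still unbounded: if n is itself a large prime, the loop walks all the way up to n.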

A more efficient trial division approach, however, would be to generate a list of primes, divide 600851475143 by each prime to find the prime factors, then simply select the largest. To use this solution, a prime number generator is needed.

This is an interesting diversion - I’ve peeked at some of the other Project Euler problems and know that prime numbers will pop up again, so it may prove useful to have a generator handy for when I get to those. Some languages, like Ruby, have library functions that can give you primes, but other languages don’t. If you’re not interested in generating primes and just want to know the answer to problem 3, execute the code above and you’re free to get down from the table.

A Random Walk Off-Topic

The simplest way to generate primes is known as the Sieve of Eratosthenes after the Greek mathematician who invented it. In principle it’s straightforward - take a list of all integers up to an arbitrary limit, then starting from 2 (the smallest prime), mark all the numbers that are multiples of 2. Then move to the next unmarked number (i.e. 3) and mark all the multiples of 3. Then you move to the next unmarked number (5, since 4 was marked as a multiple of 2) and mark all multiples. And so on, until you get to the end of your list. Whatever numbers remain unmarked are all the primes up to your arbitrary limit.

.Net lacks a built-in prime generator, so to demonstrate the algorithm I’ll create a simple C# implementation. The list of numbers is represented as an array of booleans, all set to true by default except indexes 0 and 1 (since we aren’t interested in evaluating those numbers as prime).

The other requirement for a funky contemporary .Net implementation is, of course, to expose the results with IEnumerable. This achieves two things - firstly, it lets the sieve class control enumeration and thus skip over the marked numbers (making the calling code cleaner), and secondly it lets me use LINQ to query it.

So, here’s the code:

public class SieveOfEratosthenes
{
    private bool[] m_numbers;

    public SieveOfEratosthenes(long limit)
    {
        m_numbers = new bool[limit + 1];
        for (long l = 2; l < m_numbers.LongLength; ++l)
        {
            m_numbers[l] = true;
        }

        for (int i = 2; i != -1;
                i = Array.FindIndex(m_numbers, i + 1,
                    b => b == true))
        {
            for (int j = i * 2; j < m_numbers.Length; j += i)
                m_numbers[j] = false;
        }
    }

    public IEnumerable<long> Primes()
    {
        for (long i = 2; i < m_numbers.LongLength; ++i)
            if (m_numbers[i])
                yield return i;
    }
}

Fairly straightforward. Basically, I start by marking every number from 2 upwards as a candidate prime. The outer loop starts at 2, and its inner loop sets all multiples of 2 to false, since no (other) even numbers are prime. Each time round the outer loop, we find the next true element of the array (which will have a prime index), and mark all its multiples false as per the description above. The loop terminates when FindIndex fails to find any more true elements and returns -1.

This results in an array where the only elements with a value of true are those with a prime index. This makes the actual IEnumerable generator very easy to write - it yield returns the index whenever it finds a true element.

There’s a problem with this code, however, that makes it unusable with Euler problem 3 (at least in .Net - hence why I called it a ‘diversion’ earlier, rather than an alternative solution). In .Net, you can’t create an array with 600851475143 elements, since that is way above the maximum array size limit of 2GB. Even if each element is only a single byte, 600851475143 bytes is about 560GB.

Therefore, you can’t create a sieve big enough to solve the problem.

When using trial division it is enough to generate primes only up to the square root of N, with one caveat: N can still have a single prime factor larger than sqrt(N) (e.g. where N=15, sqrt(N) = ~3.873, but the largest prime factor of 15 is 5). It can only have one, because two factors both bigger than sqrt(N) would multiply to more than N. So the trick is to divide each prime factor out of N as it is found; if whatever is left at the end is greater than 1, that remainder is itself prime and is the largest factor. I’ve seen a solution on the Project Euler forum that generates primes up to sqrt(N) + 10, which solves the example of N=15 above, but a fixed fudge-factor can’t solve ALL cases: for N=202 (2 x 101) the largest prime factor is 101, nowhere near sqrt(202) + 10 = ~24.2. For the same reason, stopping at the first prime > sqrt(N) doesn’t work in general either.
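One way to deal with cases like N=15 without any fudge-factor is to divide each prime out of N as it is found and then look at what is left over. The sketch below does that in Python. The function names primes_up_to and largest_prime_factor are my own, it needs Python 3.8 or later for math.isqrt, and it is meant as an illustration of the idea rather than a tuned implementation.

```python
from math import isqrt

def primes_up_to(limit):
    """Sieve of Eratosthenes: all primes <= limit."""
    if limit < 2:
        return []
    # Index i is True while i is still a candidate prime; 0 and 1 never are.
    is_prime = [False, False] + [True] * (limit - 1)
    for i in range(2, isqrt(limit) + 1):
        if is_prime[i]:
            for j in range(i * i, limit + 1, i):
                is_prime[j] = False
    return [i for i, flag in enumerate(is_prime) if flag]

def largest_prime_factor(n):
    """Largest prime factor of n (n >= 2), sieving only up to sqrt(n)."""
    largest = None
    for p in primes_up_to(isqrt(n)):
        while n % p == 0:
            largest = p
            n //= p
    # Anything left over has no prime factor <= sqrt(original n), so it is
    # itself prime, and it is bigger than every prime divided out above.
    return n if n > 1 else largest

print(largest_prime_factor(15))  # sieve only reaches 3, the leftover is 5
```

For 600851475143 the sieve only needs about 775,000 entries instead of 600 billion, which fits comfortably in memory.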

Still, for numbers that fit inside the 2GB limit, we can find the largest factor easily.

var sieve = new SieveOfEratosthenes(15);

Now I have my IEnumerable sieve, I can craft a LINQ query to find the largest factor. All I need to do is filter my list of primes for those which divide directly into 15 and call the Max() method:

return (from p in sieve.Primes()
        where 15 % p == 0
        select p).Max();

Done! And I have a handy reusable prime generator for later on.

The Lightbulb Moment

Tuesday, March 25th, 2008

Weiqi Gao has a post up today discussing the trials of grokking Scala. Scala is a language I want to take a much closer look at later this year, since I want to become current on the JVM again (having not been on talking terms with it since using J2SE 1.4 around the summer of 2003) without being particularly keen on tangling with the Java language itself.

One of the key features of Scala is the functional programming style it brings to the JVM. It’s actually quite common to use certain functional idioms in Java - e.g. passing around a function as a parameter - but the syntax is clunky and verbose (unless and until closures get confirmed in 1.7, that is, and maybe even then).

For example, take a look at this very simple idiomatic code for spinning off a thread to perform an expensive operation:

new Thread (new Runnable() {
    public void run() {
        someObj.doExpensiveOperation();
    }
}).start();

Now that’s not the most hideous code I’ve ever seen, but it’s a bit…wordy. Compare it to this equivalent implementation in Java’s closest mainstream relative, C#:

new Thread(x => someObj.DoExpensiveOperation()).Start();

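I much prefer this syntax, even taking into account the throwaway lambda parameter, which is only there because a one-parameter lambda binds to the ParameterizedThreadStart signature (a parameterless lambda, () => ..., would bind to ThreadStart and avoid it). The Scala syntax is even nicer, however: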

spawn({ someObj.doExpensiveOperation })

This is the sort of thing that piques my interest about the language - expressive syntax and a very funky concurrency model will get my attention, especially when running on something as mainstream as the JVM and with full interoperability with the frankly staggeringly-vast Java library ecosystem. I like F# on the CLR for similar reasons.

But I digress; what I wanted to talk about was a point made by Weiqi, when discussing the pattern-matching capabilities of Scala:

Pattern matching in Scala is exactly the point at which I would spend time trying to understand it, trying to master it, trying to learn to use it. I understand the syntax. I understand the explanation that the speakers in presentations gave. I do get to the part where I say “This is cool.” But I never get to the point where I would see a problem and say “This problem is best solved with pattern matching, let me fire up Scala and code the solution.”

This strikes a chord for me, as I have gone through that stage once or twice myself with other features in other languages and yet can’t quite put my finger on how I get past it. I don’t think it’s something you consciously do - it’s just something you keep grafting away at until suddenly you realise that the technique, whatever it is, has become part of your armoury.

Closures are an obvious example I can think of in my own background. I was raised as a straight-down-the-middle C++ man, way back in the early/mid-90s, cutting my teeth on Borland Turbo C++ 3 on Windows 3.1. When I first started to play with functional languages it took a long time for me to ‘get it’, and even when I understood what a closure was after a couple of weekends hacking around in OCaml, I couldn’t envisage when I’d ever need one.

Soon after, whilst working on a Konfabulator widget in JavaScript, I noticed I was using them all the time. I suddenly had much more insight into what Ruby blocks were doing. It wasn’t so much that I noticed the lights go on - they’d been on for some time and I hadn’t realised.

People commonly refer to the ‘lightbulb moment’ or ‘the lights went on’ as being the point where a flash of inspiration hits and everything suddenly makes sense. I don’t like this metaphor. If I need to go to the bathroom in the middle of the night, when the lights go on I squint in pain and stagger around just as blindly as I did before. But then I acclimatise, and all becomes clear. And so it is, I think, with learning alien concepts - you need a bit of time to adjust to the dazzling light.

Project Euler Problems 1 and 2

Saturday, March 22nd, 2008

Browsing through Nate Hoellein’s blog recently led me to Project Euler. This is a problem - I have a horrendous feeling I’m about to get addicted to it, at the cost of just about everything else that normally occupies my free time. Ack.

Still, at least it provides some blogging material. I’m going to start working my way through the list, and try to create idiomatic solutions in a number of languages. I won’t always look for the most efficient solution, since I’m also interested in expressiveness (see here and here for previous posts on the subject).

To start with, here’s some code and thoughts for problems 1 and 2.

Problem 1

Add all the natural numbers below 1000 that are multiples of 3 or 5.

This is generally regarded as the easiest Euler problem, so shouldn’t present too many problems. Mainstream software development is still dominated by imperative languages and styles, so the most recognisable solution to this would be a straightforward for-loop. Here is an imperative C# solution:

int result = 0;
for (int i = 0; i < 1000; ++i)
{
    if (i % 3 == 0 || i % 5 == 0)
        result += i;
}

Simple enough, but as with all for-loops the guts are a little too visible. I have to explicitly declare and increment an accumulator variable as well as the loop counter. A functional style (Haskell in this case) allows a more declarative solution:

sum $ filter (\n -> n `mod` 3 == 0 || n `mod` 5 == 0) $ [1..999]

Or, with list comprehensions:

sum [n | n <- [1..999], n `mod` 3 == 0 || n `mod` 5 == 0]

In both cases, the loop is replaced by a list generated from Haskell’s range operator. [1..999] creates a list containing every integer between 1 and 999 inclusive. The modulo test is basically the same, though Haskell lacks a modulo operator (% in most C-family languages) so the mod function is used instead.

Just for fun, here’s an F# solution too:

let sum = List.fold_left (+) 0
let mod35 = fun x -> x % 3 = 0 || x % 5 = 0

List.filter mod35 [1..999] |> sum

This could be compressed into a one-liner like the Haskell solutions, but it would be a bit long for my taste. Also note F# is slightly hamstrung by the lack of a built-in sum function, so I have to define my own using fold. Another F# solution is here, but I prefer mine. There’s a very nice snippet in the comments of that page, though, which I like even more:

Seq.fold1 (+) {for i in 1..999 when i % 3 * i % 5 = 0 -> i}

Interestingly, C# is gaining some fairly powerful functional techniques lately, in particular LINQ. I can use the new Enumerable class to mimic range syntax and filter functionality from other languages, and lambdas to keep the code concise.

Enumerable.Range(1, 999).Where(
        f => f % 3 == 0 || f % 5 == 0).Sum();

Note the similarity between F#/Haskell’s lambda syntax and that of C#. It’s very cool that a mainstream C-derivative language is getting this sort of syntax added to it.

Alternatively, I could use LINQ query expressions for a different approach:

var nums = from n in Enumerable.Range(1, 999)
           where n % 3 == 0 || n % 5 == 0
           select n;
nums.Sum();

Fun!

It should be noted that all the Project Euler problems I’ve seen so far have mathematical solutions, meaning if you are able to classify the problem correctly it is straightforward to work out the answer with pen and paper. In this case, the problem is based around an arithmetic progression, and there are powerful formulae for reasoning about those. If you’re interested, check out the forum for problem 1.
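As a taste of the pen-and-paper approach, here is a small Python sketch of one such formula. It is my own reading of the arithmetic progression idea, not something lifted from the forum, and the function names are made up. The multiples of k below a limit are k, 2k, ..., mk, which sum to k * m(m+1)/2, and multiples of 15 have to be subtracted once because they get counted under both 3 and 5.

```python
def sum_of_multiples_below(k, limit):
    """Sum of the positive multiples of k strictly below limit."""
    m = (limit - 1) // k          # how many multiples of k are below limit
    return k * m * (m + 1) // 2   # k * (1 + 2 + ... + m)

def problem_1(limit=1000):
    # Multiples of 15 are counted under both 3 and 5, so subtract them once.
    return (sum_of_multiples_below(3, limit)
            + sum_of_multiples_below(5, limit)
            - sum_of_multiples_below(15, limit))

# Cross-check against the brute-force loop from earlier in the post.
assert problem_1() == sum(n for n in range(1000) if n % 3 == 0 or n % 5 == 0)
```

No loop over the numbers at all, so the same function would cope with a limit of a billion just as happily as a thousand.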

Problem 2

Each new term in the Fibonacci sequence is generated by adding the previous two terms. By starting with 1 and 2, the first 10 terms will be:

1, 2, 3, 5, 8, 13, 21, 34, 55, 89, …

Find the sum of all the even-valued terms in the sequence which do not exceed four million.

Ooh, Fibonacci. I’ve been here before. Using the Haskell code from that post makes problem 2 a snip:

fib = 0 : 1 : zipWith (+) fib (tail fib)
sum $ filter even $ takeWhile (<4000000) fib

Given the lazy Fibonacci generator in the first line, this just uses standard Haskell functions from the Prelude to do all the work - reading right to left, takeWhile pulls data from the fib sequence until the test fails (i.e. we’ve reached 4,000,000), filter even does exactly what it says on the tin, and sum does the business on the result.

C# could solve problem 1 very neatly - can it keep up the pace in problem 2? Actually, yes it can, after a fashion. The lazy Fibonacci generator can be implemented using the yield statement added in C# 2.0. This is much more efficient than the naive recursive solution I looked at in my previous post about Fibonacci sequences. Once I have the generator, the LINQ statement is very concise and quite similar to the Haskell code - C# 3.0 even has TakeWhile!

IEnumerable<long> Fibs()
{
    long a = 0, b = 1;

    while (true)
    {
        yield return b;

        b = a + b;
        a = b - a;
    }
}

Fibs().TakeWhile(f => f < 4000000).Where(f => f % 2 == 0).Sum();

The more I use C# 3.0, the more I like it - there’s quite a bit of power in there.

As with problem 1, there are some fascinating mathematical tricks that can be utilised when solving problem 2, and I recommend you check out the forum. It’s particularly cool to see how the Golden Ratio can be brought into play when working with Fibonacci sequences - I had no idea these techniques existed. So much to learn!
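To give a flavour, here is one of the simpler tricks, sketched in Python. Note that this is not the Golden Ratio approach, just a recurrence I am confident about: every third Fibonacci number is even, and each even term is four times the previous even term plus the one before that (2, 8, 34, 144, ...). The function name sum_even_fibs is my own.

```python
def sum_even_fibs(limit):
    """Sum of the even Fibonacci numbers that do not exceed limit."""
    total = 0
    a, b = 2, 8  # the first two even terms of the sequence
    while a <= limit:
        total += a
        # Next even term: E(k) = 4 * E(k-1) + E(k-2)
        a, b = b, 4 * b + a
    return total

print(sum_even_fibs(100))  # 2 + 8 + 34 = 44
```

This skips the odd terms entirely, so there is no filtering step, though for a limit of four million the saving is purely academic.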

The P.G. Wodehouse Method Of Refactoring

Friday, March 21st, 2008

I am much given to ruminating on refactoring at the moment, as one of my current projects is a major overhaul of a fairly large (>31,000 lines) application which has exactly the kind of chequered history any experienced developer has learned to fear - written by many different people, including short-term contractors, at a time in the company’s life when first-mover advantage was significantly more important than coding best-practice, and without any consistent steer on the subjects of structure, coding conventions, unit tests, and so on.

In other words, here be dragons.

In fairness, the application works and has been a critical part of a company that has gone from nothing to market-leading multinational in 7 years, so it has certainly pulled its weight. It is in desperate need of a spring-clean though, and my team volunteered to spend 3 months evicting the cobwebs and polishing the brasswork.

Yes, volunteered - it’s a fascinating challenge, though perhaps not something you’d want to make a career of.

Now, the first mistake to avoid here is the compulsion to throw it away and rewrite from scratch. So often when confronted with a vast seething moiling spiritless mass of code a developer throws his hands into the air and declares it a lost cause. How seductive is the thought that 31,000 lines of code could be thrown away and replaced with ~15,000 lines of clean, well-designed, beautiful code?

Sadly, that’s often a path to disaster. It’s almost a rule of the game. jwz left Netscape because he knew their decision to rewrite from scratch was doomed. Joel Spolsky wrote a rant about the same decision - in fact, the Netscape rewrite is commonly cited as a major factor in Netscape losing the first browser war.

The problem is that warty old code isn’t always just warty - it’s battle-scarred. It has years of tweaks and bug-fixes in there to deal with all sorts of edge conditions and obscure environments. Throw that out and replace it with pristine new code, and you’ll often find that a load of very old issues suddenly come back to haunt you.

So, a total rewrite is out. This means working with the old code, and finding ways to wrestle it into shape. Naturally, Working Effectively With Legacy Code now has an even more firmly established place on my ‘critical books’ bookshelf than it did before.

Inspiration came from a less well-known book, however. Buried in Chapter 10 of Code Reading is a single paragraph suggesting that it can be useful when working with unfamiliar code to paste it into a word processor and zoom out, getting a ‘bird’s eye’ view.

One other interesting way to look at a whole lot of source code quickly under Windows is to load it into Microsoft Word and then set the zoom factor to 10%. Each page of code will appear at about the size of a postage stamp, and you can get a surprising amount of information about the code’s structure from the shape of the lines.

(Spinellis, 2003)

The idea is that this lets you immediately identify potential trouble spots - if you see pages where the code is all bunched up on the right, it indicates massive nesting and over-long functions. If you see heavy congestion, it indicates dense code. It’s also easy to spot giant switch statements and other crimes against humanity.

Of course, you don’t actually need MS Word to do this - the Print Preview in Open Office is more than sufficient, and no doubt most office suites can do the same.

This 50,000ft view could be a useful tool in tracking progress. I mean sure, we can have our build system spit out cyclomatic complexity and code size metrics, but wouldn’t it be neat if we could do a weekly bird’s-eye printout of the source code and pin it up on the wall, giving a nice simple visual representation of the simplification of the code?

Except, of course, that with average page lengths of 45 lines we’d need almost 700 pages each time, and a hell of a lot of wall space.

A better solution would be to print a class per page. At the start of the project, the application had about 150 classes, and the refactoring effort is focussed on about 80 of those. Initially, gigantic classes would be an incomprehensible smudge of grey, but as the refactoring process starts tidying the code and factoring out into other classes, the weekly printout would start to literally come into focus, hopefully ending up with many pages actually containing readable code (which happens roughly when the class is small enough to fit on no more than 3 pages at normal size).

The first time we pinned up the printouts, I suddenly recalled a Douglas Adams foreword reprinted in The Salmon of Doubt. Adams was a great fan of P.G. Wodehouse, and explained Wodehouse’s interesting drafting technique:

It is the next stage of writing—the relentless revising, refining, and polishing—that turned his works into the marvels of language we know and love. When he was writing a book, he used to pin the pages in undulating waves around the wall of his workroom. Pages he felt were working well would be pinned up high, and those that still needed work would be lower down the wall. His aim was to get the entire manuscript up to the picture rail before he handed it in.

(Adams, 2002)

Hmm, isn’t redrafting a literary cousin of refactoring? In many ways, I think it is - so why not apply this technique to refactoring?

And we’ve made it so. We tied a piece of string horizontally across the wall - that’s our ‘picture rail’. Every week we reprint the classes we have been working on, and replace the old printouts. Then we move them up towards the string, in accordance with how happy we are with the view.

Obviously, this doesn’t replace all the other tools we have for evaluating code quality - e.g. the aforementioned metrics, unit tests, manual QA, and so on. It does, however, make for a brilliant way of tracking our subjective satisfaction with the class. Software quality tools can never completely replace the gut instinct of a developer - you might have massive test coverage, but that won’t help with subjective measures such as code smells. With Wodehouse-style refactoring, we can now easily keep track of which code we are happy with, and which code we remain deeply suspicious of.

As an added benefit, all those pages nicely cover up the hideous wall colour. Bonus!