Quality Control

Friday, 1 August 2008
by adrian
filed under About Us

[Suggested listening while reading this post: Quality Control – Jurassic 5]

Prior to moving to London to join Last.fm, I worked on credit card software for a leading international bank. When it comes to dealing with people’s money there isn’t much room for mistakes, and buggy code can have major consequences. For these reasons there were a number of processes and systems in place to reduce the likelihood of software errors.

Despite what some of our more critical users may think, we do actually have a number of similar systems (and some novel additions) in place at Last.fm. We use software like Cacti, Ganglia, Nagios and JMX to monitor many aspects of our running infrastructure, and the results are made available in a number of ways – from coloured graphs to arcanely formatted log files. So much information is churned out that one could easily spend all day staring at the output until one’s mind buckled under the data overload. For this reason we selectively take the most vital data (things like database load, web request times and the uptime status of core machines) and show it on eye-catching displays in our operations room.

Status display screens.
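
To give a flavour of what boiling things down to the most vital data looks like, here is a toy sketch in Python (every metric, threshold and reading below is invented for illustration) of the sort of logic that turns a handful of numbers into colours you can read from across the room:

    # Turn a few vital readings into colours for the big screens.
    # All metrics, thresholds and numbers below are made up for this example.
    THRESHOLDS = {
        "db load":               20.0,  # load average on the database boxes
        "web request time (s)":   0.5,  # mean time to serve a page
        "core machines down":     0,    # anything above zero is bad news
    }

    def colour(metric, value):
        """Green when comfortably under the limit, yellow when close, red when over."""
        limit = THRESHOLDS[metric]
        if value > limit:
            return "red"
        if limit and value > 0.8 * limit:
            return "yellow"
        return "green"

    # Imagine these readings were gathered from Ganglia/Nagios by a collector script.
    readings = {"db load": 7.3, "web request time (s)": 0.62, "core machines down": 0}
    for metric, value in readings.items():
        print("%-22s %6s  %s" % (metric, value, colour(metric, value)))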

The setup shown above makes it easy to look up and get a quick feel for the current state of our systems. Blinking red and graphs with huge spikes are rarely a good thing. In addition to these displays we also have a number of alerts (e-mail, SMS, irccat) that get triggered if things go wrong while we are away from the screens (yes, it does happen). There is nothing quite like the joy of being woken in the early hours of the morning by a barrage of text messages detailing each and every machine that has unexpectedly crashed.
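
irccat deserves a special mention for being pleasingly low-tech: you write a line of text to a TCP socket and it pops up in our IRC channel. A minimal sketch of sending an alert that way (the host, port and message are all made up for this example):

    import socket

    def tell_irccat(message, host="irccat.internal", port=12345):
        """Push a line of text at irccat, which relays it to IRC.
        The host and port here are invented for illustration."""
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.connect((host, port))
        s.sendall((message + "\n").encode("utf-8"))
        s.close()

    # e.g. called from a monitoring hook when a machine stops answering
    tell_irccat("ALERT: web14 has stopped responding to pings")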

While all of this is very useful for keeping an eye on the code while it is running, it’s also good to be able to put the code through some checks and balances before we unleash it on the wider world. One means to this end is the venerable Hudson – a continuous integration engine that constantly builds our software, checks it for style and common coding errors, then instruments and tests it and reports on any violations that may have been introduced since the last time it ran. We have over 30 internal projects that use Hudson and a few thousand tests that run over the code. Hudson comes with a web interface and can be configured to send e-mail when people “break the build” (e.g. by making a change that causes a test to fail). We decided that this wasn’t nearly humiliating enough and followed this suggestion (our setup pictured below) to introduce a more public form of punishment.

The bears that haunt our developers’ nightmares.

These three bears sit in a prominent position and watch our developers’ every move. When things are good, a green bear glows gently and purrs; when changes are being processed, a yellow bear joins the party; and if the build gets broken, the growling, evil red bear makes an appearance. The developer who broke things usually goes a similar shade of red while frantically trying to fix whatever went wrong as the others chortle in the background.
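
For anyone wondering how a teddy bear finds out about a broken build: Hudson exposes the status of every job over HTTP, so the logic amounts to little more than “poll, pick the worst colour, light the matching bear”. A rough sketch (2008-vintage Python 2; the hostname is invented and the actual bear-poking hardware is reduced to a print):

    import time
    import urllib2                    # Python 2, as was current at the time
    import json                       # in the standard library from 2.6; simplejson before that

    HUDSON = "http://hudson.internal:8080"   # invented hostname

    def build_colour():
        """Collapse the state of every Hudson job into a single bear colour.
        Hudson reports each job's status in a 'color' field: 'blue' for a good
        build, 'red' for a broken one, with an '_anime' suffix while building."""
        jobs = json.load(urllib2.urlopen(HUDSON + "/api/json"))["jobs"]
        colours = [job["color"] for job in jobs]
        if any(c.startswith("red") for c in colours):
            return "red"
        if any(c.endswith("_anime") for c in colours):
            return "yellow"
        return "green"

    while True:
        # In real life this is where the bear hardware would get poked;
        # here we just print which bear ought to be glowing.
        print(build_colour())
        time.sleep(30)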

Amid all this hi-tech digital trickery, it is sometimes nice to cast one’s mind back to the simpler analogue age and the measuring devices of the past. For example, we hooked up an analogue meter of the kind used in many industries for decades, fed it a rather different input signal, and ended up with a literal desktop dashboard that measures average website response time.

Web response time meter.
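
The software half of this is tiny. As a sketch, assume the needle hangs off an Arduino-style board on a serial port that turns a single byte (0–255) into a needle position, and that response times land in a log file with one number per line – both assumptions made purely for illustration:

    import serial   # pyserial

    LOG = "/var/log/web/response_times.log"   # hypothetical log: one response time (seconds) per line
    FULL_SCALE = 2.0                          # average response time (seconds) that pins the needle – made up

    def average_response_time(path, window=1000):
        """Average the most recent `window` response times from the log."""
        times = [float(line) for line in open(path).readlines()[-window:]]
        return sum(times) / len(times)

    meter = serial.Serial("/dev/ttyUSB0", 9600)   # assumed device name and baud rate
    reading = min(255, int(255 * average_response_time(LOG) / FULL_SCALE))
    meter.write(chr(reading))   # one byte out; chr() because this is Python 2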

It is strangely mesmerising to see this meter rev up and down as website demand changes over the day (or we manage to overload our data centre’s power supply and a significant portion of our web farm gets to take an unexpected break from service).

On the whole we have a great variety of options for keeping our eyes on the quality prize, thanks in no small measure to the efforts of the open source community who crafted all the software I have mentioned. Of course the biggest challenge to ensuring quality is still the human component – getting people to actually use these tools and instilling the desire and motivation to make software as bug-free as possible. If any of you out there use similar tools that you are passionate about, let us know. I’d also love to hear if anyone has other amusing or original systems for keeping quality control fun and fresh. As for me, I’ve got a glowing green bear to keep me company….