Last.fm is all about bringing you music you love, and a big part of that is the technical infrastructure required to make it all happen. We take site and service reliability very seriously, which is why we get upset when things go wrong.
And this week things went wrong. It has been a particularly busy week for us, as we’ve had to deal with a large router failure that unfortunately had user-visible impact in a few places. As such, we want to share some of the back story, so you can understand what happened and how we’re working on making things better.
What the Ops team do
Last.fm began life as a small start-up some years ago now, and as is normal for technology-based start-ups, reliability at a large scale wasn’t a big concern at the beginning. Start-ups focus on making things work, and worries about uptime and reliability come later. Reliability also costs money, and adds complexity to systems – it’s easy to inadvertently make a “highly-available” configuration less reliable than a single server. We do a lot today to deliver service reliability that wasn’t part of our earlier architectures, and we can survive many problems with no externally visible signs.
Today, we run Last.fm from multiple separate datacentres, and we build resiliency and failover into all our new systems from the outset. We’re working hard on retrofitting this same level of reliability to all our older systems, though we still have some way to go before everything is where we’d like it to be.
The biggest problem we engineer for is the complete failure of an entire site. That’s a level of problem that we don’t expect to happen often, but we do plan for it and there are many aspects that need to be considered. It’s also the problem we effectively encountered this week, and for the most part everything went according to plan.
The system that failed was a large core router, which provides our cross site connectivity, and half of our internet connectivity. Its failure effectively isolated all the equipment in that datacentre, and caused us a lot of trouble. The system in question is equipped with fully redundant supervisor modules to prevent this sort of problem, but – for reasons that did not become apparent until later – the redundancy also failed.
We initially saw problems with this system a week ago, and carried out both a component swap-out and a reload of the software, which we thought had resolved the problem. When it failed again, our hardware service partners concluded we must be looking at a backplane fault, and shipped us a new chassis.
The backplane in this sort of system is essentially just a passive circuit board, so faults of this kind are most unusual. It wasn’t until we removed the old chassis that we discovered a large amount of grime covering its intake vents, which is not what you expect in a datacentre with large air filtration and cooling systems.
It turns out that some of the air intake for this hosting facility is pulled in fresh off the roof, and the adjacent building houses the exhaust stack from the diesel generators used as backup power. In a suitably ironic fashion, the diesel exhaust was being pulled into the air conditioning system, depositing fine particulates on surfaces, including our hardware.
Where you have equipment with large fan assemblies, this problem is made worse, and the deposits can cause electrical problems, leading otherwise highly reliable equipment to mysteriously fail. The datacentre we use has only recently become aware of this problem, and is taking steps to resolve it, but in the meantime we’ve had to deal with the effects.
What problems did users see?
During these problems, users may have seen a couple of issues. The first and most visible of these will have been radio failures. Our radio infrastructure is cross-site, but currently requires a careful manual failover process for some elements, so you may have been without radio for a period of time. Website traffic and API traffic fail over automatically, so most people won’t have seen any issues with these. Some users will have, though, as the cross-site failover process is DNS based – this means it’s not instant, and ISPs that don’t correctly handle DNS timeouts can cause extended problems. This kind of thing seems to be most common amongst mobile providers.
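For the technically curious, the reason DNS-based failover isn’t instant comes down to caching: resolvers are supposed to re-query once a record’s TTL expires, and a resolver (or ISP) that holds on to stale answers keeps sending clients to the dead site. Here’s a minimal sketch of a TTL-respecting resolver cache – the hostname and TTL value are illustrative, not our real configuration:

```python
import socket
import time

class TtlCache:
    """A client-side resolver cache that honours DNS TTLs.

    Failover via DNS only propagates once cached answers expire,
    so a cache that ignores expiry keeps pointing at the old site.
    """

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._cache = {}  # hostname -> (address, expiry time)

    def resolve(self, hostname):
        entry = self._cache.get(hostname)
        now = time.monotonic()
        if entry and now < entry[1]:
            return entry[0]  # answer still fresh: reuse it
        # Expired or unknown: re-resolve, which is what picks up a failover.
        address = socket.gethostbyname(hostname)
        self._cache[hostname] = (address, now + self.ttl)
        return address
```

A resolver that never re-queries after expiry behaves like this cache with an infinite TTL – which is exactly the extended outage some mobile providers’ customers experienced.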
Some of you will be concerned about your scrobbles – no scrobbles were lost during these issues. Client caching should ensure that any that didn’t make it to our servers will have been queued and resubmitted.
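The client caching mentioned above boils down to a simple idea: queue every scrobble locally, and only remove it from the queue once the server has acknowledged it. This is a hypothetical sketch of that pattern – `submit` here is a stand-in callable, not the real scrobbling API:

```python
from collections import deque

class ScrobbleCache:
    """Queue scrobbles locally and resubmit until the server accepts them."""

    def __init__(self, submit):
        self.submit = submit  # callable returning True on successful delivery
        self.queue = deque()

    def scrobble(self, track):
        self.queue.append(track)  # queue first, so nothing is ever lost
        self.flush()

    def flush(self):
        # Drain the queue in order; stop at the first failure and
        # keep everything remaining for the next attempt.
        while self.queue:
            if not self.submit(self.queue[0]):
                break  # server unreachable: retry later
            self.queue.popleft()  # acknowledged: safe to discard
```

Because a track is only dropped from the queue after a successful submission, an outage on the server side just delays delivery rather than losing plays.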
We’re sorry for any problems you may have seen while we worked on this behind the scenes. We’re constantly working on making the service better, and making these incidents a thing of the past. Thanks for listening!