Last.fm is all about bringing you music you love, and a big part of that is the technical infrastructure required to make it all happen. We take site and service reliability very seriously, which is why we get upset when things go wrong.
And this week things went wrong. It has been a particularly busy week for us, as we’ve had to deal with a large router failure which has unfortunately had user-visible impact in a few places. As such, we want to share some of the back story, so you can understand what happened and how we’re working on making things better.
What the Ops team do
Last.fm began life as a small start-up some years ago now, and as is normal for technology-based start-ups, reliability at scale wasn’t a big concern at the beginning. Start-ups focus on making things work; worries about uptime and reliability come later. Reliability also costs money and adds complexity to systems, and it’s easy to inadvertently make a “highly available” configuration less reliable than a single server. We do a lot today to deliver service reliability that wasn’t part of our earlier architectures, and we can survive many problems with no externally visible signs.
Today, we run Last.fm from multiple separate datacentres, and we build resiliency and failover into all our new systems from the outset. We’re working hard on retrofitting this same level of reliability to all our older systems, though we still have some way to go before everything is where we’d like it to be.
The biggest problem we engineer for is the complete failure of an entire site. That’s a level of problem that we don’t expect to happen often, but we do plan for it and there are many aspects that need to be considered. It’s also the problem we effectively encountered this week, and for the most part everything went according to plan.
What happened
The system that failed was a large core router, which provides our cross site connectivity, and half of our internet connectivity. Its failure effectively isolated all the equipment in that datacentre, and caused us a lot of trouble. The system in question is equipped with fully redundant supervisor modules to prevent this sort of problem, but – for reasons that did not become apparent until later – the redundancy also failed.
We initially saw problems with this system a week ago, and carried out both a component swap-out and a reload of the software, which we thought had resolved the problem. When it failed again, our hardware service partners concluded we must be looking at a backplane fault, and shipped us a new chassis.
The backplane in this sort of system is essentially just a passive circuit board, so faults of this kind are most unusual. It wasn’t until we removed the old chassis that we discovered a large amount of grime covering its intake vents, which is not what you expect in a data centre with large air filtration and cooling systems.
It turns out that some of the air intake for this hosting facility is pulled in fresh off the roof, and the adjacent building houses the exhaust stack from the diesel generators used as backup power. In a suitably ironic fashion, the diesel exhaust was being drawn into the air conditioning system, depositing fine particulates on surfaces, including our hardware.
Where you have equipment with large fan assemblies, this problem is made worse, and the deposits can cause electrical problems, leading otherwise highly reliable equipment to mysteriously fail. The datacentre we use has only recently become aware of this problem, and is taking steps to resolve it, but in the meantime we’ve had to deal with the effects.
What problems did users see?
During these problems, users may have seen a couple of issues. The first and most visible will have been radio failures. Our radio infrastructure is cross-site, but some elements currently need a careful manual failover process, so you may have been without radio for a period of time. Web site and API traffic fails over automatically, so most people won’t have seen any issues there. Some users will have, though, as the cross-site failover process is DNS based: it isn’t instant, and ISPs that don’t correctly handle DNS timeouts can cause extended problems. This seems to be most common amongst mobile providers.
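To give a rough idea of why DNS-based failover takes time, here’s a minimal sketch in Python (using the third-party dnspython package). The hostname and behaviour are illustrative assumptions only, not our actual failover tooling: a well-behaved resolver re-uses an answer until its TTL expires and only then re-queries, while a resolver that ignores the TTL keeps handing out the address of the failed site long after we’ve repointed the name.

# Sketch of TTL-honouring DNS resolution (illustrative only).
import time
import dns.resolver  # third-party "dnspython" package

_cache = {}  # hostname -> (address, expiry time)

def resolve(hostname):
    # Re-use the cached answer until its TTL runs out, as a well-behaved
    # resolver should. Until the TTL expires, a failover to the other
    # site is invisible to this client.
    address, expires = _cache.get(hostname, (None, 0))
    if time.time() < expires:
        return address
    answer = dns.resolver.resolve(hostname, "A")
    address = answer[0].to_text()
    _cache[hostname] = (address, time.time() + answer.rrset.ttl)
    return address

# A resolver that caches the old record far beyond its TTL (as some ISP
# and mobile resolvers do) will keep sending clients to the failed
# datacentre well after the DNS change has been made.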
Some of you will be concerned about your scrobbles – no scrobbles were lost during these issues. Client caching should ensure that any that didn’t make it to our servers will have been queued and resubmitted.
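For the curious, the client-side caching works roughly like the sketch below (Python). The file name, data format and the submit_scrobble() placeholder are assumptions for illustration, not the actual client code: every play is queued locally first and only removed once the server acknowledges it, so an outage merely delays submission rather than losing plays.

# Rough sketch of client-side scrobble caching (illustrative only).
import json
import time
from collections import deque

def submit_scrobble(item):
    # Placeholder for the real network call; the actual client would POST
    # the scrobble to the API and return True only on acknowledgement.
    return False

class ScrobbleQueue:
    def __init__(self, path="scrobble_queue.json"):
        self.path = path
        try:
            with open(self.path) as f:
                self.queue = deque(json.load(f))
        except (FileNotFoundError, ValueError):
            self.queue = deque()

    def _save(self):
        with open(self.path, "w") as f:
            json.dump(list(self.queue), f)

    def scrobble(self, artist, track):
        # Queue first, then try to send: if the servers are unreachable,
        # the play waits on disk until a later flush succeeds.
        self.queue.append({"artist": artist, "track": track,
                           "timestamp": int(time.time())})
        self._save()
        self.flush()

    def flush(self):
        # Drain the queue while submissions succeed; stop on the first
        # failure and keep the remainder for next time.
        while self.queue and submit_scrobble(self.queue[0]):
            self.queue.popleft()
            self._save()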
We’re sorry for any problems you may have seen while we worked on this behind the scenes. We’re constantly working on making the service better, and making these incidents a thing of the past. Thanks for listening!
Comments
Jan Kuča
25 May, 15:52
Hi, I think you are having problems with e-mail delivery. My verification e-mail just won’t arrive. I already registered twice and tried like 7 different e-mail addresses.
Could you please investigate this? Thank you.
Jon Hallier
25 May, 16:14
Hi Jan,
The blog is probably not the best place for posting support issues.
Can you please contact us at: http://www.last.fm/help/support
Be sure to include your email address and the username you’re trying to register. Thanks.
Mass Dosage
25 May, 16:20
Suggested listening while reading this blog post – “Diesel Power” by The Prodigy.
Jan Kuča
25 May, 17:25
Yes, I’m aware of that. I already tweeted you a week ago and submitted this via the support contact form in your help section.
However, I cannot be sure that I’ll get an answer since the problem can be (and probably is) on your server and if you have a ticket system running on the server, any response you send is likely to end up undelivered.
I posted this here since this post addressed issues of this week.
Tom Allender
25 May, 20:53
Which datacentre had the problem?
hardcoreb0y
25 May, 21:18
Thank you for the explanation
Henk Poley
26 May, 05:39
Nobody noticed the oil smell?
Raj
27 May, 12:06
yeah technical stuff is really difficult to understand & that’s great that you all guys know what you are doing & fixed it
E-Clect-Eddy
27 May, 21:37
hehe, Just now, some scrobbles come in from 24th :-)
Rick
31 May, 16:33
Kudos to you for the candor on the cause and your work, and thanks to the Ops team for working hard to resolve it.
CrybKeeper
21 June, 12:47
Diesel generators? How often would those be running anyway? Okay, no sneaking up on the roof for a smoke, from now on guys, lol =)
eimkeith
22 June, 12:25
Yeah, I’m still down – June 22, 2011
Roger Witte
23 June, 17:27
Thanks for all your hard work – but you are in a pretty polluted environment even without the diesel generators; City Road carries one hell of a lot of traffic – it might be worth moving your data centres away from your offices. There is kind of a contradiction between the need to place the data centre away from the pollution of the city and the need to place it as close as possible to the net backbone (the square mile has the lowest net latency of any area in the UK).