UPDATE 25/01/11: After a weekend of work Last.fm is now back up to full capacity. If you’re continuing to experience problems then leave a comment below and let us know. Thank you for your patience.
As many of you will be aware, Last.fm has been experiencing an extended period of downtime in all user-facing services. A hardware failure has led to one of the most serious system outages we have experienced for a long time, and we are very sorry about the inconvenience caused to our listeners. At this moment everything should be on its way back to normal, but it could take some time for all services to return to a fully stable state.
We want to apologise for this outage, and explain the problems that have led to the difficulties you may be experiencing now.
Yesterday afternoon a fault in a blade chassis in one of our datacentres caused it to fail, taking down the power supply for its rack with it.
On-site teams were unable to resolve the fault with this chassis, but were able to restore power to the rest of the rack. Unfortunately, this chassis contained several critical components of the top-level load balancing systems we use to evenly distribute traffic across all of our datacentres.
Load balancing has been redistributed among the remaining datacentres. Because these are running under a higher than usual load, intermittent service outages may have resulted, causing problems across all parts of Last.fm. You might be experiencing difficulty with radio, the website, or your scrobbles.
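For the curious, the failover described above can be sketched in a few lines: when one datacentre drops out of the rotation, the survivors absorb its share of traffic. This is a minimal round-robin illustration with made-up datacentre names, not our actual load balancing configuration.

```python
class RoundRobinBalancer:
    """Minimal round-robin balancer: when a datacentre is marked down,
    the remaining ones absorb its share of traffic.
    (Illustrative only; names are hypothetical.)"""

    def __init__(self, datacentres):
        self.healthy = list(datacentres)

    def mark_down(self, dc):
        # Remove a failed datacentre from the rotation.
        if dc in self.healthy:
            self.healthy.remove(dc)

    def next_backend(self):
        # Rotate through whatever is still healthy.
        dc = self.healthy[0]
        self.healthy.append(self.healthy.pop(0))
        return dc

balancer = RoundRobinBalancer(["dc-london", "dc-amsterdam", "dc-backup"])
balancer.mark_down("dc-london")  # chassis failure takes a centre out
requests = [balancer.next_backend() for _ in range(4)]
# Traffic now alternates between the two remaining centres,
# each carrying more load than usual.
```

The extra load on the survivors is exactly why intermittent outages can appear even though the balancer itself keeps working.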
Because some of our top-level DNS services were impacted by this outage, it has taken longer than usual for fresh DNS information to propagate. Many of you may have been presented with incorrect DNS information due to caching, despite the workarounds we have been putting in place. Eventually correct information will propagate in its place.
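The reason stale DNS answers linger comes down to time-to-live (TTL): resolvers keep a cached answer until its TTL expires, regardless of what the origin now says. Here is a toy cache illustrating that behaviour; the hostname is one of ours, but the address is from the reserved documentation range and purely illustrative.

```python
import time

class DnsCache:
    """Toy resolver cache: answers stay valid until their TTL expires,
    which is why a changed record can keep serving the old address."""

    def __init__(self):
        self._records = {}  # name -> (address, expiry_timestamp)

    def put(self, name, address, ttl, now=None):
        now = time.time() if now is None else now
        self._records[name] = (address, now + ttl)

    def get(self, name, now=None):
        now = time.time() if now is None else now
        entry = self._records.get(name)
        if entry is None:
            return None
        address, expiry = entry
        if now >= expiry:
            # TTL elapsed: the stale entry is dropped and a
            # fresh lookup is required.
            del self._records[name]
            return None
        return address  # still within TTL: cached (possibly stale) answer

cache = DnsCache()
cache.put("ws.audioscrobbler.com", "203.0.113.10", ttl=3600, now=0)
# The origin record has since changed, but caches keep serving
# the old answer until the TTL runs out:
stale = cache.get("ws.audioscrobbler.com", now=1800)  # old address
fresh = cache.get("ws.audioscrobbler.com", now=3601)  # None: re-resolve
```

Until every intermediate resolver's TTL expires, some clients will keep getting the old answer, which is why propagation takes time no matter what we do on our side.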
Last.fm’s Operations team has been working since the fault was first reported to provide listeners with short-term workarounds while the affected hardware is replaced or repaired. Our first priority has been to ensure that user data is preserved wherever possible.
We’re doing everything we can to ensure that the site is operating for as many people as possible while the hardware is restored.
Some of you are worried about the status of your scrobbles: all scrobbles that are making it through are safe, and client caching should ensure that any that aren’t are queued and submitted correctly once the service is fully restored. They will appear as normal once the faults have been repaired.
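The caching behaviour works roughly like this: a failed submission stays in a local queue and is retried later, so nothing is lost while the service is down. This is an illustrative sketch, not the actual Last.fm client code.

```python
class ScrobbleQueue:
    """Toy client-side cache: scrobbles that fail to submit are kept
    locally and retried, so nothing is lost during an outage.
    (Illustrative only; not the actual Last.fm client.)"""

    def __init__(self, submit):
        self._submit = submit   # callable returning True on success
        self._pending = []

    def scrobble(self, track):
        self._pending.append(track)
        self.flush()

    def flush(self):
        remaining = []
        for track in self._pending:
            if not self._submit(track):
                remaining.append(track)  # keep cached for the next attempt
        self._pending = remaining

    @property
    def pending(self):
        return list(self._pending)

# Simulate the outage: submissions fail, then the service comes back.
service_up = False
def submit(track):
    return service_up

q = ScrobbleQueue(submit)
q.scrobble("Song A")
q.scrobble("Song B")
cached_during_outage = q.pending  # both tracks held locally

service_up = True
q.flush()  # the queue drains once the service is restored
```

Once the backlog is submitted, the cached scrobbles show up with their original timestamps as if nothing had happened.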
We’d also like to apologise for not communicating more about the problem; as I’m sure you can appreciate, our priority has been getting the issues fixed.