balance.fm

Friday, 2 March 2012
by Marcus
filed under Code and Announcements
Comments: 6

The open source tool balance is an essential part of the service infrastructure here at Last.fm. Multiple instances of balance run on every web server node, on the various production backend servers, and on our development machines. So at any given time there are probably thousands of instances running simultaneously across our machines.

What does it do?

balance is a so-called load balancer. It is generally used as a proxy to distribute a large number of incoming requests to a group of servers. In other words it is responsible for balancing the load between all the servers in a group. Quite often, load balancers are dedicated hardware products. However, balance is a software load balancer, which means it can just run as an additional program on any server.
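To make the idea concrete, here is a minimal Python sketch of round-robin distribution, one common way to spread requests over a group. The server names and the `assign_round_robin` helper are made up for illustration, and this is not balance's own code (balance is written in C).

```python
from itertools import cycle

# Hypothetical backend group; the names are placeholders.
SERVERS = ["app1:8080", "app2:8080", "app3:8080"]

def assign_round_robin(servers, request_count):
    """Return the server chosen for each of `request_count` requests, in rotation."""
    rotation = cycle(servers)
    return [next(rotation) for _ in range(request_count)]

print(assign_round_robin(SERVERS, 5))
```

Each server receives every third request, so no single machine has to carry the whole load.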

In addition to load balancing, balance also supports a scheme called failover. This means you can define a second group of servers and balance will route requests to the second group if all servers in the first group fail. This failover scheme is used by most of our backend services at Last.fm. We usually have a main server and a backup server that kicks in once the main server fails.
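The failover rule described above can be sketched in a few lines of Python. This is only an illustration under assumed names (`pick_group`, `is_up` and the host names are hypothetical), not how balance implements it internally.

```python
def pick_group(groups, is_up):
    """Return the live servers of the first group that has at least one.

    `groups` is an ordered list of server lists (main group first, then
    backups); `is_up` is a callable that reports whether a server is usable.
    """
    for group in groups:
        alive = [server for server in group if is_up(server)]
        if alive:
            return alive
    return []  # every group has failed

# Hypothetical setup: one main server and one backup server.
groups = [["main:9000"], ["backup:9000"]]
down = {"main:9000"}
print(pick_group(groups, lambda server: server not in down))
```

The backup group is only consulted once every server in the main group is unavailable, which matches the main/backup arrangement described above.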

End of story?

Certainly not! There are some subtleties in the use of balance that have given us headaches in the past. By far the biggest problem is that there are cases when failover just doesn’t work right in our environment. So here’s a real example…

One day we had to take down the main server for one of our backend services to replace a hard drive. The backup server was running fine and we relied on balance to take care of routing all requests through to the backup box. Unfortunately, shortly after the main server went down, we noticed that most requests to the service failed.

What had happened? balance has a configurable connect timeout, i.e. it tries to connect to a service and then waits for a certain amount of time before it concludes that it can’t connect. If the server machine is running but the service itself is unavailable, the connect fails almost instantly. However, if the machine is down, balance waits until the connect timeout has elapsed. So in our case, balance was trying to connect to the main server (which was down) and then waiting for 5 seconds before attempting to connect to the backup server. In the meantime, the client had already given up (it was using a much smaller timeout). balance would only notice that the client had given up once it had established the connection to the backup server. The next time the client tried to connect, the same thing would happen all over again.
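The timing problem can be captured in a small model. The 5 second connect timeout comes from the incident above; the 1 second client timeout is an assumption chosen for illustration (the real value was only "much smaller"), and both function names are hypothetical.

```python
def seconds_until_backup(main_host_up, connect_timeout):
    """Time spent on the main server before the backup is tried.

    Host up but service down: the connect is refused almost instantly
    (modelled here as 0 seconds). Host down: nothing answers, so the
    full connect timeout has to elapse first.
    """
    return 0.0 if main_host_up else connect_timeout

def request_succeeds(main_host_up, connect_timeout, client_timeout):
    """The request only works if the backup is reached before the client gives up."""
    return seconds_until_backup(main_host_up, connect_timeout) < client_timeout

BALANCE_CONNECT_TIMEOUT = 5.0  # seconds, from the incident described above
CLIENT_TIMEOUT = 1.0           # seconds, assumed for illustration

print(request_succeeds(True, BALANCE_CONNECT_TIMEOUT, CLIENT_TIMEOUT))
print(request_succeeds(False, BALANCE_CONNECT_TIMEOUT, CLIENT_TIMEOUT))
```

With the main machine powered off, every single request pays the full connect timeout, so no request ever reaches the backup in time.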

But someone else would certainly have had the same problem before?

I’m quite sure of that. And I guess that’s what caused the autodisable feature to be added to balance. When this feature is used, balance automatically disables servers that it fails to connect to. The downside, though, is that there’s no way to automatically enable servers again. Manually enabling them isn’t really an option given the number of instances of balance we’re running, and autodisable could leave all servers permanently disabled after, for example, a temporary network failure.

So what now?

We had to face the fact that in theory we had a really nice redundancy scheme, but it could fail quite miserably in practice. So I began to look around for alternatives to balance and found a couple of other open source load balancers. Sadly, all of them had either been abandoned by their authors, failed to build out of the box or just didn’t fulfill our requirements.

balance was actually just what we needed. The only thing it was missing was support for monitoring all backend connections and dynamically disabling and enabling them as they fail or pass the monitoring checks.
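As a rough sketch of that idea (not the balance.fm implementation, which is written in C), the Python below probes each backend and flips it between enabled and disabled according to the latest check. The `Monitor` class, the `tcp_probe` helper and the 0.5 second probe timeout are all assumptions made for this example.

```python
import socket

def tcp_probe(host, port, timeout=0.5):
    """Return True if a TCP connection can be opened within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

class Monitor:
    """Tracks which backends passed their most recent check."""

    def __init__(self, backends, probe=tcp_probe):
        self.probe = probe
        self.enabled = {backend: True for backend in backends}

    def check_all(self):
        # Unlike a one-way autodisable, a backend that passes the probe
        # again is re-enabled on the next round of checks.
        for host, port in list(self.enabled):
            self.enabled[(host, port)] = self.probe(host, port)

    def usable(self):
        """Backends that requests may currently be routed to."""
        return [backend for backend, ok in self.enabled.items() if ok]
```

In a real deployment `check_all` would run periodically in the background, so a dead main server is skipped before any client request has to wait on it, and it comes back into rotation on its own once it recovers.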

So eventually I started looking into adding exactly that functionality to balance.

balance.fm

Implementing monitoring for balance was relatively straightforward, even though it made me aware of how much I had gotten used to developing software in C++. With balance being written in pure C, I was really missing exception handling and the C++ standard library.

The number of code changes was massive considering balance’s rather small code base. As of now, more than a thousand lines of code have changed and another thousand lines have been added. So we decided to fork the original project and rebrand it as balance.fm.

It took about a week to refactor the existing code and finally add the monitoring feature. Quite a few bugs were fixed along the way (for details, have a look at the commit log if you’re interested), and I hope these fixes make up for all the bugs that I’ve undoubtedly introduced by adding loads of new code.

The balance.fm code has since been reviewed by the MIR team here at Last.fm and is available from github.com/lastfm/balance.fm.

If you have an application for balance.fm, please give it a try and let us know what you think, and what you like or dislike about it!

Comments

  1. Steff Davies
    2 March, 17:27

    polite ripple of applause from the Ops room

  2. Sid
    2 March, 22:31

    This is why I love open source. Thanks guys

  3. Paul Sturgess
    3 March, 18:01

    Very interesting article. Have you guys considered setting up a chaos monkey?

    http://www.codinghorror.com/blog/2011/04/working-with-the-chaos-monkey.html

  4. Andy
    14 March, 01:03

    The open source tool balance is an interesting service. I’m an italain boy.. i would to know if this part of the service infrastructure here at Last.fm will be localized in italian.. thanks :-)

  5. shkaf kupe
    16 March, 04:28

    open source – cool

  6. JoomLAB
    31 March, 20:54

    another best open source;) Thanks people, i love your job;)

