Last.fm starts the summer early

Wednesday, 4 May 2011
by Helen Taylor
filed under About Us and Announcements
Comments: 3

Since 1995, Camden Crawl has established itself as the May Day Bank Holiday weekend’s hottest ticket, and even though it had competition from the Royal Wedding this year it was a great way of kicking our live series Last.fm Presents into gear for the summer season.

The baroque setting of Koko was the venue for our own stage, treating Camden to a cocktail of seven UK acts rising up the Hype Charts. First up was Dinosaur Pile-Up, a band touched by the hand of grunge, whose track “My Rock ‘n’ Roll” proved a mission statement for the night. Lethal Bizzle is a star of Last.fm’s grime tag, and he embraced the spirit of Camden Crawl with a stage dive ahead of indie rockers Mazes, who’ve featured heavily in our Hype Chart over the past month.

British Sea Power transformed the Last.fm Presents stage into an ode to nature: foliage sprung up as footage of sea birds played. Epic favourites such as “Waving Flags” gained the biggest reception, before the sets took a turn for the electronic with the last two acts.

Simian Mobile Disco proved that knob-twiddling needn’t be a static affair, and the crowd agreed – four levels of Koko got down to “Audacity of Huge” and “Hustler” – while Hudson Mohawke excelled as last performer of the evening, his mix of R’n’B vocals and groundshaking bass making for a warped electro trip into the night.

If you’re feeling like you missed out on a brilliant night, well, you sort of did. No fear though, Last.fm Presents have a packed festival season ahead this summer.

If you are heading down to (deep breath) The Great Escape, ATP, Liverpool Sound City, Get Loaded, Sonisphere, Rock Werchter, Truck, Field Day, Underage, Summer Sundae or SW4 then keep an eye out for our LFM lobbyists who’ll be ready to shower you with Last.fm goodies, including our tag stickers.

It’s going to be a great summer!

Live in Austin

Thursday, 10 March 2011
by Stefan Baumschlager
filed under Announcements and About Us
Comments: 11

Every spring the music industry descends upon the capital city of Texas to celebrate music in its many facets, genres & tags even. You see where this is going don’t you?

After a two year hiatus we’re bringing back the live SXSW tagging bonanza so that you can go nuts across town with those little red tag stickers. Here’s something to refresh your memory:

It’s simple, really: whenever you see someone with sticker sheets in hand, ask them to give you a couple so you can share them with your friends and start tagging the real-world SXSW.

The fun doesn’t stop there of course! We encourage you to take pictures of your guerrilla tagging, upload your pics to flickr and tag them with ‘tagsxsw2011’ and ‘lastfm:event=1732494’ (that’s right we’re talking about flickr tags now – keep up!).

We’ve also updated the Band Aid group page so that you can easily find the bands you’d be crazy to miss this year! Enter your Last.fm username and you’re on your way.

If you want to browse the full line-up, as well as your recommended line-up, just head to the SXSW 2011 Festival Page. Remember: the bands with the little burning flame icons next to them are the – yes – hot ones, destined for big, big things in 2011 and beyond.

Finally, we’ve got a little mission for you: SXSW always has tons of official showcases and shows, but equally there’s a plethora of unofficial shows in someone’s backyard. If you happen to see that we’re missing bands you know are performing in some way, shape or form at this year’s SXSW, please take two minutes to add them to the line-up.

Thank you, and see you in Austin!

PS: if you want to get in touch while I’m out there, please do; follow @baumschlager on Twitter.

Launching Xbox, Part 2 - SSD Streaming

Monday, 14 December 2009
by Mike Brodbelt
filed under About Us and Tips and Tricks
Comments: 18

This is the second in a series of posts from the Last.fm engineering team covering the geekier aspects of our recent Last.fm on Xbox LIVE launch. Part one (“The War Room”) is here.

The music streaming architecture at Last.fm is an area of our infrastructure that has been scaling steadily for some time. The final stage of delivering streams to users fetches the raw mp3 data from a MogileFS distributed file system before passing it through our audio streaming software, which handles the actual audio serving. There are two main considerations with this streaming system: physical disk capacity, and raw IO throughput. The number of random IO operations a storage system can support has a big effect on how many users we can serve from it, so this number (IOPS) is a metric we’re very interested in. The disk capacity of the cluster has effectively ceased being a problem with the capacities available from newer SATA drives, so our biggest concern is having enough IO performance across the cluster to serve all our concurrent users. To put some numbers on this, a single 7200rpm SATA drive can produce enough IOPS to serve around 300 concurrent connections.
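To put those figures in context, here’s a back-of-the-envelope sketch of how drive count translates into serving capacity. The ~300 concurrent connections per 7200rpm SATA drive is the number quoted above; the cluster sizes are purely illustrative.

```python
# Back-of-the-envelope capacity model. The ~300 concurrent connections per
# 7200rpm SATA drive figure is from the post; cluster sizes are illustrative.

CONNS_PER_SATA_DRIVE = 300

def cluster_capacity(num_machines: int, drives_per_machine: int) -> int:
    """Rough upper bound on concurrent listeners a disk-based cluster can serve."""
    return num_machines * drives_per_machine * CONNS_PER_SATA_DRIVE

# e.g. ten SC846-style machines with 24 drives each:
print(cluster_capacity(10, 24))  # 72000
```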

We’ve been using MogileFS for years at Last.fm, and it’s served us very well. As our content catalogue has grown, so has our userbase. As we’ve added storage to our streaming cluster, we’ve also been adding IO capacity in step with that, since each disk added into the streaming cluster brings with it more IOPS. From the early days, when our streaming machines were relatively small, we’ve moved up to systems built around the Supermicro SC846 chassis. These provide cost effective high-density storage, packing 24 3.5” SATA drives into 4U, and are ideal for growing our MogileFS pool.

Changing our approach

The arrival of Xbox users on the Last.fm scene pushed us to do some re-thinking on our approach to streaming. For the first time, we needed a way to scale up the IO capacity of our MogileFS cluster independently of the storage capacity. Xbox wasn’t going to bring us any more content, but was going to land a lot of new streaming users on our servers. So, enter SSDs…

Testing our first SSD based systems

We’d been looking at SSDs with interest for some time, as IO bottlenecks are common in any infrastructure dealing with large data volumes. We hadn’t deployed them in any live capacity before though, and this was an ideal opportunity to see whether the reality lived up to the marketing! Having looked at a number of SSD specs and read about many of the problems early adopters had encountered, we felt as though we were in a position to make an informed decision. So, earlier this year, we managed to get hold of some test kit to try out. Our test rig was an 8-core system with two X5570 CPUs and 12 GB of RAM (a SunFire X4170).

Into this, we put two hard disks for the OS and four Intel X25-E SSDs.

We favoured the Intel SSDs because they’ve had fantastic reviews, and they were officially supported in the X4170. The X25-E drives advertise in excess of 35,000 read IOPS, so we were excited to see what they could do, and in testing, we weren’t disappointed. Each single SSD can support around 7,000 concurrent listeners, and the serving capacity of the machine topped out at around 30,000 concurrent connections in its tested configuration – here it is halfway through a test run:

Spot which devices are the SSDs… (wider image here)

At that point its network was saturated, which was causing buffering and connection issues, so with 10GigE cards it might have been possible to push this configuration even higher. We tested both the 32 GB versions (which Sun have explicitly qualified with the X4170) and the 64 GB versions (which they haven’t). We ended up opting for the 64 GB versions, as we needed to be able to get enough content onto the SSDs to serve a good number of user requests – otherwise all that IO wasn’t going to do us any good. To get these performance figures, we had to tune the Linux scheduler defaults a bit:

# use the no-op IO scheduler – there is no seek penalty to optimise around on an SSD
echo noop > /sys/block/sda/queue/scheduler
# reduce readahead, since speculative extra reads are no longer free
echo 32 > /sys/block/sda/queue/read_ahead_kb

This is set for each SSD – by default Linux uses scheduler algorithms that are optimised for hard drives, where each seek carries a penalty, so it’s worth reading extra data in while the drive head is in position. There’s basically zero seek penalty on an SSD, so those assumptions fall down.
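As a sketch of what that tuning amounts to across a whole machine, here’s a hypothetical Python helper that generates the sysfs writes for a list of SSDs. The device names are invented, and actually applying the writes needs root.

```python
# Hypothetical helper that generates the sysfs tuning writes for several SSDs.
# Device names are invented; applying the writes requires root.

def ssd_tuning_writes(devices):
    """Return (path, value) pairs: noop scheduler plus reduced readahead."""
    writes = []
    for dev in devices:
        writes.append((f"/sys/block/{dev}/queue/scheduler", "noop"))
        writes.append((f"/sys/block/{dev}/queue/read_ahead_kb", "32"))
    return writes

# print the equivalent shell commands rather than writing to /sys directly
for path, value in ssd_tuning_writes(["sdc", "sdd", "sde", "sdf"]):
    print(f"echo {value} > {path}")
```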

Going into production

Once we were happy with our test results, we needed to put the new setup into production. Doing this involved some interesting changes to our systems. We extended MogileFS to understand the concept of “hot” nodes – storage nodes that are treated preferentially when servicing requests for files. We also implemented a “hot class” – when a file is put into this class, MogileFS will replicate it onto our SSD based nodes. This allows us to continually move our most popular content onto SSDs, effectively using them as a faster layer built on top of our main disk based storage pool.
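In spirit, the hot class boils down to ranking content by popularity and promoting whatever fits on the SSD tier. A toy sketch of that policy – the play counts and capacity measure are invented, and this is not the actual MogileFS code:

```python
# Toy version of the "hot class" policy: rank tracks by popularity and
# promote as many as the SSD tier can hold. Counts and capacity are invented.

def pick_hot_files(play_counts, ssd_slots):
    """Return the ids of the most-played files, up to the SSD tier's capacity."""
    ranked = sorted(play_counts, key=play_counts.get, reverse=True)
    return ranked[:ssd_slots]

print(pick_hot_files({"track_a": 5, "track_b": 10, "track_c": 1}, 2))
# ['track_b', 'track_a']
```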

We also needed to change the way MogileFS treats disk load. By default, it looks at the percentage utilisation figure from iostat, and tries to send requests to the most lightly-loaded disk with the requested content. This is another assumption that breaks down when you use SSDs, as they do not suffer from the same performance degradation under load that a hard drive does; a 95% utilised SSD can still respond many times faster than a 10% utilised hard drive. So, we extended the statistics that MogileFS retrieves from iostat to also include the wait time (await) and the service time (svctm) figures, so that we have better information about device performance.
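Here’s a much-simplified, illustrative stand-in for that selection logic. The real change lives inside MogileFS itself; the iostat-style figures below are invented.

```python
# Much-simplified stand-in for the patched device selection. The real change
# lives inside MogileFS; the iostat-style figures here are invented.

def pick_device(devices):
    """Prefer the device with the lowest service time, then lowest wait time.

    Percentage utilisation alone would wrongly favour a nearly idle hard
    drive over a busy - but far faster - SSD.
    """
    return min(devices, key=lambda d: (d["svctm"], d["await"]))

devices = [
    {"name": "hdd1", "util": 10.0, "await": 8.0, "svctm": 6.0},  # quiet HDD
    {"name": "ssd1", "util": 95.0, "await": 0.4, "svctm": 0.2},  # busy SSD
]
print(pick_device(devices)["name"])  # ssd1
```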

Once those changes had been made, we were ready to go live. We used the same hardware as our final test configuration (SunFire X4170 with Intel X25-E SSDs), and we are now serving over 50% of our streaming from these machines, which have less than 10% of our total storage capacity. The graph below shows when we initially put these machines live.

You can see the SSD machines starting to take up load on the right of the graph – this was with a relatively small amount of initial seed content, so the offload from the main cluster was much smaller than we’ve since seen after filling the SSDs with even more popular tracks.

Conclusions

We all had great fun with this project, and built a new layer into our streaming infrastructure that will make it easy to scale upwards. We’ll be feeding our MogileFS patches back to the community, so that other MogileFS users can make use of them where appropriate and improve them further. Finally, thanks go to all the people who put effort into making this possible – all of the crew at Last.HQ, particularly Jonty for all his work on extending MogileFS, and Laurie and Adrian for lots of work testing the streaming setup. Also thanks to Andy Williams and Nick Morgan at Sun Microsystems for getting us an evaluation system and answering lots of questions, and to Gareth Tucker and David Byrne at Intel for their help in getting us the SSDs in time.

Launching Xbox, Part 1 - The War Room

Monday, 7 December 2009
by Laurie Denness
filed under About Us and Tips and Tricks
Comments: 16

As many of you noticed, a few weeks ago we launched Last.fm on Xbox LIVE in the US and UK. It probably goes without saying that this project was a big operation for us, taking up a large part of the team’s time over the last few months. Now that the dust has settled, we thought we’d write a short series of blog posts about how we prepared for the launch and some of the tech changes we made to ensure that it all went smoothly.

0 Hour: Monitoring.

First up, let me introduce myself. My name is Laurie and I’ve been a Sysadmin here at Last.fm for almost two and a half years now. As well as doing the usual sysadmin tasks (turning things off and on again) I also look after our monitoring systems, including a healthy helping of Cacti, a truck of Nagios and a bucket-load of Ganglia. Some say I see mountains in graphs. Others say my graphs are in fact whales. But however you look at it, I’m a strong believer in “if it moves, graph it”.

To help with our day-to-day monitoring we use four overhead screens in our operations room, with a frontend for Cacti (CactiView) and Nagios (Naglite2) that I put together. This works great for our small room, but we wanted something altogether more impressive — and more importantly, useful — for the Xbox launch.

At Last.HQ we’re big fans of impressive launches. Not a week goes by without us watching some kind of launch, be it the Large Hadron Collider, or one of the numerous NASA space launches.

We put a plan into action late on Monday evening (the night before launch), and it quickly turned into a “How many monitors can you fit into a room” game. In the end though, being able to see as many metrics as possible became useful.

So, ladies and gentlemen…

Welcome to the war room

Every spare 24” monitor in the office, two projectors, a few PCs and an awesome projector clock for a true “war room” style display (and to indicate food time).

Put it together and this is what you get:


Coupled with a quickly thrown-together Last.fm-style NASA logo (courtesy of our favourite designer), we were done. And this is where we spent 22 hours on the day of the launch, staring at the graphs, maps, alerts, Twitter feeds… you name it, we had it.

It was pretty exciting to sit and watch the graphs climb higher and higher, and watch the twists and turns as entire areas of the world woke up, went to work, came back from work (or school) and went to sleep. We had conference calls with Microsoft to make sure everything was running smoothly and share the latest exciting stats. (Half a million new users signed up to Last.fm through their Xbox consoles in the first 24 hours!)

As well as the more conventional style graphs, we also had some fun putting together some live numbers to keep up to speed on things in a more real-time fashion. This was a simple combination of a shell script full of wizardry to get the raw number, piped through the unix tools “figlet” (which makes “bubble art” from standard text) and “cowsay” (which produces an ASCII cow with a speech bubble saying whatever you please).
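For the curious, the same trick can be wrapped in a few lines of Python. This assumes the figlet and cowsay binaries are installed, and the stat number is made up.

```python
# The figlet-into-cowsay trick wrapped in Python. Requires the "figlet" and
# "cowsay" binaries; the number announced is made up.

import shutil
import subprocess

def announce(number: int) -> str:
    """Render a number as big ASCII art inside a cow's speech bubble."""
    big = subprocess.run(["figlet", str(number)], capture_output=True,
                         text=True, check=True).stdout
    # cowsay -n reads the message from stdin without re-wrapping it
    return subprocess.run(["cowsay", "-n"], input=big, capture_output=True,
                          text=True, check=True).stdout

if shutil.which("figlet") and shutil.which("cowsay"):
    print(announce(500000))
else:
    print("figlet and cowsay not installed")
```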

Looking after Last.fm on a daily basis is a fun task with plenty of interesting challenges. But when you’ve spent weeks working 12-hour days and weekends, it really pays to sit back in a room with all your co-workers (and good friends!) and watch people enjoy it. Your feedback has been overwhelming, and where would we have been without Twitter to tell us what you thought in real time?

Coming Next Time

We had to make several architectural changes to our systems to support this launch, from improved caching layers to modifying the layout of our entire network. Watch this space soon for the story of how SSDs saved Xbox…

Tales of a data team intern

Friday, 26 June 2009
by Fredrik Möllerstrand
filed under About Us
Comments: 13

For the last six months, I have been a part of Last.fm’s data team while writing my master’s thesis together with Per – and lived to tell the tale.

Presently, I’m sitting in a quiet room where the heat makes more noise than do the people in it. See, music is a big thing here and the hackers on my team mostly prefer to code with their ears hugged by earphones. It wasn’t always this hot, though.

Back in January, Per and I landed in a chilly London where the people wore hats and greeted one another in charming ways. We were on a mission: write a data store that serves as back-end to an in-house visualization tool that renders graphs out of interesting numbers: scrobbles per king of pop, subscription signups broken down by country, that sort of thing.

For someone who digs distributed systems and Big Data, Last.fm is heaven. There is a sizable Hadoop cluster and many more terabytes of data than one can comfortably fathom. At the end of our tenure – and this is the best part – we were to release the code as open source. Sure enough, we had landed our dream gig.

The offices have got that typical Shoreditch media/tech post-startup vibe going: there are copies of The Economist and Wired in the foyer, a flipped-over skateboard lies next to the umbrella rack. The company occupies a whole floor which is split into two sides: the bizniz lot occupies one end (this is where the fax machine is) and the dev teams & ops (nerf guns at the ready) rule the other end. Working hours are generally between ten and seven, so coming in at eight o’clock sharp on your first day is definitely advised against. Not that anyone would.

As those who went before me have noted, Last.fm is a rather awesome place to work. There are some seriously brilliant heads here. Not only are the staff enthusiastic and contagiously dedicated, they also take great pride in their work – and that, coupled with passion, is key to producing quality stuff.

Per and I were given an insane amount of freedom in implementing our data store. While we did get all the help we needed, both from our closest collaborators as well as anyone else around the office who we harassed with questions, all decisions regarding the project and its execution were ours to make. At the end of the day, we were ourselves fully responsible for our own fortunes, and I think it is only from that kind of freedom and trust that truly brilliant things can come. And how did we fare then? Quite well, thank you. Zohmg is out there, and although it may not change the world just this year it might make the life of a data analyst or two a tad easier.

All in all it has been a killer internship experience: I got to present at HUGUK, became involved with HBase and met people who have instilled inspiration in me that will last a long time. I have realized the benefits of working alongside incredibly passionate fellows who are committed to perfecting their trade. It will be hard to settle for anything less in the future.

"Techcrunch are full of shit"

Monday, 23 February 2009
by Richard Jones
filed under Announcements and About Us
Comments: 190

On Friday night a technology blog called Techcrunch posted a vicious and completely false rumour about us: that Last.fm handed data to the RIAA so they could track who’s been listening to the “leaked” U2 album.

I denied it vehemently on the Techcrunch article, as did several other Last.fm staffers. We denied it in the Last.fm forums, on twitter, via email – basically we denied it to anyone that would listen, and now we’re denying it on our blog.

According to Ars Technica, even the RIAA don’t know where the rumour came from. The Ars Technica article is worth a read by the way, as it explains how the album was leaked in the first place by U2’s record label.

All the data and technical side of Last.fm is hosted in London and run by the team here. We keep a close eye on what data mining jobs we run, not because we’re paranoid the RIAA is trying to infiltrate us, but because time on our Hadoop Cluster (where the data lives) is so precious and we have lots of important jobs that run every day. It’s simply impossible for anyone to run a job without the team here noticing.

When you sign up to Last.fm and scrobble what you listen to, you are trusting us with your listening data. We take this very seriously. The old-timers on Last.fm who’ve been with us since the early days can attest to this – we’ve always been very open and transparent about how your data is used. This hasn’t changed. We never share personally identifiable data such as email and IP addresses. The only type of data we make available to labels and artists, other than what you see on the site, is aggregate data of listeners and number of plays.

Artists and labels can login to our MusicManager site to upload new content and update their catalogue. The MusicManager is also where artists and labels can see statistics on how popular their content is with Last.fm users.

If you were U2’s record label and logged in to the MusicManager today, you would see this:





…and you could pat yourself on the back for a successful album launch. All the controversy and press coverage surrounding the leaked release caused an obvious spike in the number of people listening to U2 recently.

So do us a favour – if you see people spreading the rumour, refer them to this blog post and mention you heard from a friend that “Techcrunch are full of shit.”

The Internship Experience (tm)

Monday, 13 October 2008
by Tim Sell
filed under About Us
Comments: 10

I recently finished a 3 month internship for Last.fm in the data junkie department. Officially I was a Java intern. That means turning coffee into code. In between thoughts of “why am I in a country where it’s cold even during summer?!”, I spent time thinking “how the hell did I end up here doing stuff I like?”

I’ve been instructed to write something about it. It all started on a dark and stormy night in 1981… I’ll just skip to the end. On my first day I was given a computer and told to install whatever I wanted on it. Ubuntu away!! I was very happy.

After that moment of glee however, it wasn’t all beer and ball pits. I also had to learn stuff.

Luckily I was given interesting, challenging problems and the trust to work on them with my own brain. It’s perfect for someone excited about all things open source as Last.fm uses open source tools everywhere. That helps give a warm fuzzy feeling throughout the day, even more so when I was able to make some contributions back. I submitted a couple of patches to HBase, the open source Bigtable-like implementation.

There was much digesting to be done. Researching and playing with things like Map/Reduce, Hadoop and HBase was a great way to learn about the future of scalability. It’s safe to say I learned far more than I would have crammed into lectures.

Of course, Last.fm is more than just a great learning environment. There’s also being a part of producing something that people really love. People care a lot about what they do at Last.fm and it’s infectious… beware.

I’d like to note that there are also certain things that have conspired against me during my internship. Good places in the area to go for lunch made me spend way too much money, a pretty decent coffee machine forced me to make many typos, and getting distracted discovering new music on the Last.fm website caused me to forget what I was thinking more than once.

All in all, the whole thing was a positive experience that ended in a hug.

Quality Control

Friday, 1 August 2008
by Adrian Woodhead
filed under About Us
Comments: 51

[Suggested listening while reading this post: Quality Control – Jurassic 5]

Prior to moving to London to join Last.fm I worked on credit card software for a leading international bank. When it comes to dealing with people’s money there isn’t much room for mistakes and buggy code can have major consequences. For these reasons there were a number of processes and systems in place to reduce the likelihood of software errors.

Despite what some of our more critical users may think, we do actually have a number of similar systems (and some novel additions) in place at Last.fm. We use software like Cacti, Ganglia, Nagios and JMX to monitor many aspects of our running infrastructure and the results are made available in a number of ways – from coloured graphs to arcanely-formatted log files. So much information is churned out that one could easily spend all day just looking at all the output until one’s mind buckled under the data overload. For this reason we selectively take the most vital data (things like database load, web request times, uptime status of core machines) and display these on eye-catching displays in our operations room.

Status display screens.

The setup shown above is great for being able to look up and get a quick feel for the current state of our systems. Blinking red and graphs with huge spikes are rarely a good thing. In addition to these displays we also have a number of alerts (e-mail, sms, irccat) that get triggered if things go wrong while we are away from the screens (yes, it does happen). There is nothing quite like the joy of being woken up in the early hours of the morning with a barrage of text messages containing the details of each and every machine that has unexpectedly crashed.

While all of this is very useful for keeping an eye on the code while it is running, it’s also good to be able to put the code through some checks and balances before we unleash it on the wider world. One means to this end is the venerable Hudson – a continuous integration engine that constantly builds our software, checks it for style and common coding errors, then instruments and tests it and reports on any violations that may have been introduced since the last time it ran. We have over 30 internal projects that use Hudson and a few thousand tests which run over the code. Hudson comes with a web interface and can be configured to send email when people “break the build” (e.g. by making a change that causes a test to fail). We decided that this wasn’t nearly humiliating enough and followed this suggestion (our setup pictured below) to introduce a more public form of punishment.

The bears that haunt our developers’ nightmares.

These three bears sit in a prominent position and watch our developers’ every move. When things are good we have a green bear gently glowing and purring, when changes are being processed a yellow bear joins the party, and if the build gets broken the growling evil red bear makes an appearance. The developer who broke things usually goes a similar shade of red while frantically trying to fix whatever was broken while the others chortle in the background.
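That logic is simple enough to sketch. Here’s a hypothetical version, assuming Hudson’s remote API reports a per-job status colour (e.g. “blue” for success, “red” for broken, with an “_anime” suffix while a build is running); the job data is invented, and the real setup drives physical bears rather than printing.

```python
# Hypothetical sketch of the bear logic: map per-job build statuses (in the
# colour vocabulary Hudson's remote API is assumed to report) onto which
# bears light up. The job data is invented.

def bear_states(jobs):
    """green = all good, yellow = a build in progress, red = something broken."""
    colors = [job["color"] for job in jobs]
    bears = set()
    if any(c.startswith("red") for c in colors):
        bears.add("red")
    else:
        bears.add("green")
    if any(c.endswith("_anime") for c in colors):  # "_anime" = currently building
        bears.add("yellow")
    return bears

print(bear_states([{"color": "blue"}, {"color": "blue_anime"}]))
```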

Amid all this hi-tech digital trickery, it is sometimes nice to be able to cast one’s mind back to the simpler analogue age and the measuring devices of the past. For example, we hooked up an analogue meter like those used in many industries for decades, fed it some different input and ended up with a literal desktop dashboard that measures average website response time.

Web response time meter.

It is strangely mesmerising to see this meter rev up and down as website demand changes over the day (or we manage to overload our data centre’s power supply and a significant portion of our web farm gets to take an unexpected break from service).

On the whole we have a great variety of options for keeping our eyes on the quality prize, thanks in no small measure to the efforts of the open source software community who crafted all the software I have mentioned. Of course the biggest challenge to ensuring quality is still the human component – getting people to actually use these tools and instilling the desire and motivation to make software as bug-free as possible. If any of you out there use similar tools that you are passionate about let us know. I’d also love to hear if anyone has any other amusing or original systems to keep quality control fun and fresh. For me, I’ve got a glowing green bear to keep me company…

Java Summer Interns

Wednesday, 9 April 2008
by Johan Oskarsson
filed under Code and About Us
Comments: 24

Update
We have now filled all the slots for the internship. Thanks to everyone who applied!

Our lovely back end team is looking for fresh meat. Specifically Java programming students that want to work with huge datasets and clusters of yellow elephants. So if you are interested in hanging out in East London and getting your hands dirty hacking some of the most exciting music-related software on the ‘net, then read on for more information on our summer internship programme.

You’ll be spending your days bathing in the ball pit and getting back rubs from our support department. Occasionally we’ll ask you to do some work too, for example developing new features for the open source projects below or improving our own internal systems. You will be mentored by experienced Last.fm developers and stand to learn loads about making the software that is used daily by millions of people across the globe.

Potential areas to work in:

  • The open source distributed data processing project Apache Hadoop.
  • The distributed data storage project Apache HBase.
  • Processing our listening data into interesting funky stuff we can use on the site.
  • Improving the Java support in (soon to be Apache) Thrift.
  • Using X-Trace to find and resolve bottlenecks in our distributed systems.
  • We’re also open for student suggestions! If it’s useful and/or cool, include any ideas you may have in your application.

If you’re allowed to work in the UK, please e-mail your application, including a CV, to jobs@last.fm, putting “java summer intern” in the subject line. The internship will last for the coming summer (we are quite flexible with the exact dates). If things go really well, you might even end up joining our team when your studies are over.

More information on our jobs page.

The Last.fm intern experience

Hadoop Summit 2008: Creating new Infrastructures for Big Data

Tuesday, 1 April 2008
by Martin Dittus
filed under About Us and Code
Comments: 15

Last week Johan and I were fortunate to be sent to the Hadoop Summit 2008, the first conference focused on the Hadoop open source project and its surrounding ecosystem. Great talks, conversations with a lot of interesting and talented people, and loads of food for thought.

Big Data Startups

Last.fm is a prominent representative of a growing class of Internet startups: smaller companies whose business entails storing and processing huge amounts of data with a low amount of overhead. Our project teams are comparably tiny and we rely mostly on open source infrastructure; and while the Big Data problems of big corporations have been catered to for decades (in the form of an entire industry of integrated infrastructures, expensive hardware, consulting fees, training seminars, …) we’re not really in that market. As a result we often have to create our own, low-overhead solutions.

Throughout the Hadoop Summit it was very apparent that we’re seeing the dawn of a new culture of data teams inside Internet startups and corporations, people who manage larger and larger data sets and want better mechanisms for storage, offline processing and analysis. Many are unhappy with existing solutions, because they solve the wrong problems, are too expensive, or are based on unsuitable infrastructure designs.

The ideal computing model in this context is a distributed architecture: if your current system is at its limits you can just add more machines. Google was one of the first Internet companies to not only recognise that, but to find great solutions; their papers on MapReduce (for distributed data processing) and their Google File System (for distributed storage) have had great impact on how modern data processing systems are built.

Doug Cutting’s Hadoop project incorporates core principles from both of them, and on top of that is very easy to develop for. It’s a system for the offline processing of massive amounts of data; where you write data once, then never change it, where you don’t care about transactions or latency, but want to be able to scale up easily and cheaply.
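The programming model itself is easy to demo in miniature. Here’s a toy, single-process imitation of MapReduce counting plays per artist – the scrobbles are invented, and a real Hadoop job would of course run distributed across a cluster.

```python
# Toy, single-process imitation of the MapReduce model: map emits key/value
# pairs, the framework groups them by key, reduce aggregates each group.
# The scrobbles are invented; a real Hadoop job runs across a cluster.

from collections import defaultdict

def map_phase(records, mapper):
    grouped = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)
    return grouped                      # the "shuffle": values grouped by key

def reduce_phase(grouped, reducer):
    return {key: reducer(key, values) for key, values in grouped.items()}

scrobbles = [("alice", "Radiohead"), ("bob", "Radiohead"), ("alice", "U2")]
plays = reduce_phase(
    map_phase(scrobbles, lambda rec: [(rec[1], 1)]),  # emit (artist, 1)
    lambda artist, counts: sum(counts),               # sum the 1s per artist
)
print(plays)  # {'Radiohead': 2, 'U2': 1}
```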

Distributed Computing, Structured Data and Higher Levels of Abstraction

One current trend we’re seeing within the Hadoop community is the emergence of MapReduce processing frameworks on a higher level of abstraction; these usually incorporate a unified model to manage schemas/data structures, and data flow query languages that often bear a striking resemblance to SQL. But they’re not trying to imitate relational databases — as mentioned above, nobody is interested in transactions or low latency. These are offline processing systems.

My personal favourite among these projects is Facebook’s Hive, which could be described as their approach to a data warehousing model on top of MapReduce; according to the developers it will see an open source release this year. Then there’s Pig, Jaql, and others; and Microsoft’s research project Dryad, which implements an execution engine for distributed processing systems that self-optimises by transforming a data flow graph, and that integrates nicely with their existing (commercial and closed-source) development infrastructure.

Another increasingly prominent project is HBase, a distributed cell store for structured data, which in turn implements another Google paper (“Bigtable”). HBase uses Hadoop’s distributed file system for storage, and we’re already evaluating it for use inside the Last.fm infrastructure.

But despite all this activity it’s still very apparent that this is a young field, and there are at least as many unsolved problems as there are solved ones. This is only a faint indicator of what’s to come…

If You Just Got Interested

Are you a software developer with a background in distributed computing, large databases, statistics, or data warehousing? Want to gain first-hand experience inside an emerging industry? Then apply for our Java Developer and Data Warehouse Developer positions!