Last week, Gilda Maurice (from our data team), Steve Whilton (from our product team) and I went to Berlin for a couple of days. While Steve hurtled round Berlin from meeting to meeting (rather him than me), Gilda and I headed over to the Berlin Buzzwords conference at the Urania conference centre, just south of the famous Tiergarten.
It’s an annual meetup for engineers, scientists and other assorted hackers in the field of ‘big data’. The problems of processing and analysing the amount of data generated on the social web have required a whole new set of approaches, and we’re very keen on keeping up with new developments in this area, especially if they can help us make Last.fm better.
Two of the main open-source data tools we rely on at Last.HQ are Hadoop, a framework for parallel storage and querying of data on a cluster of servers, and Solr, a search engine based on the Lucene toolkit. Solr drives the search functions on the web site, and Hadoop does much of the behind-the-scenes number crunching, such as generating the weekly charts and calculating artist similarities. Lucene and Hadoop have both been very influential in this field, so it was fitting that the conference opened with a keynote from Doug Cutting, who originated both projects.
In fact, Doug Cutting’s intro set the tone pretty well — probably half the talks were on Lucene or Hadoop, or other technologies that build on them. We learnt how to tune Solr performance and measure its relevance, how to improve its accuracy with a dash of linguistics, and how to visualize the topics within a given set of search results. Facebook and StumbleUpon presented their experiences of HBase, a Hadoop-backed database for storing and querying massive quantities of user data and content in real time, and JTeam took us through Mahout, a machine-learning toolkit for clustering and classification tasks, also based on Hadoop. A few of the talks went further into computer science theory, but always with a view to producing high-volume applications ready for web-scale data.
It’s hard to pick favourites out of such a dense line-up, but we particularly liked Joseph Turian’s talk on new data-mining techniques (semantic hashing, graph parallelism and unsupervised semantic parsing), and Stanislaw Osinski’s session on clustering and visualizing Solr search results with Carrot2, accompanied by a beautiful demo. Mark Miller and Rod Cope gave some sound advice on scaling Solr and HBase, and Chris Wensel took us through designing algorithms to manipulate and extract data from Hadoop.
Sadly, with three rooms running in parallel each day, there was no way we could catch all the talks we wanted to see. Thankfully all the talks were filmed: the slides are available here (apart from a few which are yet to appear), and the organisers will be making all the videos available soon.
Steve and Andy on a Berlin rooftop. Photo by Gilda.