Last week Johan and I were fortunate to be sent to the Hadoop Summit 2008, the first conference focused on the Hadoop open source project and its surrounding ecosystem. Great talks, conversations with a lot of interesting and talented people, and loads of food for thought.
Big Data Startups
Last.fm is a prominent representative of a growing class of Internet startups: smaller companies whose business entails storing and processing huge amounts of data with very little overhead. Our project teams are comparatively tiny and we rely mostly on open source infrastructure; and while the Big Data problems of big corporations have been catered to for decades (in the form of an entire industry of integrated infrastructures, expensive hardware, consulting fees, training seminars, …), we're not really in that market. As a result we often have to create our own low-overhead solutions.
Throughout the Hadoop Summit it was very apparent that we're seeing the dawn of a new culture of data teams inside Internet startups and corporations: people who manage larger and larger data sets and want better mechanisms for storage, offline processing and analysis. Many are unhappy with existing solutions because they solve the wrong problems, are too expensive, or are built on unsuitable infrastructure designs.
The ideal computing model in this context is a distributed architecture: if your current system is at its limits, you just add more machines. Google was one of the first Internet companies not only to recognise that, but to find great solutions; their papers on MapReduce (for distributed data processing) and the Google File System (for distributed storage) have had a great impact on how modern data processing systems are built.
Doug Cutting's Hadoop project incorporates core principles from both of them, and on top of that is very easy to develop for. It's a system for the offline processing of massive amounts of data: you write data once and never change it, you don't care about transactions or low latency, but you want to be able to scale up easily and cheaply.
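To make the programming model a bit more concrete, here is the canonical word-count example as a rough sketch against Hadoop's Java MapReduce API. It uses the newer org.apache.hadoop.mapreduce classes, so exact class names and signatures will vary between Hadoop versions, and the input and output paths are just placeholders:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map phase: emit (word, 1) for every word in the input split.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer tokens = new StringTokenizer(value.toString());
                while (tokens.hasMoreTokens()) {
                    word.set(tokens.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce phase: sum the counts the framework has grouped under each word.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable value : values) {
                    sum += value.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));      // input: HDFS path, written once
            FileOutputFormat.setOutputPath(job, new Path(args[1]));    // output: a fresh directory, never rewritten
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The mapper emits (word, 1) pairs, the framework sorts and groups them by key across the cluster, and the reducer sums each group; when the input grows, you add machines rather than rewrite the job.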
Distributed Computing, Structured Data and Higher Levels of Abstraction
One current trend we're seeing within the Hadoop community is the emergence of MapReduce processing frameworks at a higher level of abstraction; these usually incorporate a unified model for managing schemas and data structures, plus data flow query languages that often bear a striking resemblance to SQL. But they're not trying to imitate relational databases; as mentioned above, nobody here is interested in transactions or low latency. These are offline processing systems.
My personal favourite among these projects is Facebook’s Hive, which could be described as their approach to a data warehousing model on top of MapReduce; according to the developers it will see an open source release this year. Then there’s Pig, Jaql, and others; and Microsoft’s research project Dryad, which implements an execution engine for distributed processing systems that self-optimises by transforming a data flow graph, and that integrates nicely with their existing (commercial and closed-source) development infrastructure.
Another increasingly prominent project is HBase, a distributed cell store for structured data, which in turn implements another Google paper ("Bigtable"). HBase uses Hadoop's distributed file system for storage, and we're already evaluating it for use inside the Last.fm infrastructure.
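To illustrate the data model rather than the HBase client API itself (which is still young and changing quickly), here is a toy, single-machine sketch of the Bigtable idea: a sparse, sorted map from row key to column to timestamped cell versions. The class and method names below are made up for illustration only; HBase's contribution is persisting and distributing a structure like this across a cluster on top of HDFS:

    import java.nio.charset.StandardCharsets;
    import java.util.NavigableMap;
    import java.util.TreeMap;

    // Toy model of a Bigtable-style cell store (illustration only, not the HBase API):
    // rows are kept sorted by key, columns are sparse, and every cell keeps
    // timestamped versions of its value.
    public class CellStoreSketch {

        private final NavigableMap<String,                      // row key
            NavigableMap<String,                                // column ("family:qualifier")
                NavigableMap<Long, byte[]>>> rows = new TreeMap<>();

        public void put(String row, String column, long timestamp, byte[] value) {
            rows.computeIfAbsent(row, r -> new TreeMap<>())
                .computeIfAbsent(column, c -> new TreeMap<>())
                .put(timestamp, value);
        }

        // Return the most recent version of a cell, or null if it was never written.
        public byte[] getLatest(String row, String column) {
            NavigableMap<String, NavigableMap<Long, byte[]>> columns = rows.get(row);
            if (columns == null) return null;
            NavigableMap<Long, byte[]> versions = columns.get(column);
            if (versions == null || versions.isEmpty()) return null;
            return versions.lastEntry().getValue();             // highest timestamp wins
        }

        public static void main(String[] args) {
            CellStoreSketch store = new CellStoreSketch();
            store.put("user:42", "profile:country", System.currentTimeMillis(),
                "UK".getBytes(StandardCharsets.UTF_8));
            System.out.println(new String(
                store.getLatest("user:42", "profile:country"), StandardCharsets.UTF_8));
        }
    }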
But despite all this activity it’s still very apparent that this is a young field, and there are at least as many unsolved problems as there are solved ones. This is only a faint indicator of what’s to come…
If You Just Got Interested
Are you a software developer with a background in distributed computing, large databases, statistics, or data warehousing? Want to gain first-hand experience inside an emerging industry? Then apply for our Java Developer and Data Warehouse Developer positions!
Comments
Luke
1 April, 22:30
I just got interested! Will you wait for me until January 2010, when my current contract expires? :)
Kimiko
1 April, 23:15
But Last.fm also needs to read back the scrobbled data, right, to calculate relationships between users and artists and tracks and tags? Doesn’t latency matter then?
mkb
2 April, 01:53
Oh, if only you weren’t an ocean away!
Mal
2 April, 18:31
Must be a nightmare having to deal with last.fm’s data. I’ve had to denormalise databases to help them scale, but this field of work seems to be on a completely different, almost scary, level.
Rob Szarka
2 April, 19:33
Great post! It is, indeed, an exciting time for this kind of development… I’d love to hear your thoughts on how Amazon’s AWS measures up, or fails to measure up, to your requirements.
Adrian
3 April, 17:59
Kimiko – we try to do that using "batch" processes which run on the data in offline mode and then feed the results of these calculations back into the system. Querying the data in realtime would indeed require low latency, something the HBase product that martind mentioned doesn't seem to provide yet (but it's early days for HBase still).
Mal – “almost” scary? There’s no almost about it! ;)
Fowl2
5 April, 14:29
“scary”
<a href="http://www.last.fm/music/Alex+Gaudino+Feat.+Crystal+Waters">example of non-normalized data</a>
It’s a mess, but you guys are working on it!
If only I was a little bit older, an ocean closer, and slightly more than slightly more computer sciencey….
:-)
gwalla
5 April, 22:51
Can’t comment much on the technical matters (I’m pretty tech-savvy, but that’s way beyond me). But I’m curious about that photo: what were you guys doing at Yerba Buena Gardens in San Francisco? I work just a couple of blocks from there. And the Hadoop site claims the summit was in Santa Clara.
Martin Dittus
6 April, 10:54
gwalla —
ah that’s what that place was called, thanks!
This was the last day of our trip; we visited some startups in the city before we left.
maestro
7 April, 23:23
this has nothin to do with the blog…but i’m wondering…
When is last.fm going to come out with the lightweight adobe air software? :D :D :D
Azalia Decart
9 April, 00:56
I see your point.
You think that the existing data storage solutions are too expensive or too outdated.
Most of the current solutions are too monolithic in my opinion. A more modular approach with lower access latency would be a dream of any data digger.
Karthikeyan R
12 April, 02:16
you do realise that hadoop has been almost a pet project for yahoo! right? i mean, doug is himself with yahoo!
ashibuogwu kenneth
18 April, 10:36
i want to know more about your stuffs
thank you
Gruce
19 April, 09:52
Oh, I want to know more and would appreciate it if I am selected for this job role.
Jon
9 May, 21:04
It's cool that you can press T to add tags, but it's a real pain in tab-enabled browsers, if you know what I mean.
sorry, don't have time to put this in the correct place
cheers jon – keep up the good work