Last week Johan and I were fortunate to be sent to the Hadoop Summit 2008, the first conference focused on the Hadoop open source project and its surrounding ecosystem. Great talks, conversations with a lot of interesting and talented people, and loads of food for thought.
Big Data Startups
Last.fm is a prominent representative of a growing class of Internet startups: smaller companies whose business entails storing and processing huge amounts of data with a low amount of overhead. Our project teams are comparably tiny and we rely mostly on open source infrastructure; and while the Big Data problems of big corporations have been catered to for decades (in the form of an entire industry of integrated infrastructures, expensive hardware, consulting fees, training seminars, …) we’re not really in that market. As a result we often have to create our own, low-overhead solutions.
Throughout the Hadoop Summit it was very apparent that we’re seeing the dawn of a new culture of data teams inside Internet startups and corporations, people who manage larger and larger data sets and want better mechanisms for storage, offline processing and analysis. Many are unhappy with existing solutions; because they solve the wrong problems, are too expensive, or based on unsuitable infrastructure designs.
The ideal computing model in this context is a distributed architecture: if your current system is at its limits you can just add more machines. Google was one of the first Internet companies to not only recognise that, but to find great solutions; their papers on MapReduce (for distributed data processing) and their Google File System (for distributed storage) have had great impact on how modern data processing systems are built.
Doug Cutting’s Hadoop project incorporates core principles from both of them, and on top of that is very easy to develop for. It’s a system for the offline processing of massive amounts of data; where you write data once, then never change it, where you don’t care about transactions or latency, but want to be able to scale up easily and cheaply.
Distributed Computing, Structured Data and Higher Levels of Abstraction
One current trend we’re seeing within the Hadoop community is the emergence of MapReduce processing frameworks on a higher level of abstraction; these usually incorporate a unified model to manage schemas/data structures, and data flow query languages that often bear a striking resemblance to SQL. But they’re not trying to imitate relational databases — as mentioned above, nobody is interested in transactions or low latency. These are offline processing systems.
My personal favourite among these projects is Facebook’s Hive, which could be described as their approach to a data warehousing model on top of MapReduce; according to the developers it will see an open source release this year. Then there’s Pig, Jaql, and others; and Microsoft’s research project Dryad, which implements an execution engine for distributed processing systems that self-optimises by transforming a data flow graph, and that integrates nicely with their existing (commercial and closed-source) development infrastructure.
Another increasingly prominent project is HBase, a distributed cell store for structured data, which in turn implements another Google paper (”Bigtable“). HBase uses Hadoop’s distributed file system for storage, and we’re already evaluating it for use inside the Last.fm infrastructure.
But despite all this activity it’s still very apparent that this is a young field, and there are at least as many unsolved problems as there are solved ones. This is only a faint indicator of what’s to come…
If You Just Got Interested
Are you a software developer with a background in distributed computing, large databases, statistics, or data warehousing? Want to gain first-hand experience inside an emerging industry? Then apply for our Java Developer and Data Warehouse Developer positions!