One paper that I read last Friday evening comes from Twitter: "Scaling Big Data Mining Infrastructure: The Twitter Experience", authored jointly by Jimmy Lin (@lintool on Twitter) and Dmitriy Ryaboy (@squarecog on Twitter). Jimmy comes from academia and has worked with the Twitter analytics infrastructure team during a sabbatical. Dmitriy is one of the managers of the analytics infrastructure team at Twitter.
The paper combines wisdom from the academic world with the real-world implementation experience gained at Twitter. It discusses how concepts for handling Big Data and machine learning that originated in academia have been adapted to large-scale industrial applications. It also discusses at length the typical problems we encounter when integrating a pipeline of heterogeneous components, such as impedance mismatch. You can fork thousands of mappers as part of your map/reduce pipeline to import data from sharded MySQL databases into Hadoop clusters. In practice, though, at this scale you typically end up with a denial of service, because the mappers starve for critical resources like database connections. As the paper puts it: "The underlying cause is the differing scalability characteristics of MySQL and Hadoop, which are not exposed at the interfaces provided by these services; additional layers for handling QOS, pushing back on over-eager clients (and handling such pushback), etc. become necessary".
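To make the pushback idea concrete, here is a minimal Python sketch of my own, not something taken from the paper or from Twitter's code. It assumes a toy in-memory stand-in for a MySQL shard (`FakeShard`) with a hard cap on open connections, and a hypothetical `import_split` helper that plays the role of a mapper handling pushback through capped exponential backoff with jitter. All names and parameters are invented for illustration; no real Hadoop or MySQL API is used.

```python
import random
import threading
import time


class ConnectionLimitExceeded(Exception):
    """Pushback signal: the (fake) database has no free connection right now."""


class FakeShard:
    """Hypothetical stand-in for one MySQL shard with a cap on open connections."""

    def __init__(self, max_connections):
        self.max_connections = max_connections
        self.open_connections = 0
        self.peak_connections = 0
        self.rejections = 0
        self._lock = threading.Lock()

    def read_rows(self, split_id):
        # Refuse the request instead of letting clients pile up.
        with self._lock:
            if self.open_connections >= self.max_connections:
                self.rejections += 1
                raise ConnectionLimitExceeded(split_id)
            self.open_connections += 1
            self.peak_connections = max(self.peak_connections, self.open_connections)
        try:
            time.sleep(0.001)  # pretend to stream rows for a while
            return [(split_id, n) for n in range(3)]
        finally:
            with self._lock:
                self.open_connections -= 1


def import_split(shard, split_id, max_attempts=100, base_delay=0.001, max_delay=0.02):
    """Client side of pushback: retry with capped exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return shard.read_rows(split_id)
        except ConnectionLimitExceeded:
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
    raise RuntimeError(f"split {split_id} gave up after {max_attempts} attempts")


def run_import(num_mappers=40, max_connections=4):
    """Start many over-eager 'mappers' against one connection-limited shard."""
    shard = FakeShard(max_connections)
    results = {}
    results_lock = threading.Lock()

    def mapper(split_id):
        rows = import_split(shard, split_id)
        with results_lock:
            results[split_id] = rows

    threads = [threading.Thread(target=mapper, args=(i,)) for i in range(num_mappers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shard, results


if __name__ == "__main__":
    shard, results = run_import()
    print("splits imported:", len(results))
    print("peak connections:", shard.peak_connections)
    print("requests pushed back:", shard.rejections)
```

The point of the sketch is only that the connection limit has to be visible somewhere: the shard refuses work it cannot take on, and the clients slow down instead of hammering it. In a real pipeline that logic would probably sit in a separate layer between Hadoop and the databases, which is what the quote above argues for.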
The paper also discusses many other issues that you face when operating at the scale of Twitter: the challenges of transporting logs across data centers to a centralized Hadoop warehouse, the importance of schemas in log processing, the scalability of data mining when your data doesn’t fit on one server, the drawbacks of sampling-based learning approaches, and the virtues of deploying a truly scalable machine learning pipeline. Not all of us operate at that scale, but it’s always worth being aware of such concerns at the design and implementation level.
If you are working with Big Data and/or machine learning, this is very highly recommended reading.