Real-Time Analytics and Ad Hoc Queries: What Hadoop Can and Can’t Do for You
I recently ran across a TechTarget article that discusses at length Hadoop’s limitations for real-time analytics applications. The article, by Ed Burns, emphasizes that while Hadoop was designed to process large sets of structured, unstructured and semi-structured data, it was built as a batch processing system, which imposes significant limitations on real-time analysis. Burns features excerpts from an interview with Forrester analyst Mike Gualtieri, who notes that plenty of vendors and end users are asking, “Why can’t we execute real-time data analytics and ad hoc queries using Hadoop?” It’s a valid question, and Mike cites a key obstacle Hadoop faces with respect to real-time analytics.
Mike states that most of the new Hadoop query engines remain slower and more cumbersome than mainstream relational databases. Various tools offer interfaces that let users write queries in SQL, which are then translated into MapReduce jobs for execution on a Hadoop cluster.
While Hadoop’s scalability and affordability are appealing to some organizations, it’s important to recognize its place in the market. If real-time analytics and ad hoc query capabilities are important to you, experts agree it’s better not to cobble Hadoop into a systems architecture where it doesn’t fit.
Nonetheless, Hadoop vendors are complementing their batch processing capabilities by partnering with stream processing technology providers. Vitria is one such provider, and it can continuously process streaming data in real time. Hadoop on its own, by contrast, is a batch processing system, so achieving streaming data analytics with it is a piecemeal approach that requires multiple technologies.
If effectively managing your business requires immediate and continuous data analysis – down to the fraction of a second – batch processing just won’t cut the mustard. Companies in the retail, energy, financial services or telecom industries, for example, need this level and speed of analysis. The value of continuous real-time data analytics is widely recognized within these industries, but it’s critically important to distinguish between “continuous, real-time” and “on-demand, near real-time” analytics, and to understand the potential pitfalls and drawbacks of Hadoop-based data analytics solutions.
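
To make the batch versus continuous distinction concrete, here is a toy sketch in Python. It is only an illustration under simplifying assumptions, not Hadoop or Vitria code: the sample events and the names `batch_totals` and `RunningTotals` are made up for this example. The batch function can only answer after scanning the full data set, while the continuous version updates its answer as each event arrives.

```python
from collections import defaultdict

# Hypothetical sample events: (timestamp in seconds, store id, sale amount).
EVENTS = [
    (0.2, "store-a", 10.0),
    (0.7, "store-b", 5.0),
    (1.1, "store-a", 2.5),
    (1.9, "store-b", 7.5),
]


def batch_totals(events):
    """Batch style: the answer exists only after the whole data set is scanned."""
    totals = defaultdict(float)
    for _, store, amount in events:
        totals[store] += amount
    return dict(totals)


class RunningTotals:
    """Continuous style: state is updated per event, so an answer is always current."""

    def __init__(self):
        self._totals = defaultdict(float)

    def on_event(self, event):
        # Fold one event into the running state and return a snapshot of it.
        _, store, amount = event
        self._totals[store] += amount
        return dict(self._totals)


if __name__ == "__main__":
    stream = RunningTotals()
    for event in EVENTS:
        snapshot = stream.on_event(event)
        print(f"t={event[0]:.1f}s running totals: {snapshot}")
    print("batch totals after full scan:", batch_totals(EVENTS))
```

Both approaches end up with the same totals. What differs is when an answer becomes available: after every event in the continuous case, and only once the full scan completes in the batch case.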