
How TIBCO Leverages Big Data Analytics with Apache Hadoop and Apache Spark

Blog: The TIBCO Blog

Big Data is no longer just hype. Most of the customers and prospects I visited last year already use Hadoop, at least in early stages, and Apache Spark also got a lot of traction in 2015. These frameworks and their ecosystems will probably grow even more in 2016, becoming more mature and prevalent in enterprises large and small.

Both Apache Hadoop and Apache Spark can be combined with TIBCO software to add business value to our customers' projects. Therefore, I thought it's time to give an overview of how the different TIBCO pillars of integration, event processing, and analytics support these frameworks at the beginning of 2016. This blog post is intended as a short overview and will not go into many technical details.

Integration and Orchestration with Hadoop (MapReduce, HDFS, HBase, Hive)

The key challenge is integrating the input and results of Hadoop processing with the rest of the enterprise. Using a Hadoop distribution (Hortonworks, Cloudera, MapR, et al.) on its own requires a lot of complex coding for integration services. TIBCO ActiveMatrix BusinessWorks offers a big data plugin that integrates Hadoop in both directions, input and output, without coding, supporting technologies such as MapReduce, HDFS, HBase, and Hive. More details are available in this dedicated blog post: TIBCO ActiveMatrix BusinessWorks 6 + Apache Hadoop = Big Data Integration.
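To give a feeling for the hand-coded alternative such a plugin replaces, here is a minimal sketch of querying Hive over JDBC with the standard Hive driver; the hostname, credentials, and table are hypothetical placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC endpoint; host, port, database, and user are placeholders
        String url = "jdbc:hive2://hive-server.example.com:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "etl_user", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
            }
        }
    }
}
```

And this covers only a single query; production integration code also has to handle connection pooling, error handling, retries, and data mapping, which is exactly what graphical tooling abstracts away.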

Another relevant topic is the orchestration of Hadoop jobs. Frameworks such as Apache Oozie or Apache NiFi are available to schedule Hadoop workflows. For example, Oozie Workflow jobs are Directed Acyclic Graphs (DAGs) of actions, and Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability. These frameworks add more complexity and require a lot of coding and configuration. TIBCO ActiveMatrix BusinessWorks is a nice alternative for implementing such workflow schedules, with powerful orchestration features and without coding.
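As a rough illustration of that coding effort, here is a minimal sketch of submitting and monitoring an Oozie Workflow job through Oozie's Java client API; the server URL, HDFS paths, and property values are hypothetical placeholders, and the workflow definition itself (an XML file describing the DAG) must already be deployed to HDFS.

```java
import java.util.Properties;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class OozieSubmitExample {
    public static void main(String[] args) throws Exception {
        // Connect to the Oozie server (URL is a placeholder)
        OozieClient client = new OozieClient("http://oozie-server.example.com:11000/oozie");

        // Job properties; APP_PATH points to the deployed workflow.xml in HDFS
        Properties conf = client.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode.example.com:8020/apps/my-wf");
        conf.setProperty("nameNode", "hdfs://namenode.example.com:8020");
        conf.setProperty("jobTracker", "resourcemanager.example.com:8032");

        // Submit and start the workflow, then poll until it leaves the RUNNING state
        String jobId = client.run(conf);
        while (client.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {
            Thread.sleep(10_000);
        }
        System.out.println("Workflow " + jobId + " finished with status "
                + client.getJobInfo(jobId).getStatus());
    }
}
```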

Business Intelligence, Data Discovery, and Reporting with Hadoop and Spark

TIBCO Spotfire, the TIBCO product for data discovery and advanced analytics, has certified connectors to all important Hadoop and Spark interfaces, such as HDFS, Hive, Impala, and SparkSQL. Just enter the connection details for the cluster (e.g., IP address, user, password) in the Spotfire user interface and start analyzing the data stored on the big data cluster. You can either load data in-memory for further analysis or do “in-database analytics” directly in the cluster.

The connectors also support relevant security requirements. For example, the connector for Impala, an analytic MPP database for Apache Hadoop, is certified on Cloudera’s CDH5 and includes support for security via Kerberos, SSL, or username/password.
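For readers connecting to Impala outside of Spotfire: since Impala speaks the HiveServer2 wire protocol, a Kerberos-secured connection can be sketched with the standard Hive JDBC driver as below. The host, port, and Kerberos principal are hypothetical placeholders, and the client is assumed to already hold a Kerberos ticket (e.g., obtained via kinit).

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ImpalaKerberosExample {
    public static void main(String[] args) throws Exception {
        // Impala's HiveServer2-compatible port is 21050 by default;
        // the principal parameter enables Kerberos authentication
        String url = "jdbc:hive2://impala-host.example.com:21050/default;"
                + "principal=impala/impala-host.example.com@EXAMPLE.COM";
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM orders")) {
            if (rs.next()) {
                System.out.println("row count: " + rs.getLong(1));
            }
        }
    }
}
```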

TIBCO Jaspersoft—for pixel-perfect reporting embedded into any browser, application, or mobile device—also offers connectors to all relevant Hadoop and Spark interfaces.

Streaming Analytics with Flume, Kafka, MQTT, and SparkSQL

Streaming analytics is becoming more and more important for processing big data in real time. Therefore, TIBCO StreamBase offers connectors to many frameworks with high messaging volume, such as the Hadoop-related Apache Flume, which is usually used for efficiently collecting, aggregating, and moving large amounts of data into or out of Hadoop. Messaging solutions such as TIBCO FTL and brokers for standards such as MQTT or AMQP are also supported. TIBCO StreamBase can be combined with any of these sources to implement efficient filtering, aggregation, correlation, and advanced analytics with easy-to-use and mature tooling. By the way: a StreamBase connector for Apache Kafka (a general-purpose publish-subscribe messaging system) will also be available soon.
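To make the publish-subscribe model concrete, here is a minimal sketch of publishing an event with Kafka's standard Java producer client; the broker address, topic name, and payload are hypothetical placeholders, and a stream processor such as StreamBase would consume the topic on the other side.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SensorEventProducer {
    public static void main(String[] args) {
        // Standard Kafka producer configuration; the broker address is a placeholder
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-broker.example.com:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish one event to the (hypothetical) "sensor-events" topic;
            // any subscriber receives it asynchronously
            producer.send(new ProducerRecord<>("sensor-events",
                    "machine-42", "{\"temperature\": 78.5}"));
        }
    }
}
```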

In addition to stream processing, you can also interact with analytic databases in real time. This blog post explains in more detail how to connect to the analytic engine Impala via its SQL interface. SparkSQL (Spark’s module for working with structured data, either within Spark programs or through standard JDBC/ODBC connectors), Apache Phoenix (a relational database layer over HBase), and other frameworks can, of course, be used in the same way.
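For the “within Spark programs” path, here is a minimal sketch of running a SQL query over structured data with Spark's Java API, assuming the Spark 2.x SparkSession entry point; the input path and column names are hypothetical placeholders.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSqlExample {
    public static void main(String[] args) {
        // Local session for illustration; on a cluster, the master is set by the deployment
        SparkSession spark = SparkSession.builder()
                .appName("SparkSqlExample")
                .master("local[*]")
                .getOrCreate();

        // Register a JSON dataset as a temporary view and query it with plain SQL
        Dataset<Row> orders = spark.read().json("hdfs:///data/orders.json");
        orders.createOrReplaceTempView("orders");

        spark.sql("SELECT customer_id, SUM(amount) AS total "
                + "FROM orders GROUP BY customer_id").show();

        spark.stop();
    }
}
```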

TIBCO Live Datamart can be used on top of automatic stream processing to allow operational analytics and proactive human interaction with real-time data. This enables real-time visibility into what’s happening in a Hadoop or Spark cluster.

Machine Learning with R, Hadoop and SparkR

The R language is being used more and more frequently by data scientists, especially for use cases such as predictive analytics or recommendation and optimization scenarios. TERR (TIBCO Enterprise Runtime for R) can be used to bring R from academia to the enterprise with high scalability.

What many of you might not be aware of is that TERR can also run on top of Hadoop or Spark clusters and leverage their analytic power. This whitepaper explains the different options for running TERR code in combination with Hadoop and Spark clusters. You can leverage this kind of TERR analytics both for historical analysis to find new insights using TIBCO Spotfire and for real-time stream processing using TIBCO StreamBase. Under the hood, TERR uses the Hadoop Streaming interface and SparkR, respectively, to achieve this.
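To illustrate the contract behind the Hadoop Streaming interface: any executable that reads raw input lines from stdin and writes tab-separated key-value pairs to stdout can act as a mapper or reducer, whether it is an R script run by TERR or, as in this hypothetical sketch, a small Java program.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

// A word-count mapper obeying the Hadoop Streaming contract:
// read raw lines from stdin, emit "key<TAB>value" pairs on stdout.
// An R script executed by a streaming job follows exactly the same protocol.
public class StreamingMapper {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    System.out.println(word + "\t1");
                }
            }
        }
    }
}
```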

Hadoop and Spark are Everywhere at TIBCO

Hadoop and Spark are two of the most relevant frameworks for big data analytics these days, and their ecosystems are growing at an unbelievable pace. TIBCO software frequently leverages these ecosystems to add business value in all three pillars: integration, event processing, and analytics. TIBCO in conjunction with Hadoop and Spark provides benefits such as big data analytics without low-level knowledge of the underlying frameworks, and faster, more frequent implementations and deployments.

 
