Apache Zeppelin is an interactive computational environment built on Apache Spark, in the same vein as the IPython Notebook. With Apache PredictionIO and Spark SQL, you can easily analyze your collected events while you are developing or tuning your engine.
Prerequisites
The following instructions assume that you have the command sbt accessible in your shell's search path. Alternatively, you can use the sbt command that comes with Apache PredictionIO at $PIO_HOME/sbt/sbt.
Export Events to Apache Parquet
PredictionIO supports exporting your events to Apache Parquet, a columnar storage format that enables fast ad hoc queries.
Let's export the data we imported in the Recommendation Engine Template Quick Start, assuming the App ID is 1.
```bash
$ $PIO_HOME/bin/pio export --appid 1 --output /tmp/movies --format parquet
```
After the command has finished successfully, you should see something similar to the following.
```
root
 |-- creationTime: string (nullable = true)
 |-- entityId: string (nullable = true)
 |-- entityType: string (nullable = true)
 |-- event: string (nullable = true)
 |-- eventId: string (nullable = true)
 |-- eventTime: string (nullable = true)
 |-- properties: struct (nullable = true)
 |    |-- rating: double (nullable = true)
 |-- targetEntityId: string (nullable = true)
 |-- targetEntityType: string (nullable = true)
```
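If you want to sanity-check the export outside of Zeppelin, the same schema can be printed from spark-shell. This is a minimal sketch assuming a plain Spark 1.2+ installation; only the /tmp/movies path comes from the export step above.

```scala
// Paste into spark-shell; sc is the SparkContext the shell creates for you.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Load the exported Parquet files and print the schema shown above.
sqlContext.parquetFile("/tmp/movies").printSchema()
```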
Building Zeppelin for Apache Spark 1.2+
Start by cloning Zeppelin.
```bash
$ git clone https://github.com/apache/zeppelin.git
```
Build Zeppelin with the Hadoop 2.4 and Spark 1.2 profiles.
```bash
$ cd zeppelin
$ mvn clean package -Pspark-1.2 -Dhadoop.version=2.4.0 -Phadoop-2.4 -DskipTests
```
Now you should have working Zeppelin binaries.
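Out of the box, Zeppelin runs its embedded Spark in local mode, which is enough for this tutorial. If you would rather point it at an existing Spark installation, conf/zeppelin-env.sh is the place to do so; the sketch below uses a placeholder path, so adjust it to your own setup.

```bash
# conf/zeppelin-env.sh (start from conf/zeppelin-env.sh.template)
export SPARK_HOME=/path/to/spark   # placeholder: your Spark 1.2+ installation
export MASTER=local[*]             # or your cluster's master URL
```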
Preparing Zeppelin
First, start Zeppelin.
```bash
$ bin/zeppelin-daemon.sh start
```
By default, you should be able to access Zeppelin in your web browser at http://localhost:8080. Create a new notebook and put the following in the first cell.
```scala
sqlc.parquetFile("/tmp/movies").registerTempTable("events")
```
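Note that parquetFile is the Spark 1.2/1.3 API. If you built Zeppelin against Spark 1.4 or later, where parquetFile is deprecated, the equivalent first cell would be:

```scala
sqlc.read.parquet("/tmp/movies").registerTempTable("events")
```

On Spark 2.x, registerTempTable is in turn deprecated in favor of createOrReplaceTempView.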
Performing Analysis with Zeppelin
If all steps above ran successfully, you should have a ready-to-use analytics environment by now. Let's try a few examples to see if everything is functional.
In the second cell, paste the following code and run it.
```sql
%sql
SELECT entityType, event, targetEntityType, COUNT(*) AS c FROM events
GROUP BY entityType, event, targetEntityType
```
We can also easily plot a pie chart.
```sql
%sql
SELECT event, COUNT(*) AS c FROM events GROUP BY event
```
And see a breakdown of rating values.
```sql
%sql
SELECT properties.rating AS r, COUNT(*) AS c FROM events
WHERE properties.rating IS NOT NULL GROUP BY properties.rating ORDER BY r
```
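As a last example, queries do not have to live in %sql paragraphs: the registered events table is also reachable from plain Scala cells. Here is a minimal sketch of computing the mean rating, assuming the Spark 1.2 SchemaRDD API; the aggregate itself is an illustration, not part of the steps above.

```scala
// Run the same kind of query from a Scala cell against the registered table.
val ratings = sqlc.sql(
  "SELECT properties.rating FROM events WHERE properties.rating IS NOT NULL")

// The result is an RDD of Rows; pull out the doubles and average them.
val mean = ratings.map(_.getDouble(0)).mean()
println(f"mean rating: $mean%.2f")
```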
Happy analyzing!