PredictionIO has a thin storage layer to abstract meta data, event data, and model data access. The layer defines a set of standard interfaces to support multiple data store backends. PredictionIO users can configure the backend of their choice through configuration files or environment variables. Engine developers need not worry about the actual underlying storage architecture. Advanced developers can implement their own backend driver as an external library.
Concepts
In this section, we will visit some storage layer concepts that are common to users, engine developers, and advanced developers:

- Repository is the highest level of data access abstraction and is where all engines, and PredictionIO itself, access data.
- Source is the actual data store backend that provides data access. A source is an implementation of the set of data access interfaces defined by repositories.

Each of them is explained in detail below.
Repositories
Repository is the highest level of data access abstraction and is where all engines, and PredictionIO itself, access data.
The storage layer currently defines three mandatory data repositories: meta data, event data, and model data. Each repository has its own set of data access interfaces.
Meta data is used by PredictionIO to store engine training and evaluation information. Commands like pio build, pio train, pio deploy, and pio eval all access meta data.

Event data is used by the Event Server to collect events, and by engines to source data.
Model data is used by PredictionIO for automatic persistence of trained models.
The following configuration variables are used to configure these repositories:
- Meta data is configured by the PIO_STORAGE_REPOSITORIES_METADATA_XXX variables.
- Event data is configured by the PIO_STORAGE_REPOSITORIES_EVENTDATA_XXX variables.
- Model data is configured by the PIO_STORAGE_REPOSITORIES_MODELDATA_XXX variables.
Configuration variables are explained in more detail in the sections below (see Data Store Configuration).
For example, you may see the following configuration variables defined in conf/pio-env.sh:
```
PIO_STORAGE_REPOSITORIES_METADATA_NAME=predictionio_metadata
PIO_STORAGE_REPOSITORIES_METADATA_SOURCE=ELASTICSEARCH
PIO_STORAGE_REPOSITORIES_EVENTDATA_NAME=predictionio_eventdata
PIO_STORAGE_REPOSITORIES_EVENTDATA_SOURCE=HBASE
PIO_STORAGE_REPOSITORIES_MODELDATA_NAME=pio_
PIO_STORAGE_REPOSITORIES_MODELDATA_SOURCE=LOCALFS
```
The configuration variable with the NAME suffix controls the namespace used by the source.
The configuration variable with the SOURCE suffix points to the actual source that will back this repository. Sources are explained below.
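For instance, the SOURCE=LOCALFS line in the example above refers to a source named LOCALFS, so a matching source definition is expected elsewhere in conf/pio-env.sh. The following is a minimal sketch, assuming models should be stored under an arbitrary path such as /opt/pio/models; sources themselves are described in the next section:

```
# Hypothetical source definition backing the MODELDATA repository above;
# the path is an illustrative example, not a default.
PIO_STORAGE_SOURCES_LOCALFS_TYPE=localfs
PIO_STORAGE_SOURCES_LOCALFS_PATH=/opt/pio/models
```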
Sources
Sources are actual data store backends that provide data access. A source is an implementation of the set of data access interfaces defined by repositories.
PredictionIO comes with the following sources:
JDBC (tested on MySQL and PostgreSQL):
- Type name is jdbc.
- Can be used for Meta Data, Event Data, and Model Data repositories.

Elasticsearch:
- Type name is elasticsearch.
- Can be used for Meta Data repository.

Apache HBase:
- Type name is hbase.
- Can be used for Event Data repository.

Local file system:
- Type name is localfs.
- Can be used for Model Data repository.

HDFS:
- Type name is hdfs.
- Can be used for Model Data repository.

S3:
- Type name is s3.
- Can be used for Model Data repository.
Each repository can be configured to use different sources as shown above.
Each source has its own set of configuration parameters. Configuration variables are explained in more detail in the sections below (see Data Store Configuration).
The following is an example source configuration named "PGSQL" with type jdbc:
```
PIO_STORAGE_SOURCES_PGSQL_TYPE=jdbc
PIO_STORAGE_SOURCES_PGSQL_URL=jdbc:postgresql:predictionio
PIO_STORAGE_SOURCES_PGSQL_USERNAME=pio
PIO_STORAGE_SOURCES_PGSQL_PASSWORD=pio
```
The following is an example of using this source "PGSQL" for the meta data repository:
```
PIO_STORAGE_REPOSITORIES_METADATA_NAME=predictionio_metadata
PIO_STORAGE_REPOSITORIES_METADATA_SOURCE=PGSQL
```
Data Store Configuration
Data store configuration is done by setting environment variables. If you set them inside conf/pio-env.sh, they will be automatically available whenever you run a pio command, e.g. pio train.

Notice that all variables are prefixed by PIO_STORAGE_.
Repositories Configuration
Variable Format: PIO_STORAGE_REPOSITORIES_<REPO>_<KEY>
Configuration variables of repositories are prefixed by PIO_STORAGE_REPOSITORIES_, followed by the repository name (e.g. METADATA), and then either NAME or SOURCE.
Consider the following example:
```
PIO_STORAGE_REPOSITORIES_METADATA_NAME=predictionio_metadata
PIO_STORAGE_REPOSITORIES_METADATA_SOURCE=PGSQL
```
The above configures PredictionIO to look for a source configured with the name PGSQL, and to use predictionio_metadata as the namespace within that source. There is no restriction on how a source uses the namespace, so behavior may vary. As an example, the official JDBC source uses the namespace as a database table prefix.
Sources Configuration
Variable Format: PIO_STORAGE_SOURCES_<NAME>_<KEY>
Configuration variables of sources are prefixed by PIO_STORAGE_SOURCES_, followed by a source name of your choice (e.g. PGSQL, MYSQL, HBASE, etc.), and a configuration KEY.
Depending on what the source TYPE is, different configuration keys are required.
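To illustrate the naming pattern, a source arbitrarily named MYSQL backed by the jdbc type could be declared as follows; the URL and credentials are placeholder values:

```
# NAME is MYSQL; TYPE, URL, USERNAME, and PASSWORD are the KEYs
PIO_STORAGE_SOURCES_MYSQL_TYPE=jdbc
PIO_STORAGE_SOURCES_MYSQL_URL=jdbc:mysql://localhost:3306/predictionio
PIO_STORAGE_SOURCES_MYSQL_USERNAME=pio
PIO_STORAGE_SOURCES_MYSQL_PASSWORD=pio
```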
JDBC Configuration
Variable Format: PIO_STORAGE_SOURCES_[NAME]_TYPE=jdbc
Supported Repositories: meta, event, model
Tested on: MySQL 5.1+, PostgreSQL 9.1+
When TYPE is set to jdbc, the following configuration keys are supported.
URL (mandatory)
The value must be a valid JDBC URL that points to a database, e.g.
PIO_STORAGE_SOURCES_PGSQL_URL=jdbc:postgresql:predictionio
USERNAME (mandatory)
The value must be a valid, non-empty username for the JDBC connection, e.g.
PIO_STORAGE_SOURCES_PGSQL_USERNAME=pio_user
PASSWORD (mandatory)
The value must be a valid, non-empty password for the JDBC connection, e.g.
PIO_STORAGE_SOURCES_PGSQL_PASSWORD=pio_user_password
PARTITIONS (optional, defaults to 4)
This value is used by Apache Spark to determine the number of partitions to use when it reads from the JDBC connection, e.g.
PIO_STORAGE_SOURCES_PGSQL_PARTITIONS=4
CONNECTIONS (optional, defaults to 8)
This value is used by the ScalikeJDBC library to determine the maximum size of the connection pool, e.g.
PIO_STORAGE_SOURCES_PGSQL_CONNECTIONS=8
INDEX (optional, available since v0.9.6, defaults to disabled)
When enabled, indexes are created on the entityId and entityType columns to improve performance when the findByEntity function is called. Note that the entityId and entityType columns will be created as varchar(255), e.g.
PIO_STORAGE_SOURCES_PGSQL_INDEX=enabled
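Putting these keys together, a complete JDBC source definition might look like the following sketch, reusing the illustrative values from above (the last three keys are optional):

```
PIO_STORAGE_SOURCES_PGSQL_TYPE=jdbc
PIO_STORAGE_SOURCES_PGSQL_URL=jdbc:postgresql:predictionio
PIO_STORAGE_SOURCES_PGSQL_USERNAME=pio_user
PIO_STORAGE_SOURCES_PGSQL_PASSWORD=pio_user_password
PIO_STORAGE_SOURCES_PGSQL_PARTITIONS=4
PIO_STORAGE_SOURCES_PGSQL_CONNECTIONS=8
PIO_STORAGE_SOURCES_PGSQL_INDEX=enabled
```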
Apache HBase Configuration
Variable Format: PIO_STORAGE_SOURCES_[NAME]_TYPE=hbase
Supported Repositories: event
Tested on: Apache HBase 0.98.5+, 1.0.0+
When TYPE is set to hbase, no other configuration keys are required. Other client-side HBase configuration must be done through hbase-site.xml, pointed to by the HBASE_CONF_DIR configuration variable.
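A minimal HBase source definition therefore only needs the type; the HBASE_CONF_DIR path below is an assumed example location for hbase-site.xml, not a default:

```
PIO_STORAGE_SOURCES_HBASE_TYPE=hbase
# Assumed location of the directory containing hbase-site.xml
HBASE_CONF_DIR=/etc/hbase/conf
```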
Elasticsearch Configuration
Variable Format: PIO_STORAGE_SOURCES_[NAME]_TYPE=elasticsearch
Supported Repositories: meta
When TYPE is set to elasticsearch, the following configuration keys are supported.
HOSTS (mandatory)
Comma-separated list of hostnames, e.g.
PIO_STORAGE_SOURCES_ES_HOSTS=es1,es2,es3
PORTS (mandatory)
Comma-separated list of ports that correspond to HOSTS, e.g.
PIO_STORAGE_SOURCES_ES_PORTS=9200,9200,9222
CLUSTERNAME (optional, defaults to elasticsearch)
Elasticsearch cluster name, e.g.
PIO_STORAGE_SOURCES_ES_CLUSTERNAME=myescluster
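Putting these keys together with the repository variables shown earlier, an Elasticsearch-backed meta data store might be configured as follows, reusing the illustrative values above:

```
PIO_STORAGE_REPOSITORIES_METADATA_NAME=predictionio_metadata
PIO_STORAGE_REPOSITORIES_METADATA_SOURCE=ES
PIO_STORAGE_SOURCES_ES_TYPE=elasticsearch
PIO_STORAGE_SOURCES_ES_HOSTS=es1,es2,es3
PIO_STORAGE_SOURCES_ES_PORTS=9200,9200,9222
# CLUSTERNAME is optional
PIO_STORAGE_SOURCES_ES_CLUSTERNAME=myescluster
```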
Local File System Configuration
Variable Format: PIO_STORAGE_SOURCES_[NAME]_TYPE=localfs
Supported Repositories: model
When TYPE is set to localfs, the following configuration keys are supported.
PATH (mandatory)
File system path where models are stored, e.g.
PIO_STORAGE_SOURCES_FS_PATH=/mymodels
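Combined with the repository variables described earlier, a local file system model store could be wired up as follows, reusing the illustrative path above:

```
PIO_STORAGE_REPOSITORIES_MODELDATA_NAME=pio_
PIO_STORAGE_REPOSITORIES_MODELDATA_SOURCE=FS
PIO_STORAGE_SOURCES_FS_TYPE=localfs
PIO_STORAGE_SOURCES_FS_PATH=/mymodels
```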
HDFS Configuration
Variable Format: PIO_STORAGE_SOURCES_[NAME]_TYPE=hdfs
Supported Repositories: model
When TYPE is set to hdfs, the following configuration keys are supported.
PATH (mandatory)
HDFS path where models are stored, e.g.
PIO_STORAGE_SOURCES_HDFS_PATH=/mymodels
S3 Configuration
Variable Format: PIO_STORAGE_SOURCES_[NAME]_TYPE=s3
Supported Repositories: model
To provide authentication information, you can set the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables, or use one of the other methods in the AWS Setup Docs.
When TYPE is set to s3, the following configuration keys are supported.
REGION (mandatory)
AWS Region to use, e.g.
PIO_STORAGE_SOURCES_S3_REGION=us-east-1
BUCKET_NAME (mandatory)
S3 Bucket where models are stored, e.g.
PIO_STORAGE_SOURCES_S3_BUCKET_NAME=pio_bucket
BASE_PATH (optional)
S3 base path where models are stored, e.g.
PIO_STORAGE_SOURCES_S3_BASE_PATH=pio_model
DISABLE_CHUNKED_ENCODING (optional)
Disable the use of Chunked Encoding when transferring files to/from S3, e.g.
PIO_STORAGE_SOURCES_S3_DISABLE_CHUNKED_ENCODING=true
ENDPOINT (optional)
S3 Endpoint to use, e.g.
PIO_STORAGE_SOURCES_S3_ENDPOINT=http://localstack:4572
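Taken together, a sketch of an S3 model store definition might look like this; the region, bucket name, and base path are the illustrative values above, and only REGION and BUCKET_NAME are mandatory:

```
PIO_STORAGE_SOURCES_S3_TYPE=s3
PIO_STORAGE_SOURCES_S3_REGION=us-east-1
PIO_STORAGE_SOURCES_S3_BUCKET_NAME=pio_bucket
# Optional key
PIO_STORAGE_SOURCES_S3_BASE_PATH=pio_model
# Credentials come from AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY or another
# method from the AWS Setup Docs, not from PIO_STORAGE_* variables.
```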
Adding Support of Other Backends
It is quite straightforward to implement support of other backends. A good starting point is to reference the JDBC implementation inside the org.apache.predictionio.data.storage.jdbc package.
Contributions of different backend implementations are highly encouraged. To start contributing, please refer to this guide.
Deploying Your Custom Backend Support as a Plugin
It is possible to deploy your custom backend implementation as a standalone JAR apart from the main PredictionIO binary distribution. The following is an outline of how this can be achieved.
1. Create an SBT project with a library dependency on PredictionIO's data access base traits (inside the data artifact).
2. Implement the traits that you intend to support, and package everything into a big fat JAR (e.g. with sbt-assembly).
3. Create a directory named plugins inside the PredictionIO binary installation.
4. Copy the JAR from step 2 to plugins.
5. In your storage configuration, specify TYPE as your complete package name. As an example, if you have implemented all your traits under the package name org.mystorage.jdbc, use something like

   ```
   PIO_STORAGE_SOURCES_MYJDBC_TYPE=org.mystorage.jdbc
   ...
   PIO_STORAGE_REPOSITORIES_METADATA_SOURCE=MYJDBC
   ```

   to instruct PredictionIO to pick up StorageClient from the appropriate package.
6. Now you should be able to use your custom source and assign it to different repositories as you wish.
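As a rough sketch of steps 3 and 4, assuming a hypothetical assembly JAR named my-storage-assembly.jar and a PredictionIO installation located at a hypothetical $PIO_HOME:

```
# Place the assembly JAR produced in step 2 inside the plugins directory
mkdir -p $PIO_HOME/plugins
cp my-storage-assembly.jar $PIO_HOME/plugins/
```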