Provides building blocks for writing a complete prediction engine consisting of DataSource, Preparator, Algorithm, Serving, and Evaluation.
The starting point of a prediction engine is the Engine class.
The building blocks together form the DASE paradigm. Learn more about DASE in the PredictionIO documentation.
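As a rough illustration, an engine is typically declared through an EngineFactory that wires the four component classes together. The component names below (MyDataSource, MyPreparator, MyAlgorithm, MyServing) are hypothetical placeholders, matching the sketches later in this section:

```scala
import org.apache.predictionio.controller.{Engine, EngineFactory}

// Hypothetical engine factory wiring the DASE components together.
// "algo" is an arbitrary key used to refer to the algorithm in engine parameters.
object MyEngineFactory extends EngineFactory {
  def apply() = new Engine(
    classOf[MyDataSource],
    classOf[MyPreparator],
    Map("algo" -> classOf[MyAlgorithm]),
    classOf[MyServing])
}
```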
Depending on the problem you are solving, you will need to pick the appropriate flavors of these building blocks.
There are 3 typical engine configurations:

1. PDataSource, PPreparator, P2LAlgorithm, LServing
2. PDataSource, PPreparator, PAlgorithm, LServing
3. LDataSource, LPreparator, LAlgorithm, LServing
In both configurations 1 and 2, data is sourced and prepared in a parallelized fashion, with the data represented as RDDs.
The difference between configurations 1 and 2 comes at the algorithm stage. In configuration 1, the algorithm operates on potentially large data as RDDs in the Spark cluster, and eventually outputs a model that is small enough to fit on a single machine.
On the other hand, configuration 2 outputs a model that is potentially too large to fit on a single machine and must reside in the Spark cluster as one or more RDDs.
With configuration 1 (P2LAlgorithm), PredictionIO will automatically try to persist the model to local disk or HDFS if the model is serializable.
With configuration 2 (PAlgorithm), PredictionIO will not automatically try to persist the model, unless the model implements the PersistentModel trait.
In special circumstances where both the data and the model are small, configuration 3 may be used. Beware that RDDs cannot be used with configuration 3.
PDataSource is probably the most used data source base class, with the ability to process RDD-based data. LDataSource cannot handle RDD-based data; use it only when you have a special requirement.
With PDataSource, you must pick PPreparator. The same applies to LDataSource and LPreparator.
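A minimal sketch of a parallel data source and preparator pair might look like the following; the Query, PredictedResult, TrainingData, and PreparedData types are assumptions made up for illustration and are reused by the later sketches:

```scala
import org.apache.predictionio.controller.{PDataSource, PPreparator, EmptyEvaluationInfo, EmptyActualResult}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical engine-specific types, shared by the sketches below.
case class Query(user: String)
case class PredictedResult(score: Double)
case class TrainingData(ratings: RDD[(String, Double)])
case class PreparedData(ratings: RDD[(String, Double)])

class MyDataSource
  extends PDataSource[TrainingData, EmptyEvaluationInfo, Query, EmptyActualResult] {
  // Reads training data as an RDD living in the Spark cluster.
  def readTraining(sc: SparkContext): TrainingData =
    TrainingData(sc.parallelize(Seq(("alice", 4.0), ("bob", 2.0))))
}

class MyPreparator extends PPreparator[TrainingData, PreparedData] {
  // Transforms raw training data into whatever the algorithm consumes.
  def prepare(sc: SparkContext, td: TrainingData): PreparedData =
    PreparedData(td.ratings)
}
```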
The workhorse of the engine comes in 3 different flavors.
P2LAlgorithm produces, from PDataSource and PPreparator, a model that is small enough to fit on a single machine (a minimal skeleton is sketched below). The model cannot contain any RDD. If the produced model is serializable, PredictionIO will try to persist it automatically. In addition, P2LAlgorithm.batchPredict is already implemented for evaluation purposes.
PAlgorithm produces, from PDataSource and PPreparator, a model that could contain RDDs. PredictionIO will not try to persist it automatically unless the model implements PersistentModel. PAlgorithm.batchPredict must be implemented for evaluation.
LAlgorithm produces, from LDataSource and LPreparator, a model that is small enough to fit on a single machine. The model cannot contain any RDD. If the produced model is serializable, PredictionIO will try to persist it automatically. In addition, LAlgorithm.batchPredict is already implemented for evaluation purposes.
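For example, a minimal P2LAlgorithm could look like the sketch below, reusing the hypothetical Query, PredictedResult, and PreparedData types from the data source sketch above:

```scala
import org.apache.predictionio.controller.P2LAlgorithm
import org.apache.spark.SparkContext

// Hypothetical local model: a plain case class with no RDD inside, so it
// fits on a single machine and, being serializable, is persisted automatically.
case class MyModel(meanScore: Double)

class MyAlgorithm extends P2LAlgorithm[PreparedData, MyModel, Query, PredictedResult] {
  // Trains on RDD-based prepared data but collapses the result into a local model.
  def train(sc: SparkContext, pd: PreparedData): MyModel =
    MyModel(pd.ratings.values.mean())

  // Answers a single query using the local model.
  def predict(model: MyModel, query: Query): PredictedResult =
    PredictedResult(model.meanScore)
}
```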
The serving component comes in only one flavor: LServing. At the serving stage, it is assumed that the result being served is already at a human-consumable size.
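A minimal LServing sketch, again using the hypothetical Query and PredictedResult types from above:

```scala
import org.apache.predictionio.controller.LServing

class MyServing extends LServing[Query, PredictedResult] {
  // Combines the results from all algorithms of the engine;
  // here we simply return the first one.
  def serve(query: Query, predictions: Seq[PredictedResult]): PredictedResult =
    predictions.head
}
```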
PredictionIO tries its best to persist trained models automatically. Please refer to LAlgorithm.makePersistentModel, P2LAlgorithm.makePersistentModel, and PAlgorithm.makePersistentModel for descriptions of the different strategies.
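For instance, a PAlgorithm model that lives in the Spark cluster as an RDD can opt into persistence by implementing PersistentModel and providing a companion PersistentModelLoader. The sketch below is an assumption-laden illustration; the model class, parameter class, and storage path are all made up:

```scala
import org.apache.predictionio.controller.{Params, PersistentModel, PersistentModelLoader}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical algorithm parameters.
case class MyAlgorithmParams(rank: Int) extends Params

// A model backed by an RDD; PredictionIO will not persist it automatically,
// so it opts in by implementing PersistentModel.
class MyRDDModel(val scores: RDD[(String, Double)])
  extends PersistentModel[MyAlgorithmParams] {

  // Called after training; returns true if the model was saved successfully.
  def save(id: String, params: MyAlgorithmParams, sc: SparkContext): Boolean = {
    scores.saveAsObjectFile(s"/tmp/${id}/scores") // illustrative path only
    true
  }
}

// Companion loader used to restore the model before serving.
object MyRDDModel extends PersistentModelLoader[MyAlgorithmParams, MyRDDModel] {
  def apply(id: String, params: MyAlgorithmParams, sc: Option[SparkContext]): MyRDDModel =
    new MyRDDModel(sc.get.objectFile[(String, Double)](s"/tmp/${id}/scores"))
}
```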
Core base classes of PredictionIO controller components. Engine developers should not use these directly.
Provides data access for PredictionIO and any engines running on top of PredictionIO.
Independent library of code that is useful for engine development and evaluation.
PredictionIO Scala API