public class SparkDl4jMultiLayer extends SparkListenable
Modifier and Type | Field and Description |
---|---|
static int | DEFAULT_EVAL_SCORE_BATCH_SIZE |
static int | DEFAULT_ROC_THRESHOLD_STEPS |

Fields inherited from class SparkListenable: trainingMaster
Constructor and Description |
---|
SparkDl4jMultiLayer(org.apache.spark.api.java.JavaSparkContext sc, MultiLayerConfiguration conf, TrainingMaster<?,?> trainingMaster) Training constructor. |
SparkDl4jMultiLayer(org.apache.spark.api.java.JavaSparkContext javaSparkContext, MultiLayerNetwork network, TrainingMaster<?,?> trainingMaster) |
SparkDl4jMultiLayer(org.apache.spark.SparkContext sparkContext, MultiLayerConfiguration conf, TrainingMaster<?,?> trainingMaster) Training constructor. |
SparkDl4jMultiLayer(org.apache.spark.SparkContext sparkContext, MultiLayerNetwork network, TrainingMaster<?,?> trainingMaster) Instantiate a multi-layer Spark instance with the given context and network. |
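As a sketch, one of the training constructors might be wired up as follows. The `ParameterAveragingTrainingMaster` and all builder values shown here are assumptions for illustration (it is one available `TrainingMaster` implementation), not something prescribed by this class:

```java
import org.apache.spark.api.java.JavaSparkContext;
import org.deeplearning4j.nn.conf.MultiLayerConfiguration;
import org.deeplearning4j.spark.api.TrainingMaster;
import org.deeplearning4j.spark.impl.multilayer.SparkDl4jMultiLayer;
import org.deeplearning4j.spark.impl.paramavg.ParameterAveragingTrainingMaster;

class ConstructionSketch {
    // Build a SparkDl4jMultiLayer from an existing context and network
    // configuration. The TrainingMaster decides how distributed training
    // is coordinated across workers.
    static SparkDl4jMultiLayer create(JavaSparkContext sc, MultiLayerConfiguration conf) {
        TrainingMaster<?, ?> tm = new ParameterAveragingTrainingMaster.Builder(32) // examples per DataSet object (illustrative)
                .batchSizePerWorker(32)    // minibatch size on each worker (illustrative)
                .averagingFrequency(5)     // average parameters every 5 minibatches (illustrative)
                .build();
        return new SparkDl4jMultiLayer(sc, conf, tm);
    }
}
```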
Modifier and Type | Method and Description |
---|---|
double | calculateScore(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data, boolean average) Calculate the score for all examples in the provided JavaRDD<DataSet>, either by summing or averaging over the entire data set. |
double | calculateScore(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data, boolean average, int minibatchSize) Calculate the score for all examples in the provided JavaRDD<DataSet>, either by summing or averaging over the entire data set. |
double | calculateScore(org.apache.spark.rdd.RDD<org.nd4j.linalg.dataset.DataSet> data, boolean average) |
<T extends IEvaluation> T | doEvaluation(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data, T emptyEvaluation, int evalBatchSize) Perform distributed evaluation of any type of IEvaluation. |
Evaluation | evaluate(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data) Evaluate the network (classification performance) in a distributed manner on the provided data. |
Evaluation | evaluate(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data, java.util.List<java.lang.String> labelsList) Evaluate the network (classification performance) in a distributed manner, using the default batch size and a provided list of labels. |
Evaluation | evaluate(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data, java.util.List<java.lang.String> labelsList, int evalBatchSize) Evaluate the network (classification performance) in a distributed manner, using a specified batch size and a provided list of labels. |
Evaluation | evaluate(org.apache.spark.rdd.RDD<org.nd4j.linalg.dataset.DataSet> data) RDD<DataSet> overload of evaluate(JavaRDD) |
Evaluation | evaluate(org.apache.spark.rdd.RDD<org.nd4j.linalg.dataset.DataSet> data, java.util.List<java.lang.String> labelsList) RDD<DataSet> overload of evaluate(JavaRDD, List) |
RegressionEvaluation | evaluateRegression(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data) Evaluate the network (regression performance) in a distributed manner on the provided data. |
RegressionEvaluation | evaluateRegression(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data, int minibatchSize) Evaluate the network (regression performance) in a distributed manner on the provided data. |
ROC | evaluateROC(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data) Perform ROC analysis/evaluation on the given DataSet in a distributed manner, using the default number of threshold steps (DEFAULT_ROC_THRESHOLD_STEPS) and the default minibatch size (DEFAULT_EVAL_SCORE_BATCH_SIZE). |
ROC | evaluateROC(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data, int thresholdSteps, int evaluationMinibatchSize) Perform ROC analysis/evaluation on the given DataSet in a distributed manner. |
ROCMultiClass | evaluateROCMultiClass(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data) Perform ROC analysis/evaluation (for the multi-class case, using ROCMultiClass) on the given DataSet in a distributed manner. |
ROCMultiClass | evaluateROCMultiClass(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data, int thresholdSteps, int evaluationMinibatchSize) Perform ROC analysis/evaluation (for the multi-class case, using ROCMultiClass) on the given DataSet in a distributed manner. |
<K> org.apache.spark.api.java.JavaPairRDD<K,org.nd4j.linalg.api.ndarray.INDArray> | feedForwardWithKey(org.apache.spark.api.java.JavaPairRDD<K,org.nd4j.linalg.api.ndarray.INDArray> featuresData, int batchSize) Feed-forward the specified data, with the given keys. |
MultiLayerNetwork | fit(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> trainingData) Fit the DataSet RDD. |
MultiLayerNetwork | fit(org.apache.spark.rdd.RDD<org.nd4j.linalg.dataset.DataSet> trainingData) Fit the DataSet RDD. |
MultiLayerNetwork | fit(java.lang.String path) Fit the SparkDl4jMultiLayer network using a directory of serialized DataSet objects. The assumption here is that the directory contains a number of DataSet objects, each serialized using DataSet.save(OutputStream). |
MultiLayerNetwork | fit(java.lang.String path, int minPartitions) Deprecated. Use fit(String) |
MultiLayerNetwork | fitContinuousLabeledPoint(org.apache.spark.api.java.JavaRDD<org.apache.spark.mllib.regression.LabeledPoint> rdd) Fit a MultiLayerNetwork using Spark MLlib LabeledPoint instances. This will convert labeled points with continuous labels (used for regression) to the internal DL4J data format and train the model on that. |
MultiLayerNetwork | fitLabeledPoint(org.apache.spark.api.java.JavaRDD<org.apache.spark.mllib.regression.LabeledPoint> rdd) Fit a MultiLayerNetwork using Spark MLlib LabeledPoint instances. |
MultiLayerNetwork | fitPaths(org.apache.spark.api.java.JavaRDD<java.lang.String> paths) Fit the network using a list of paths for serialized DataSet objects. |
MultiLayerNetwork | getNetwork() |
double | getScore() Gets the last (average) minibatch score from calling fit. |
org.apache.spark.api.java.JavaSparkContext | getSparkContext() |
SparkTrainingStats | getSparkTrainingStats() Get the training statistics, after collection of stats has been enabled using setCollectTrainingStats(boolean). |
TrainingMaster | getTrainingMaster() |
org.apache.spark.mllib.linalg.Matrix | predict(org.apache.spark.mllib.linalg.Matrix features) Predict the given feature matrix. |
org.apache.spark.mllib.linalg.Vector | predict(org.apache.spark.mllib.linalg.Vector point) Predict the given vector. |
<K> org.apache.spark.api.java.JavaPairRDD<K,java.lang.Double> | scoreExamples(org.apache.spark.api.java.JavaPairRDD<K,org.nd4j.linalg.dataset.DataSet> data, boolean includeRegularizationTerms) Score the examples individually, using the default batch size DEFAULT_EVAL_SCORE_BATCH_SIZE. |
<K> org.apache.spark.api.java.JavaPairRDD<K,java.lang.Double> | scoreExamples(org.apache.spark.api.java.JavaPairRDD<K,org.nd4j.linalg.dataset.DataSet> data, boolean includeRegularizationTerms, int batchSize) Score the examples individually, using a specified batch size. |
org.apache.spark.api.java.JavaDoubleRDD | scoreExamples(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data, boolean includeRegularizationTerms) Score the examples individually, using the default batch size DEFAULT_EVAL_SCORE_BATCH_SIZE. |
org.apache.spark.api.java.JavaDoubleRDD | scoreExamples(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data, boolean includeRegularizationTerms, int batchSize) Score the examples individually, using a specified batch size. |
org.apache.spark.api.java.JavaDoubleRDD | scoreExamples(org.apache.spark.rdd.RDD<org.nd4j.linalg.dataset.DataSet> data, boolean includeRegularizationTerms) RDD<DataSet> overload of scoreExamples(JavaPairRDD, boolean) |
org.apache.spark.api.java.JavaDoubleRDD | scoreExamples(org.apache.spark.rdd.RDD<org.nd4j.linalg.dataset.DataSet> data, boolean includeRegularizationTerms, int batchSize) RDD<DataSet> overload of scoreExamples(JavaRDD, boolean, int) |
void | setCollectTrainingStats(boolean collectTrainingStats) Set whether training statistics should be collected for debugging purposes. |
void | setNetwork(MultiLayerNetwork network) Set the network that underlies this SparkDl4jMultiLayer instance. |
void | setScore(double lastScore) |

Methods inherited from class SparkListenable: setListeners, setListeners, setListeners, setListeners
public static final int DEFAULT_EVAL_SCORE_BATCH_SIZE
public static final int DEFAULT_ROC_THRESHOLD_STEPS
public SparkDl4jMultiLayer(org.apache.spark.SparkContext sparkContext, MultiLayerNetwork network, TrainingMaster<?,?> trainingMaster)
Instantiate a multi-layer Spark instance with the given context and network.
sparkContext - the spark context to use
network - the network to use

public SparkDl4jMultiLayer(org.apache.spark.SparkContext sparkContext, MultiLayerConfiguration conf, TrainingMaster<?,?> trainingMaster)
Training constructor.
sparkContext - the spark context to use
conf - the configuration of the network

public SparkDl4jMultiLayer(org.apache.spark.api.java.JavaSparkContext sc, MultiLayerConfiguration conf, TrainingMaster<?,?> trainingMaster)
Training constructor.
sc - the spark context to use
conf - the configuration of the network

public SparkDl4jMultiLayer(org.apache.spark.api.java.JavaSparkContext javaSparkContext, MultiLayerNetwork network, TrainingMaster<?,?> trainingMaster)
public org.apache.spark.api.java.JavaSparkContext getSparkContext()

public MultiLayerNetwork getNetwork()

public TrainingMaster getTrainingMaster()

public void setNetwork(MultiLayerNetwork network)
network - network to set

public void setCollectTrainingStats(boolean collectTrainingStats)
collectTrainingStats - If true: collect training statistics. If false: don't collect.

public SparkTrainingStats getSparkTrainingStats()
See also: setCollectTrainingStats(boolean)

public org.apache.spark.mllib.linalg.Matrix predict(org.apache.spark.mllib.linalg.Matrix features)
features - the given feature matrix

public org.apache.spark.mllib.linalg.Vector predict(org.apache.spark.mllib.linalg.Vector point)
point - the vector to predict

public MultiLayerNetwork fit(org.apache.spark.rdd.RDD<org.nd4j.linalg.dataset.DataSet> trainingData)
trainingData - the training data RDD to fit

public MultiLayerNetwork fit(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> trainingData)
trainingData - the training data RDD to fit

public MultiLayerNetwork fit(java.lang.String path)
Fit the SparkDl4jMultiLayer network using a directory of serialized DataSet objects. The assumption here is that the directory contains a number of DataSet objects, each serialized using DataSet.save(OutputStream).
path - Path to the directory containing the serialized DataSet objects

@Deprecated public MultiLayerNetwork fit(java.lang.String path, int minPartitions)
Deprecated. Use fit(String)

public MultiLayerNetwork fitPaths(org.apache.spark.api.java.JavaRDD<java.lang.String> paths)
paths - List of paths

public MultiLayerNetwork fitLabeledPoint(org.apache.spark.api.java.JavaRDD<org.apache.spark.mllib.regression.LabeledPoint> rdd)
rdd - the rdd to fit

public MultiLayerNetwork fitContinuousLabeledPoint(org.apache.spark.api.java.JavaRDD<org.apache.spark.mllib.regression.LabeledPoint> rdd)
rdd - the JavaRDD containing the labeled points

public double getScore()

public void setScore(double lastScore)
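The fit methods above can be combined into a simple training loop. This is a minimal sketch only; the epoch loop, the helper name `trainEpochs`, and the assumption that a `SparkDl4jMultiLayer` was already constructed with a `TrainingMaster` are all illustrative:

```java
import org.apache.spark.api.java.JavaRDD;
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.deeplearning4j.spark.impl.multilayer.SparkDl4jMultiLayer;
import org.nd4j.linalg.dataset.DataSet;

class TrainingSketch {
    // Run several passes of distributed fitting over a DataSet RDD.
    static MultiLayerNetwork trainEpochs(SparkDl4jMultiLayer sparkNet,
                                         JavaRDD<DataSet> trainingData,
                                         int numEpochs) {
        MultiLayerNetwork trained = null;
        for (int i = 0; i < numEpochs; i++) {
            // fit returns the trained MultiLayerNetwork; the wrapped network's
            // parameters are also updated in place
            trained = sparkNet.fit(trainingData);
        }
        return trained;
    }
}
```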
public double calculateScore(org.apache.spark.rdd.RDD<org.nd4j.linalg.dataset.DataSet> data, boolean average)

public double calculateScore(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data, boolean average)
Calculate the score for all examples in the provided JavaRDD<DataSet>, either by summing or averaging over the entire data set. To calculate a score for each example individually, use scoreExamples(JavaPairRDD, boolean) or one of the similar methods. Uses the default minibatch size in each worker, DEFAULT_EVAL_SCORE_BATCH_SIZE.
data - Data to score
average - Whether to sum the scores, or average them

public double calculateScore(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data, boolean average, int minibatchSize)
Calculate the score for all examples in the provided JavaRDD<DataSet>, either by summing or averaging over the entire data set. To calculate a score for each example individually, use scoreExamples(JavaPairRDD, boolean) or one of the similar methods.
data - Data to score
average - Whether to sum the scores, or average them
minibatchSize - The number of examples to use in each minibatch when scoring. If more examples are in a partition than this, multiple scoring operations will be done (to avoid using too much memory by doing the whole partition in one go)

public org.apache.spark.api.java.JavaDoubleRDD scoreExamples(org.apache.spark.rdd.RDD<org.nd4j.linalg.dataset.DataSet> data, boolean includeRegularizationTerms)
RDD<DataSet> overload of scoreExamples(JavaPairRDD, boolean)

public org.apache.spark.api.java.JavaDoubleRDD scoreExamples(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data, boolean includeRegularizationTerms)
Score the examples individually, using the default batch size DEFAULT_EVAL_SCORE_BATCH_SIZE. Unlike calculateScore(JavaRDD, boolean), this method returns a score for each example separately. If scoring is needed for specific examples, use either scoreExamples(JavaPairRDD, boolean) or scoreExamples(JavaPairRDD, boolean, int), which can have a key for each example.
data - Data to score
includeRegularizationTerms - If true: include the l1/l2 regularization terms with the score (if any)
See also: MultiLayerNetwork.scoreExamples(DataSet, boolean)

public org.apache.spark.api.java.JavaDoubleRDD scoreExamples(org.apache.spark.rdd.RDD<org.nd4j.linalg.dataset.DataSet> data, boolean includeRegularizationTerms, int batchSize)
RDD<DataSet> overload of scoreExamples(JavaRDD, boolean, int)

public org.apache.spark.api.java.JavaDoubleRDD scoreExamples(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data, boolean includeRegularizationTerms, int batchSize)
Score the examples individually, using a specified batch size. Unlike calculateScore(JavaRDD, boolean), this method returns a score for each example separately. If scoring is needed for specific examples, use either scoreExamples(JavaPairRDD, boolean) or scoreExamples(JavaPairRDD, boolean, int), which can have a key for each example.
data - Data to score
includeRegularizationTerms - If true: include the l1/l2 regularization terms with the score (if any)
batchSize - Batch size to use when doing scoring
See also: MultiLayerNetwork.scoreExamples(DataSet, boolean)

public <K> org.apache.spark.api.java.JavaPairRDD<K,java.lang.Double> scoreExamples(org.apache.spark.api.java.JavaPairRDD<K,org.nd4j.linalg.dataset.DataSet> data, boolean includeRegularizationTerms)
Score the examples individually, using the default batch size DEFAULT_EVAL_SCORE_BATCH_SIZE. Unlike calculateScore(JavaRDD, boolean), this method returns a score for each example separately.
K - Key type
data - Data to score
includeRegularizationTerms - If true: include the l1/l2 regularization terms with the score (if any)
Returns: JavaPairRDD<K,Double> containing the scores of each example
See also: MultiLayerNetwork.scoreExamples(DataSet, boolean)

public <K> org.apache.spark.api.java.JavaPairRDD<K,java.lang.Double> scoreExamples(org.apache.spark.api.java.JavaPairRDD<K,org.nd4j.linalg.dataset.DataSet> data, boolean includeRegularizationTerms, int batchSize)
Score the examples individually, using a specified batch size. Unlike calculateScore(JavaRDD, boolean), this method returns a score for each example separately.
K - Key type
data - Data to score
includeRegularizationTerms - If true: include the l1/l2 regularization terms with the score (if any)
Returns: JavaPairRDD<K,Double> containing the scores of each example
See also: MultiLayerNetwork.scoreExamples(DataSet, boolean)
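The keyed scoreExamples variant keeps a caller-supplied key attached to each example's score, so results can be joined back to source records. A sketch, where the String-ID key type, the helper name `scoreById`, and the batch size 64 are all assumptions:

```java
import org.apache.spark.api.java.JavaPairRDD;
import org.deeplearning4j.spark.impl.multilayer.SparkDl4jMultiLayer;
import org.nd4j.linalg.dataset.DataSet;

class ScoringSketch {
    // Score examples keyed by an ID, returning (id, score) pairs.
    static JavaPairRDD<String, Double> scoreById(SparkDl4jMultiLayer sparkNet,
                                                 JavaPairRDD<String, DataSet> keyedData) {
        // true: include the l1/l2 regularization terms in each score;
        // 64: a hypothetical per-worker scoring batch size
        return sparkNet.scoreExamples(keyedData, true, 64);
    }
}
```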
public <K> org.apache.spark.api.java.JavaPairRDD<K,org.nd4j.linalg.api.ndarray.INDArray> feedForwardWithKey(org.apache.spark.api.java.JavaPairRDD<K,org.nd4j.linalg.api.ndarray.INDArray> featuresData, int batchSize)
Feed-forward the specified data, with the given keys.
K - Type of data for key - may be anything
featuresData - Features data to feed through the network
batchSize - Batch size to use when doing feed forward operations

public Evaluation evaluate(org.apache.spark.rdd.RDD<org.nd4j.linalg.dataset.DataSet> data)
RDD<DataSet> overload of evaluate(JavaRDD)

public Evaluation evaluate(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data)
Evaluate the network (classification performance) in a distributed manner on the provided data.
data - Data to evaluate on

public Evaluation evaluate(org.apache.spark.rdd.RDD<org.nd4j.linalg.dataset.DataSet> data, java.util.List<java.lang.String> labelsList)
RDD<DataSet> overload of evaluate(JavaRDD, List)
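Distributed classification evaluation might look like the following sketch; the helper name `evaluationReport` is illustrative, and `Evaluation.stats()` is the usual way to render the accumulated metrics as text:

```java
import org.apache.spark.api.java.JavaRDD;
import org.deeplearning4j.eval.Evaluation;
import org.deeplearning4j.spark.impl.multilayer.SparkDl4jMultiLayer;
import org.nd4j.linalg.dataset.DataSet;

class EvalSketch {
    // Evaluate classification performance over a distributed test set and
    // return a human-readable summary (accuracy, precision, recall, F1, ...).
    static String evaluationReport(SparkDl4jMultiLayer sparkNet,
                                   JavaRDD<DataSet> testData) {
        Evaluation eval = sparkNet.evaluate(testData);
        return eval.stats();
    }
}
```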
public RegressionEvaluation evaluateRegression(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data)
Evaluate the network (regression performance) in a distributed manner on the provided data.
data - Data to evaluate
Returns: RegressionEvaluation instance with regression performance

public RegressionEvaluation evaluateRegression(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data, int minibatchSize)
Evaluate the network (regression performance) in a distributed manner on the provided data.
data - Data to evaluate
minibatchSize - Minibatch size to use when performing evaluation
Returns: RegressionEvaluation instance with regression performance

public Evaluation evaluate(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data, java.util.List<java.lang.String> labelsList)
Evaluate the network (classification performance) in a distributed manner, using the default batch size and a provided list of labels.
data - Data to evaluate on
labelsList - List of labels used for evaluation

public ROC evaluateROC(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data)
Perform ROC analysis/evaluation on the given DataSet in a distributed manner, using the default number of threshold steps (DEFAULT_ROC_THRESHOLD_STEPS) and the default minibatch size (DEFAULT_EVAL_SCORE_BATCH_SIZE).
data - Test set data (to evaluate on)

public ROC evaluateROC(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data, int thresholdSteps, int evaluationMinibatchSize)
Perform ROC analysis/evaluation on the given DataSet in a distributed manner.
data - Test set data (to evaluate on)
thresholdSteps - Number of threshold steps for ROC - see ROC
evaluationMinibatchSize - Minibatch size to use when performing ROC evaluation

public ROCMultiClass evaluateROCMultiClass(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data)
Perform ROC analysis/evaluation (for the multi-class case, using ROCMultiClass) on the given DataSet in a distributed manner.
data - Test set data (to evaluate on)

public ROCMultiClass evaluateROCMultiClass(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data, int thresholdSteps, int evaluationMinibatchSize)
Perform ROC analysis/evaluation (for the multi-class case, using ROCMultiClass) on the given DataSet in a distributed manner.
data - Test set data (to evaluate on)
thresholdSteps - Number of threshold steps for ROC - see ROC
evaluationMinibatchSize - Minibatch size to use when performing ROC evaluation

public Evaluation evaluate(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data, java.util.List<java.lang.String> labelsList, int evalBatchSize)
Evaluate the network (classification performance) in a distributed manner, using a specified batch size and a provided list of labels.
data - Data to evaluate on
labelsList - List of labels used for evaluation
evalBatchSize - Batch size to use when conducting evaluations

public <T extends IEvaluation> T doEvaluation(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data, T emptyEvaluation, int evalBatchSize)
Perform distributed evaluation of any type of IEvaluation. For example: Evaluation, RegressionEvaluation, ROC, ROCMultiClass, etc.
T - Type of evaluation instance to return
data - Data to evaluate on
emptyEvaluation - Empty evaluation instance. This is the starting point (serialized/duplicated, then merged)
evalBatchSize - Evaluation batch size
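doEvaluation generalizes the specific evaluate/evaluateROC methods: the caller supplies an empty IEvaluation instance, which is duplicated to the workers and the partial results merged. A sketch, where the 30 threshold steps (assuming ROC exposes a threshold-step constructor argument) and the batch size 64 are illustrative choices:

```java
import org.apache.spark.api.java.JavaRDD;
import org.deeplearning4j.eval.ROC;
import org.deeplearning4j.spark.impl.multilayer.SparkDl4jMultiLayer;
import org.nd4j.linalg.dataset.DataSet;

class DoEvaluationSketch {
    // Distributed ROC analysis via the generic doEvaluation entry point.
    static ROC distributedRoc(SparkDl4jMultiLayer sparkNet,
                              JavaRDD<DataSet> testData) {
        ROC emptyRoc = new ROC(30);               // starting-point instance
        return sparkNet.doEvaluation(testData, emptyRoc, 64);
    }
}
```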