public class SparkDl4jMultiLayer extends SparkListenable
| Modifier and Type | Field and Description |
|---|---|
| static int | DEFAULT_EVAL_SCORE_BATCH_SIZE |
| static int | DEFAULT_ROC_THRESHOLD_STEPS |

Fields inherited from class SparkListenable: trainingMaster

| Constructor and Description |
|---|
| SparkDl4jMultiLayer(org.apache.spark.api.java.JavaSparkContext sc, MultiLayerConfiguration conf, TrainingMaster<?,?> trainingMaster) Training constructor. |
| SparkDl4jMultiLayer(org.apache.spark.api.java.JavaSparkContext javaSparkContext, MultiLayerNetwork network, TrainingMaster<?,?> trainingMaster) |
| SparkDl4jMultiLayer(org.apache.spark.SparkContext sparkContext, MultiLayerConfiguration conf, TrainingMaster<?,?> trainingMaster) Training constructor. |
| SparkDl4jMultiLayer(org.apache.spark.SparkContext sparkContext, MultiLayerNetwork network, TrainingMaster<?,?> trainingMaster) Instantiate a multi-layer Spark instance with the given context and network. |
| Modifier and Type | Method and Description |
|---|---|
| double | calculateScore(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data, boolean average) Calculate the score for all examples in the provided JavaRDD<DataSet>, either by summing or averaging over the entire data set. |
| double | calculateScore(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data, boolean average, int minibatchSize) Calculate the score for all examples in the provided JavaRDD<DataSet>, either by summing or averaging over the entire data set. |
| double | calculateScore(org.apache.spark.rdd.RDD<org.nd4j.linalg.dataset.DataSet> data, boolean average) |
| <T extends IEvaluation> T | doEvaluation(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data, T emptyEvaluation, int evalBatchSize) Perform distributed evaluation of any type of IEvaluation. |
| Evaluation | evaluate(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data) Evaluate the network (classification performance) in a distributed manner on the provided data |
| Evaluation | evaluate(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data, java.util.List<java.lang.String> labelsList) Evaluate the network (classification performance) in a distributed manner, using the default batch size and a provided list of labels |
| Evaluation | evaluate(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data, java.util.List<java.lang.String> labelsList, int evalBatchSize) Evaluate the network (classification performance) in a distributed manner, using the specified batch size and a provided list of labels |
| Evaluation | evaluate(org.apache.spark.rdd.RDD<org.nd4j.linalg.dataset.DataSet> data) RDD<DataSet> overload of evaluate(JavaRDD) |
| Evaluation | evaluate(org.apache.spark.rdd.RDD<org.nd4j.linalg.dataset.DataSet> data, java.util.List<java.lang.String> labelsList) RDD<DataSet> overload of evaluate(JavaRDD, List) |
| RegressionEvaluation | evaluateRegression(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data) Evaluate the network (regression performance) in a distributed manner on the provided data |
| RegressionEvaluation | evaluateRegression(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data, int minibatchSize) Evaluate the network (regression performance) in a distributed manner on the provided data |
| ROC | evaluateROC(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data) Perform ROC analysis/evaluation on the given DataSet in a distributed manner, using the default number of threshold steps (DEFAULT_ROC_THRESHOLD_STEPS) and the default minibatch size (DEFAULT_EVAL_SCORE_BATCH_SIZE) |
| ROC | evaluateROC(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data, int thresholdSteps, int evaluationMinibatchSize) Perform ROC analysis/evaluation on the given DataSet in a distributed manner |
| ROCMultiClass | evaluateROCMultiClass(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data) Perform ROC analysis/evaluation (for the multi-class case, using ROCMultiClass) on the given DataSet in a distributed manner |
| ROCMultiClass | evaluateROCMultiClass(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data, int thresholdSteps, int evaluationMinibatchSize) Perform ROC analysis/evaluation (for the multi-class case, using ROCMultiClass) on the given DataSet in a distributed manner |
| <K> org.apache.spark.api.java.JavaPairRDD<K,org.nd4j.linalg.api.ndarray.INDArray> | feedForwardWithKey(org.apache.spark.api.java.JavaPairRDD<K,org.nd4j.linalg.api.ndarray.INDArray> featuresData, int batchSize) Feed-forward the specified data, with the given keys. |
| MultiLayerNetwork | fit(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> trainingData) Fit the DataSet RDD |
| MultiLayerNetwork | fit(org.apache.spark.rdd.RDD<org.nd4j.linalg.dataset.DataSet> trainingData) Fit the DataSet RDD. |
| MultiLayerNetwork | fit(java.lang.String path) Fit the SparkDl4jMultiLayer network using a directory of serialized DataSet objects. The assumption here is that the directory contains a number of DataSet objects, each serialized using DataSet.save(OutputStream) |
| MultiLayerNetwork | fit(java.lang.String path, int minPartitions) Deprecated. Use fit(String) |
| MultiLayerNetwork | fitContinuousLabeledPoint(org.apache.spark.api.java.JavaRDD<org.apache.spark.mllib.regression.LabeledPoint> rdd) Fits a MultiLayerNetwork using Spark MLLib LabeledPoint instances. This will convert labeled points that have continuous labels used for regression to the internal DL4J data format and train the model on that |
| MultiLayerNetwork | fitLabeledPoint(org.apache.spark.api.java.JavaRDD<org.apache.spark.mllib.regression.LabeledPoint> rdd) Fit a MultiLayerNetwork using Spark MLLib LabeledPoint instances. |
| MultiLayerNetwork | fitPaths(org.apache.spark.api.java.JavaRDD<java.lang.String> paths) Fit the network using a list of paths for serialized DataSet objects. |
| MultiLayerNetwork | getNetwork() |
| double | getScore() Gets the last (average) minibatch score from calling fit. |
| org.apache.spark.api.java.JavaSparkContext | getSparkContext() |
| SparkTrainingStats | getSparkTrainingStats() Get the training statistics, after collection of stats has been enabled using setCollectTrainingStats(boolean) |
| TrainingMaster | getTrainingMaster() |
| org.apache.spark.mllib.linalg.Matrix | predict(org.apache.spark.mllib.linalg.Matrix features) Predict the given feature matrix |
| org.apache.spark.mllib.linalg.Vector | predict(org.apache.spark.mllib.linalg.Vector point) Predict the given vector |
| <K> org.apache.spark.api.java.JavaPairRDD<K,java.lang.Double> | scoreExamples(org.apache.spark.api.java.JavaPairRDD<K,org.nd4j.linalg.dataset.DataSet> data, boolean includeRegularizationTerms) Score the examples individually, using the default batch size DEFAULT_EVAL_SCORE_BATCH_SIZE. |
| <K> org.apache.spark.api.java.JavaPairRDD<K,java.lang.Double> | scoreExamples(org.apache.spark.api.java.JavaPairRDD<K,org.nd4j.linalg.dataset.DataSet> data, boolean includeRegularizationTerms, int batchSize) Score the examples individually, using a specified batch size. |
| org.apache.spark.api.java.JavaDoubleRDD | scoreExamples(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data, boolean includeRegularizationTerms) Score the examples individually, using the default batch size DEFAULT_EVAL_SCORE_BATCH_SIZE. |
| org.apache.spark.api.java.JavaDoubleRDD | scoreExamples(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data, boolean includeRegularizationTerms, int batchSize) Score the examples individually, using a specified batch size. |
| org.apache.spark.api.java.JavaDoubleRDD | scoreExamples(org.apache.spark.rdd.RDD<org.nd4j.linalg.dataset.DataSet> data, boolean includeRegularizationTerms) RDD<DataSet> overload of scoreExamples(JavaPairRDD, boolean) |
| org.apache.spark.api.java.JavaDoubleRDD | scoreExamples(org.apache.spark.rdd.RDD<org.nd4j.linalg.dataset.DataSet> data, boolean includeRegularizationTerms, int batchSize) RDD<DataSet> overload of scoreExamples(JavaRDD, boolean, int) |
| void | setCollectTrainingStats(boolean collectTrainingStats) Set whether training statistics should be collected for debugging purposes. |
| void | setNetwork(MultiLayerNetwork network) Set the network that underlies this SparkDl4jMultiLayer instance |
| void | setScore(double lastScore) |

Methods inherited from class SparkListenable: setListeners, setListeners, setListeners, setListeners

public static final int DEFAULT_EVAL_SCORE_BATCH_SIZE
public static final int DEFAULT_ROC_THRESHOLD_STEPS
public SparkDl4jMultiLayer(org.apache.spark.SparkContext sparkContext, MultiLayerNetwork network, TrainingMaster<?,?> trainingMaster)
Instantiate a multi-layer Spark instance with the given context and network.
sparkContext - the spark context to use
network - the network to use

public SparkDl4jMultiLayer(org.apache.spark.SparkContext sparkContext, MultiLayerConfiguration conf, TrainingMaster<?,?> trainingMaster)
Training constructor.
sparkContext - the spark context to use
conf - the configuration of the network

public SparkDl4jMultiLayer(org.apache.spark.api.java.JavaSparkContext sc, MultiLayerConfiguration conf, TrainingMaster<?,?> trainingMaster)
Training constructor.
sc - the spark context to use
conf - the configuration of the network

public SparkDl4jMultiLayer(org.apache.spark.api.java.JavaSparkContext javaSparkContext, MultiLayerNetwork network, TrainingMaster<?,?> trainingMaster)
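As a sketch of typical usage of the training constructor, assuming the ParameterAveragingTrainingMaster implementation of TrainingMaster (the layer sizes, batch sizes, and averaging frequency below are illustrative placeholders, not values prescribed by this class):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.deeplearning4j.nn.conf.MultiLayerConfiguration;
import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
import org.deeplearning4j.nn.conf.layers.OutputLayer;
import org.deeplearning4j.spark.api.TrainingMaster;
import org.deeplearning4j.spark.impl.multilayer.SparkDl4jMultiLayer;
import org.deeplearning4j.spark.impl.paramavg.ParameterAveragingTrainingMaster;

public class SparkNetSetup {
    public static void main(String[] args) {
        // Local Spark context for illustration; on a cluster the master is set externally
        SparkConf sparkConf = new SparkConf().setMaster("local[*]").setAppName("dl4j-spark");
        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        // Minimal network configuration (sizes are placeholders)
        MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
                .list()
                .layer(0, new OutputLayer.Builder().nIn(784).nOut(10).build())
                .build();

        // The TrainingMaster defines how distributed fitting is executed
        TrainingMaster<?, ?> tm = new ParameterAveragingTrainingMaster.Builder(32)
                .batchSizePerWorker(32)
                .averagingFrequency(5)
                .build();

        // Training constructor: JavaSparkContext + configuration + TrainingMaster
        SparkDl4jMultiLayer sparkNet = new SparkDl4jMultiLayer(sc, conf, tm);
    }
}
```

The choice of TrainingMaster, not this class, determines how work is partitioned across executors and how parameters are combined.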
public org.apache.spark.api.java.JavaSparkContext getSparkContext()
public MultiLayerNetwork getNetwork()
public TrainingMaster getTrainingMaster()
public void setNetwork(MultiLayerNetwork network)
network - network to set

public void setCollectTrainingStats(boolean collectTrainingStats)
Set whether training statistics should be collected for debugging purposes.
collectTrainingStats - If true: collect training statistics. If false: don't collect.

public SparkTrainingStats getSparkTrainingStats()
Get the training statistics, after collection of stats has been enabled using setCollectTrainingStats(boolean)

public org.apache.spark.mllib.linalg.Matrix predict(org.apache.spark.mllib.linalg.Matrix features)
Predict the given feature matrix
features - the given feature matrix

public org.apache.spark.mllib.linalg.Vector predict(org.apache.spark.mllib.linalg.Vector point)
Predict the given vector
point - the vector to predict

public MultiLayerNetwork fit(org.apache.spark.rdd.RDD<org.nd4j.linalg.dataset.DataSet> trainingData)
Fit the DataSet RDD.
trainingData - the training data RDD to fit
See also: DataSet

public MultiLayerNetwork fit(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> trainingData)
Fit the DataSet RDD
trainingData - the training data RDD to fit
See also: DataSet

public MultiLayerNetwork fit(java.lang.String path)
Fit the SparkDl4jMultiLayer network using a directory of serialized DataSet objects. The assumption here is that the directory contains a number of DataSet objects, each serialized using DataSet.save(OutputStream)
path - Path to the directory containing the serialized DataSet objects

@Deprecated
public MultiLayerNetwork fit(java.lang.String path, int minPartitions)
Deprecated. Use fit(String)

public MultiLayerNetwork fitPaths(org.apache.spark.api.java.JavaRDD<java.lang.String> paths)
Fit the network using a list of paths for serialized DataSet objects.
paths - List of paths

public MultiLayerNetwork fitLabeledPoint(org.apache.spark.api.java.JavaRDD<org.apache.spark.mllib.regression.LabeledPoint> rdd)
Fit a MultiLayerNetwork using Spark MLLib LabeledPoint instances.
rdd - the rdd to fit
See also: DataSet

public MultiLayerNetwork fitContinuousLabeledPoint(org.apache.spark.api.java.JavaRDD<org.apache.spark.mllib.regression.LabeledPoint> rdd)
Fits a MultiLayerNetwork using Spark MLLib LabeledPoint instances. This will convert labeled points that have continuous labels used for regression to the internal DL4J data format and train the model on that.
rdd - the javaRDD containing the labeled points

public double getScore()
Gets the last (average) minibatch score from calling fit.
public void setScore(double lastScore)
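The fit overloads above are typically called once per epoch; getScore() then reports the last (average) minibatch score. A minimal sketch, assuming a SparkDl4jMultiLayer instance and a prepared JavaRDD<DataSet> (both hypothetical here):

```java
// Assumed imports: org.apache.spark.api.java.JavaRDD,
// org.deeplearning4j.nn.multilayer.MultiLayerNetwork,
// org.deeplearning4j.spark.impl.multilayer.SparkDl4jMultiLayer,
// org.nd4j.linalg.dataset.DataSet
static MultiLayerNetwork train(SparkDl4jMultiLayer sparkNet,
                               JavaRDD<DataSet> trainingData,
                               int numEpochs) {
    MultiLayerNetwork trained = null;
    for (int epoch = 0; epoch < numEpochs; epoch++) {
        trained = sparkNet.fit(trainingData);  // one full pass over the RDD
        System.out.println("Epoch " + epoch + " last score: " + sparkNet.getScore());
    }
    return trained;
}
```

For data that does not fit as an in-memory RDD of DataSet objects, fit(String) and fitPaths(JavaRDD<String>) train from DataSet files serialized with DataSet.save(OutputStream) instead.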
public double calculateScore(org.apache.spark.rdd.RDD<org.nd4j.linalg.dataset.DataSet> data,
boolean average)
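To contrast calculateScore (one aggregate number for the whole data set) with scoreExamples (one score per example), a sketch assuming the network and RDDs already exist; the String key and the batch size of 64 are illustrative choices, not defaults of this class:

```java
// Assumed imports: org.apache.spark.api.java.JavaRDD,
// org.apache.spark.api.java.JavaPairRDD,
// org.deeplearning4j.spark.impl.multilayer.SparkDl4jMultiLayer,
// org.nd4j.linalg.dataset.DataSet
static void scoreBoth(SparkDl4jMultiLayer sparkNet,
                      JavaRDD<DataSet> data,
                      JavaPairRDD<String, DataSet> keyedData) {
    // One aggregate number over the whole data set
    double summed  = sparkNet.calculateScore(data, false);     // sum of all example scores
    double average = sparkNet.calculateScore(data, true, 64);  // average, 64 examples per scoring minibatch

    // One score per example; the key (here a String) lets each score
    // be traced back to the example that produced it
    JavaPairRDD<String, Double> perExample =
            sparkNet.scoreExamples(keyedData, true, 64);       // true: include l1/l2 terms
}
```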
public double calculateScore(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data, boolean average)
Calculate the score for all examples in the provided JavaRDD<DataSet>, either by summing or averaging over the entire data set. To calculate a score for each example individually, use scoreExamples(JavaPairRDD, boolean) or one of the similar methods. Uses the default minibatch size in each worker, DEFAULT_EVAL_SCORE_BATCH_SIZE
data - Data to score
average - Whether to sum the scores, or average them

public double calculateScore(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data, boolean average, int minibatchSize)
Calculate the score for all examples in the provided JavaRDD<DataSet>, either by summing or averaging over the entire data set. To calculate a score for each example individually, use scoreExamples(JavaPairRDD, boolean) or one of the similar methods
data - Data to score
average - Whether to sum the scores, or average them
minibatchSize - The number of examples to use in each minibatch when scoring. If more examples are in a partition than this, multiple scoring operations will be done (to avoid using too much memory by doing the whole partition in one go)

public org.apache.spark.api.java.JavaDoubleRDD scoreExamples(org.apache.spark.rdd.RDD<org.nd4j.linalg.dataset.DataSet> data, boolean includeRegularizationTerms)
RDD<DataSet> overload of scoreExamples(JavaPairRDD, boolean)

public org.apache.spark.api.java.JavaDoubleRDD scoreExamples(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data, boolean includeRegularizationTerms)
Score the examples individually, using the default batch size DEFAULT_EVAL_SCORE_BATCH_SIZE. Unlike calculateScore(JavaRDD, boolean), this method returns a score for each example separately. If scoring is needed for specific examples use either scoreExamples(JavaPairRDD, boolean) or scoreExamples(JavaPairRDD, boolean, int) which can have a key for each example.
data - Data to score
includeRegularizationTerms - If true: include the l1/l2 regularization terms with the score (if any)
See also: MultiLayerNetwork.scoreExamples(DataSet, boolean)

public org.apache.spark.api.java.JavaDoubleRDD scoreExamples(org.apache.spark.rdd.RDD<org.nd4j.linalg.dataset.DataSet> data, boolean includeRegularizationTerms, int batchSize)
RDD<DataSet> overload of scoreExamples(JavaRDD, boolean, int)

public org.apache.spark.api.java.JavaDoubleRDD scoreExamples(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data, boolean includeRegularizationTerms, int batchSize)
Score the examples individually, using a specified batch size. Unlike calculateScore(JavaRDD, boolean), this method returns a score for each example separately. If scoring is needed for specific examples use either scoreExamples(JavaPairRDD, boolean) or scoreExamples(JavaPairRDD, boolean, int) which can have a key for each example.
data - Data to score
includeRegularizationTerms - If true: include the l1/l2 regularization terms with the score (if any)
batchSize - Batch size to use when doing scoring
See also: MultiLayerNetwork.scoreExamples(DataSet, boolean)

public <K> org.apache.spark.api.java.JavaPairRDD<K,java.lang.Double> scoreExamples(org.apache.spark.api.java.JavaPairRDD<K,org.nd4j.linalg.dataset.DataSet> data, boolean includeRegularizationTerms)
Score the examples individually, using the default batch size DEFAULT_EVAL_SCORE_BATCH_SIZE. Unlike calculateScore(JavaRDD, boolean), this method returns a score for each example separately
K - Key type
data - Data to score
includeRegularizationTerms - If true: include the l1/l2 regularization terms with the score (if any)
Returns: JavaPairRDD<K,Double> containing the scores of each example
See also: MultiLayerNetwork.scoreExamples(DataSet, boolean)

public <K> org.apache.spark.api.java.JavaPairRDD<K,java.lang.Double> scoreExamples(org.apache.spark.api.java.JavaPairRDD<K,org.nd4j.linalg.dataset.DataSet> data, boolean includeRegularizationTerms, int batchSize)
Score the examples individually, using a specified batch size. Unlike calculateScore(JavaRDD, boolean), this method returns a score for each example separately
K - Key type
data - Data to score
includeRegularizationTerms - If true: include the l1/l2 regularization terms with the score (if any)
Returns: JavaPairRDD<K,Double> containing the scores of each example
See also: MultiLayerNetwork.scoreExamples(DataSet, boolean)

public <K> org.apache.spark.api.java.JavaPairRDD<K,org.nd4j.linalg.api.ndarray.INDArray> feedForwardWithKey(org.apache.spark.api.java.JavaPairRDD<K,org.nd4j.linalg.api.ndarray.INDArray> featuresData, int batchSize)
Feed-forward the specified data, with the given keys.
K - Type of data for key - may be anything
featuresData - Features data to feed through the network
batchSize - Batch size to use when doing feed forward operations

public Evaluation evaluate(org.apache.spark.rdd.RDD<org.nd4j.linalg.dataset.DataSet> data)
RDD<DataSet> overload of evaluate(JavaRDD)

public Evaluation evaluate(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data)
Evaluate the network (classification performance) in a distributed manner on the provided data
data - Data to evaluate on

public Evaluation evaluate(org.apache.spark.rdd.RDD<org.nd4j.linalg.dataset.DataSet> data, java.util.List<java.lang.String> labelsList)
RDD<DataSet> overload of evaluate(JavaRDD, List)

public RegressionEvaluation evaluateRegression(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data)
Evaluate the network (regression performance) in a distributed manner on the provided data
data - Data to evaluate
Returns: RegressionEvaluation instance with regression performance

public RegressionEvaluation evaluateRegression(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data, int minibatchSize)
Evaluate the network (regression performance) in a distributed manner on the provided data
data - Data to evaluate
minibatchSize - Minibatch size to use when performing evaluation
Returns: RegressionEvaluation instance with regression performance

public Evaluation evaluate(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data, java.util.List<java.lang.String> labelsList)
Evaluate the network (classification performance) in a distributed manner, using the default batch size and a provided list of labels
data - Data to evaluate on
labelsList - List of labels used for evaluation

public ROC evaluateROC(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data)
Perform ROC analysis/evaluation on the given DataSet in a distributed manner, using the default number of threshold steps (DEFAULT_ROC_THRESHOLD_STEPS) and the default minibatch size (DEFAULT_EVAL_SCORE_BATCH_SIZE)
data - Test set data (to evaluate on)

public ROC evaluateROC(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data, int thresholdSteps, int evaluationMinibatchSize)
Perform ROC analysis/evaluation on the given DataSet in a distributed manner
data - Test set data (to evaluate on)
thresholdSteps - Number of threshold steps for ROC - see ROC
evaluationMinibatchSize - Minibatch size to use when performing ROC evaluation

public ROCMultiClass evaluateROCMultiClass(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data)
Perform ROC analysis/evaluation (for the multi-class case, using ROCMultiClass) on the given DataSet in a distributed manner
data - Test set data (to evaluate on)

public ROCMultiClass evaluateROCMultiClass(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data, int thresholdSteps, int evaluationMinibatchSize)
Perform ROC analysis/evaluation (for the multi-class case, using ROCMultiClass) on the given DataSet in a distributed manner
data - Test set data (to evaluate on)
thresholdSteps - Number of threshold steps for ROC - see ROC
evaluationMinibatchSize - Minibatch size to use when performing ROC evaluation

public Evaluation evaluate(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data, java.util.List<java.lang.String> labelsList, int evalBatchSize)
Evaluate the network (classification performance) in a distributed manner, using the specified batch size and a provided list of labels
data - Data to evaluate on
labelsList - List of labels used for evaluation
evalBatchSize - Batch size to use when conducting evaluations

public <T extends IEvaluation> T doEvaluation(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> data, T emptyEvaluation, int evalBatchSize)
Perform distributed evaluation of any type of IEvaluation. For example, Evaluation, RegressionEvaluation, ROC, ROCMultiClass etc.
T - Type of evaluation instance to return
data - Data to evaluate on
emptyEvaluation - Empty evaluation instance. This is the starting point (serialized/duplicated, then merged)
evalBatchSize - Evaluation batch size