public class SparkUtils
extends java.lang.Object
Modifier and Type | Method and Description |
---|---|
static <T,U> org.apache.spark.api.java.JavaPairRDD<T,U>[] | balancedRandomSplit(int totalObjectCount, int numObjectsPerSplit, org.apache.spark.api.java.JavaPairRDD<T,U> data) Equivalent to balancedRandomSplit(int, int, JavaRDD), but for pair RDDs |
static <T,U> org.apache.spark.api.java.JavaPairRDD<T,U>[] | balancedRandomSplit(int totalObjectCount, int numObjectsPerSplit, org.apache.spark.api.java.JavaPairRDD<T,U> data, long rngSeed) Equivalent to balancedRandomSplit(int, int, JavaRDD), but for pair RDDs and with control over the RNG seed |
static <T> org.apache.spark.api.java.JavaRDD<T>[] | balancedRandomSplit(int totalObjectCount, int numObjectsPerSplit, org.apache.spark.api.java.JavaRDD<T> data) Randomly split the specified RDD into a number of RDDs, each containing numObjectsPerSplit objects |
static <T> org.apache.spark.api.java.JavaRDD<T>[] | balancedRandomSplit(int totalObjectCount, int numObjectsPerSplit, org.apache.spark.api.java.JavaRDD<T> data, long rngSeed) Equivalent to balancedRandomSplit(int, int, JavaRDD), but with control over the RNG seed |
static boolean | checkKryoConfiguration(org.apache.spark.api.java.JavaSparkContext javaSparkContext, org.slf4j.Logger log) Check the Spark configuration for an incorrect Kryo configuration, logging a warning message if necessary |
static org.apache.spark.api.java.JavaRDD<java.lang.String> | listPaths(org.apache.spark.api.java.JavaSparkContext sc, java.lang.String path) List the files in the given directory (path) as a JavaRDD<String> |
static <T> T | readObjectFromFile(java.lang.String path, java.lang.Class<T> type, org.apache.spark.api.java.JavaSparkContext sc) Read an object from HDFS (or local) using default Java object serialization |
static <T> T | readObjectFromFile(java.lang.String path, java.lang.Class<T> type, org.apache.spark.SparkContext sc) Read an object from HDFS (or local) using default Java object serialization |
static java.lang.String | readStringFromFile(java.lang.String path, org.apache.spark.api.java.JavaSparkContext sc) Read a UTF-8 format String from HDFS (or local) |
static java.lang.String | readStringFromFile(java.lang.String path, org.apache.spark.SparkContext sc) Read a UTF-8 format String from HDFS (or local) |
static <T> org.apache.spark.api.java.JavaRDD<T> | repartition(org.apache.spark.api.java.JavaRDD<T> rdd, Repartition repartition, RepartitionStrategy repartitionStrategy, int objectsPerPartition, int numPartitions) Repartition the specified RDD (or not) using the given Repartition and RepartitionStrategy settings |
static <T> org.apache.spark.api.java.JavaRDD<T> | repartitionBalanceIfRequired(org.apache.spark.api.java.JavaRDD<T> rdd, Repartition repartition, int objectsPerPartition, int numPartitions) Repartition an RDD (given the Repartition setting) such that there are approximately numPartitions partitions, each containing objectsPerPartition objects |
static org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> | shuffleExamples(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> rdd, int newBatchSize, int numPartitions) Randomly shuffle the examples in each DataSet object, and recombine them into new DataSet objects with the specified batch size |
static void | writeObjectToFile(java.lang.String path, java.lang.Object toWrite, org.apache.spark.api.java.JavaSparkContext sc) Write an object to HDFS (or local) using default Java object serialization |
static void | writeObjectToFile(java.lang.String path, java.lang.Object toWrite, org.apache.spark.SparkContext sc) Write an object to HDFS (or local) using default Java object serialization |
static void | writeStringToFile(java.lang.String path, java.lang.String toWrite, org.apache.spark.api.java.JavaSparkContext sc) Write a String to a file (on HDFS or local) in UTF-8 format |
static void | writeStringToFile(java.lang.String path, java.lang.String toWrite, org.apache.spark.SparkContext sc) Write a String to a file (on HDFS or local) in UTF-8 format |
public static boolean checkKryoConfiguration(org.apache.spark.api.java.JavaSparkContext javaSparkContext, org.slf4j.Logger log)

Check the Spark configuration for an incorrect Kryo configuration, logging a warning message if necessary.

Parameters:
- javaSparkContext - Spark context
- log - Logger to log messages to
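Example usage (a minimal sketch, not from the original Javadoc): the `org.deeplearning4j.spark.util.SparkUtils` import path and the `org.nd4j.Nd4jRegistrator` registrator class are assumptions based on the usual DL4J/ND4J Kryo setup; verify both against your version before relying on them.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.deeplearning4j.spark.util.SparkUtils;   // assumed package for this class
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class KryoCheckExample {
    private static final Logger log = LoggerFactory.getLogger(KryoCheckExample.class);

    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("kryo-check")
                .setMaster("local[*]")
                // Kryo needs an ND4J-aware registrator; the class name below is an assumption
                // based on the standard nd4j-kryo module
                .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                .set("spark.kryo.registrator", "org.nd4j.Nd4jRegistrator");

        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Logs a warning if Kryo is enabled without a suitable registrator
            boolean ok = SparkUtils.checkKryoConfiguration(sc, log);
            log.info("Kryo configuration check result: {}", ok);
        }
    }
}
```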
public static void writeStringToFile(java.lang.String path, java.lang.String toWrite, org.apache.spark.api.java.JavaSparkContext sc) throws java.io.IOException

Write a String to a file (on HDFS or local) in UTF-8 format.

Parameters:
- path - Path to write to
- toWrite - String to write
- sc - Spark context

Throws:
- java.io.IOException
public static void writeStringToFile(java.lang.String path, java.lang.String toWrite, org.apache.spark.SparkContext sc) throws java.io.IOException

Write a String to a file (on HDFS or local) in UTF-8 format.

Parameters:
- path - Path to write to
- toWrite - String to write
- sc - Spark context

Throws:
- java.io.IOException
public static java.lang.String readStringFromFile(java.lang.String path, org.apache.spark.api.java.JavaSparkContext sc) throws java.io.IOException

Read a UTF-8 format String from HDFS (or local).

Parameters:
- path - Path to read the String from
- sc - Spark context

Throws:
- java.io.IOException
public static java.lang.String readStringFromFile(java.lang.String path, org.apache.spark.SparkContext sc) throws java.io.IOException
path
- Path to write the stringsc
- Spark contextjava.io.IOException
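A minimal round-trip sketch for the String helpers above (not from the original Javadoc; the `SparkUtils` import path is an assumption):

```java
import org.apache.spark.api.java.JavaSparkContext;
import org.deeplearning4j.spark.util.SparkUtils;   // assumed package for this class

import java.io.IOException;

public class StringFileExample {
    // Round-trip a UTF-8 String through HDFS or the local filesystem
    static void roundTrip(JavaSparkContext sc) throws IOException {
        String path = "file:///tmp/sparkutils-example.txt";     // an hdfs:// URI works the same way
        SparkUtils.writeStringToFile(path, "hello from SparkUtils", sc);
        String restored = SparkUtils.readStringFromFile(path, sc);
        System.out.println(restored);                           // prints "hello from SparkUtils"
    }
}
```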
public static void writeObjectToFile(java.lang.String path, java.lang.Object toWrite, org.apache.spark.api.java.JavaSparkContext sc) throws java.io.IOException

Write an object to HDFS (or local) using default Java object serialization.

Parameters:
- path - Path to write the object to
- toWrite - Object to write
- sc - Spark context

Throws:
- java.io.IOException
public static void writeObjectToFile(java.lang.String path, java.lang.Object toWrite, org.apache.spark.SparkContext sc) throws java.io.IOException

Write an object to HDFS (or local) using default Java object serialization.

Parameters:
- path - Path to write the object to
- toWrite - Object to write
- sc - Spark context

Throws:
- java.io.IOException
public static <T> T readObjectFromFile(java.lang.String path, java.lang.Class<T> type, org.apache.spark.api.java.JavaSparkContext sc) throws java.io.IOException

Read an object from HDFS (or local) using default Java object serialization.

Type Parameters:
- T - Type of the object to read

Parameters:
- path - File to read
- type - Class of the object to read
- sc - Spark context

Throws:
- java.io.IOException
public static <T> T readObjectFromFile(java.lang.String path, java.lang.Class<T> type, org.apache.spark.SparkContext sc) throws java.io.IOException

Read an object from HDFS (or local) using default Java object serialization.

Type Parameters:
- T - Type of the object to read

Parameters:
- path - File to read
- type - Class of the object to read
- sc - Spark context

Throws:
- java.io.IOException
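A minimal round-trip sketch for the object serialization helpers above (not from the original Javadoc; the `SparkUtils` import path is an assumption):

```java
import org.apache.spark.api.java.JavaSparkContext;
import org.deeplearning4j.spark.util.SparkUtils;   // assumed package for this class

import java.io.IOException;
import java.util.HashMap;

public class ObjectFileExample {
    // Persist a Serializable object with default Java serialization and read it back
    static void roundTrip(JavaSparkContext sc) throws IOException {
        HashMap<String, Integer> counts = new HashMap<>();
        counts.put("examples", 1000);

        String path = "file:///tmp/counts.obj";                 // an hdfs:// URI works the same way
        SparkUtils.writeObjectToFile(path, counts, sc);

        HashMap restored = SparkUtils.readObjectFromFile(path, HashMap.class, sc);
        System.out.println(restored);                           // prints {examples=1000}
    }
}
```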
public static <T> org.apache.spark.api.java.JavaRDD<T> repartition(org.apache.spark.api.java.JavaRDD<T> rdd, Repartition repartition, RepartitionStrategy repartitionStrategy, int objectsPerPartition, int numPartitions)

Repartition the specified RDD (or not) using the given Repartition and RepartitionStrategy settings.

Type Parameters:
- T - Type of the RDD

Parameters:
- rdd - RDD to repartition
- repartition - Setting for when repartitioning is to be conducted
- repartitionStrategy - Setting for how repartitioning is to be conducted
- objectsPerPartition - Desired number of objects per partition
- numPartitions - Total number of partitions
public static <T> org.apache.spark.api.java.JavaRDD<T> repartitionBalanceIfRequired(org.apache.spark.api.java.JavaRDD<T> rdd, Repartition repartition, int objectsPerPartition, int numPartitions)

Repartition an RDD (given the Repartition setting) such that we end up with approximately numPartitions partitions, each containing objectsPerPartition objects.

Type Parameters:
- T - Type of RDD

Parameters:
- rdd - RDD to repartition
- repartition - Repartitioning setting
- objectsPerPartition - Number of objects we want in each partition
- numPartitions - Number of partitions to have
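A sketch of how the two repartitioning helpers above might be called (not from the original Javadoc). The `Repartition.Always` and `RepartitionStrategy.Balanced` constants, and all import paths, are assumptions; check the enums shipped with your DL4J version.

```java
import org.apache.spark.api.java.JavaRDD;
import org.deeplearning4j.spark.api.Repartition;           // assumed package and constant names
import org.deeplearning4j.spark.api.RepartitionStrategy;   // assumed package and constant names
import org.deeplearning4j.spark.util.SparkUtils;           // assumed package for this class

public class RepartitionExample {
    // Aim for ~16 partitions of ~32 objects each, forcing the repartition and balancing partition sizes
    static <T> JavaRDD<T> rebalance(JavaRDD<T> rdd) {
        return SparkUtils.repartition(rdd, Repartition.Always, RepartitionStrategy.Balanced, 32, 16);
    }

    // Same target sizes; whether any repartitioning happens at all is decided by the Repartition setting
    static <T> JavaRDD<T> rebalanceIfRequired(JavaRDD<T> rdd) {
        return SparkUtils.repartitionBalanceIfRequired(rdd, Repartition.Always, 32, 16);
    }
}
```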
public static <T> org.apache.spark.api.java.JavaRDD<T>[] balancedRandomSplit(int totalObjectCount, int numObjectsPerSplit, org.apache.spark.api.java.JavaRDD<T> data)

Randomly split the specified RDD into a number of RDDs, each containing numObjectsPerSplit objects.
This is similar to how RDD.randomSplit works (i.e., split via filtering), but it should result in more equal splits, instead of the independent binomial sampling (based on weighting) used there.
This balanced splitting approach is important when the number of DataSet objects we want in each split is small, as the random sampling variance of JavaRDD.randomSplit(double[]) is quite large relative to the number of examples in each split. Note, however, that this method doesn't guarantee that partitions will be balanced.
The downside is that we need the total object count (whereas JavaRDD.randomSplit(double[]) does not). However, randomSplit requires a full pass over the data anyway (in order to do the filtering), so this should not add much overhead in practice.

Type Parameters:
- T - Generic type for the RDD

Parameters:
- totalObjectCount - Total number of objects in the RDD to split
- numObjectsPerSplit - Number of objects in each split
- data - Data to split
public static <T> org.apache.spark.api.java.JavaRDD<T>[] balancedRandomSplit(int totalObjectCount, int numObjectsPerSplit, org.apache.spark.api.java.JavaRDD<T> data, long rngSeed)

Equivalent to balancedRandomSplit(int, int, JavaRDD), but with control over the RNG seed.
public static <T,U> org.apache.spark.api.java.JavaPairRDD<T,U>[] balancedRandomSplit(int totalObjectCount, int numObjectsPerSplit, org.apache.spark.api.java.JavaPairRDD<T,U> data)

Equivalent to balancedRandomSplit(int, int, JavaRDD), but for pair RDDs.
public static <T,U> org.apache.spark.api.java.JavaPairRDD<T,U>[] balancedRandomSplit(int totalObjectCount, int numObjectsPerSplit, org.apache.spark.api.java.JavaPairRDD<T,U> data, long rngSeed)

Equivalent to balancedRandomSplit(int, int, JavaRDD), but for pair RDDs and with control over the RNG seed.
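A sketch of splitting an RDD of DataSets with the balancedRandomSplit overloads above (not from the original Javadoc; the `SparkUtils` import path is an assumption). Note that totalObjectCount must match the actual number of objects in the RDD.

```java
import org.apache.spark.api.java.JavaRDD;
import org.deeplearning4j.spark.util.SparkUtils;   // assumed package for this class
import org.nd4j.linalg.dataset.DataSet;

public class BalancedSplitExample {
    // Split the data into RDDs of roughly 1,000 DataSets each, with a fixed seed for reproducibility
    static JavaRDD<DataSet>[] split(JavaRDD<DataSet> data) {
        int totalObjectCount = (int) data.count();   // balancedRandomSplit needs the total count up front
        int numObjectsPerSplit = 1000;
        return SparkUtils.balancedRandomSplit(totalObjectCount, numObjectsPerSplit, data, 12345L);
    }
}
```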
public static org.apache.spark.api.java.JavaRDD<java.lang.String> listPaths(org.apache.spark.api.java.JavaSparkContext sc, java.lang.String path) throws java.io.IOException

List the files in the given directory (path) as a JavaRDD<String>.

Parameters:
- sc - Spark context
- path - Path to list files in

Throws:
- java.io.IOException - If an error occurs getting the directory contents
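Example usage for listPaths (a minimal sketch, not from the original Javadoc; the `SparkUtils` import path and the directory URI are assumptions):

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.deeplearning4j.spark.util.SparkUtils;   // assumed package for this class

import java.io.IOException;

public class ListPathsExample {
    // Print every file found under the given directory
    static void printPaths(JavaSparkContext sc, String dir) throws IOException {
        JavaRDD<String> paths = SparkUtils.listPaths(sc, dir);   // e.g., dir = "hdfs:///data/datasets/"
        for (String p : paths.collect()) {
            System.out.println(p);
        }
    }
}
```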
public static org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> shuffleExamples(org.apache.spark.api.java.JavaRDD<org.nd4j.linalg.dataset.DataSet> rdd, int newBatchSize, int numPartitions)

Randomly shuffle the examples in each DataSet object, and recombine them into new DataSet objects with the specified batch size.

Parameters:
- rdd - DataSets to shuffle/recombine
- newBatchSize - New batch size for the DataSet objects, after shuffling/recombining
- numPartitions - Number of partitions to use when splitting/recombining

Returns:
- JavaRDD, with the examples shuffled/combined in each DataSet
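Example usage for shuffleExamples (a minimal sketch, not from the original Javadoc; the `SparkUtils` import path and the chosen batch size/partition count are assumptions):

```java
import org.apache.spark.api.java.JavaRDD;
import org.deeplearning4j.spark.util.SparkUtils;   // assumed package for this class
import org.nd4j.linalg.dataset.DataSet;

public class ShuffleExamplesExample {
    // Break the existing DataSets apart, shuffle the individual examples, and
    // recombine them into new DataSets of 32 examples each, spread over 8 partitions
    static JavaRDD<DataSet> reshuffle(JavaRDD<DataSet> batched) {
        return SparkUtils.shuffleExamples(batched, 32, 8);
    }
}
```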