public class DataVecSparkUtil
extends java.lang.Object
| Constructor and Description |
|---|
| `DataVecSparkUtil()` |

| Modifier and Type | Method and Description |
|---|---|
| `static JavaPairRDD<Text,BytesPairWritable>` | `combineFilesForSequenceFile(JavaSparkContext sc, String path1, String path2, PathToKeyConverter converter)` Same as combineFilesForSequenceFile(JavaSparkContext, String, String, PathToKeyConverter, PathToKeyConverter), but with the same PathToKeyConverter used for both file sources. |
| `static JavaPairRDD<Text,BytesPairWritable>` | `combineFilesForSequenceFile(JavaSparkContext sc, String path1, String path2, PathToKeyConverter converter1, PathToKeyConverter converter2)` A convenience method to combine data from separate files together (intended to write to a sequence file, using JavaPairRDD.saveAsNewAPIHadoopFile(String, Class, Class, Class)). A typical use case is to combine input and label data from different files, for later parsing by a RecordReader or SequenceRecordReader. |
public static org.apache.spark.api.java.JavaPairRDD<org.apache.hadoop.io.Text,BytesPairWritable> combineFilesForSequenceFile(org.apache.spark.api.java.JavaSparkContext sc, java.lang.String path1, java.lang.String path2, PathToKeyConverter converter)

Same as combineFilesForSequenceFile(JavaSparkContext, String, String, PathToKeyConverter, PathToKeyConverter), but with the same PathToKeyConverter used for both file sources.

public static org.apache.spark.api.java.JavaPairRDD<org.apache.hadoop.io.Text,BytesPairWritable> combineFilesForSequenceFile(org.apache.spark.api.java.JavaSparkContext sc, java.lang.String path1, java.lang.String path2, PathToKeyConverter converter1, PathToKeyConverter converter2)
This is a convenience method to combine data from separate files together, intended for writing to a sequence file using JavaPairRDD.saveAsNewAPIHadoopFile(String, Class, Class, Class). A typical use case is to combine input and label data from different files, for later parsing by a RecordReader or SequenceRecordReader. Each matched pair of files is combined into a BytesPairWritable, which also contains the original file paths of the files. Pairings are determined by converting each file path to a key with a PathToKeyConverter; keys are then matched to give pairs of files. For example, to pair each file in dir1 with the equivalent file in dir2, by file name:
```java
JavaSparkContext sc = ...;
String path1 = "/dir1";
String path2 = "/dir2";
PathToKeyConverter pathConverter = new PathToKeyConverterFilename();
JavaPairRDD<Text,BytesPairWritable> toWrite = DataVecSparkUtil.combineFilesForSequenceFile(sc, path1, path2, pathConverter, pathConverter);
String outputPath = "/my/output/path";
toWrite.saveAsNewAPIHadoopFile(outputPath, Text.class, BytesPairWritable.class, SequenceFileOutputFormat.class);
```
Result: the file contents are combined (pairwise) and written to a Hadoop sequence file at /my/output/path.

Parameters:
sc - Spark context
path1 - First directory (passed to JavaSparkContext.binaryFiles(path1))
path2 - Second directory (passed to JavaSparkContext.binaryFiles(path2))
converter1 - Converter, to convert file paths in the first directory to a key (to allow files to be matched/paired by key)
converter2 - As above, for the second directory
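To make the key-matching behaviour concrete, the sketch below shows the kind of path-to-key mapping a filename-based converter (such as PathToKeyConverterFilename) could perform: strip the directory and extension so that "/dir1/file_0.csv" and "/dir2/file_0.txt" produce the same key and are paired. The standalone class and the helper name `pathToKey` are hypothetical illustrations, not part of the DataVec API.

```java
import java.nio.file.Paths;

public class FilenameKeyDemo {
    // Hypothetical key extraction: take the file name, drop the extension.
    // Two files in different directories with the same base name then map
    // to the same key, which is how pairwise matching can be achieved.
    static String pathToKey(String path) {
        String name = Paths.get(path).getFileName().toString();
        int dot = name.lastIndexOf('.');
        return dot >= 0 ? name.substring(0, dot) : name;
    }

    public static void main(String[] args) {
        System.out.println(pathToKey("/dir1/file_0.csv")); // file_0
        System.out.println(pathToKey("/dir2/file_0.txt")); // file_0
    }
}
```

With such a converter, every key produced from path1 that also appears among the keys produced from path2 yields one BytesPairWritable containing both files' contents.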