public class DataVecSparkUtil
extends java.lang.Object
| Constructor and Description |
|---|
| `DataVecSparkUtil()` |

| Modifier and Type | Method and Description |
|---|---|
| `static JavaPairRDD<Text,BytesPairWritable>` | `combineFilesForSequenceFile(JavaSparkContext sc, String path1, String path2, PathToKeyConverter converter)` Same as combineFilesForSequenceFile(JavaSparkContext, String, String, PathToKeyConverter, PathToKeyConverter), but with the same PathToKeyConverter used for both file sources. |
| `static JavaPairRDD<Text,BytesPairWritable>` | `combineFilesForSequenceFile(JavaSparkContext sc, String path1, String path2, PathToKeyConverter converter1, PathToKeyConverter converter2)` A convenience method to combine data from separate files together (intended to write to a sequence file, using JavaPairRDD.saveAsNewAPIHadoopFile(String, Class, Class, Class)). A typical use case is to combine input and label data from different files, for later parsing by a RecordReader or SequenceRecordReader. |
public static org.apache.spark.api.java.JavaPairRDD<org.apache.hadoop.io.Text,BytesPairWritable> combineFilesForSequenceFile(org.apache.spark.api.java.JavaSparkContext sc, java.lang.String path1, java.lang.String path2, PathToKeyConverter converter)

Same as combineFilesForSequenceFile(JavaSparkContext, String, String, PathToKeyConverter, PathToKeyConverter), but with the same PathToKeyConverter used for both file sources.

public static org.apache.spark.api.java.JavaPairRDD<org.apache.hadoop.io.Text,BytesPairWritable> combineFilesForSequenceFile(org.apache.spark.api.java.JavaSparkContext sc, java.lang.String path1, java.lang.String path2, PathToKeyConverter converter1, PathToKeyConverter converter2)
This is a convenience method to combine data from separate files together, intended for writing to a sequence file using JavaPairRDD.saveAsNewAPIHadoopFile(String, Class, Class, Class). A typical use case is to combine input and label data from different files, for later parsing by a RecordReader or SequenceRecordReader. Each matched pair of files is combined into a BytesPairWritable, which also contains the original file paths of the files. Pairings are determined by converting each file path to a key with a PathToKeyConverter; keys are then matched to give pairs of files. For example, to pair each file in dir1 with the equivalent file in dir2, by file name:
```java
JavaSparkContext sc = ...;
String path1 = "/dir1";
String path2 = "/dir2";
PathToKeyConverter pathConverter = new PathToKeyConverterFilename();
JavaPairRDD<Text,BytesPairWritable> toWrite = DataVecSparkUtil.combineFilesForSequenceFile(sc, path1, path2, pathConverter, pathConverter);
String outputPath = "/my/output/path";
toWrite.saveAsNewAPIHadoopFile(outputPath, Text.class, BytesPairWritable.class, SequenceFileOutputFormat.class);
```
Result: the file contents are combined (pairwise) and written to a Hadoop sequence file at /my/output/path.

Parameters:
sc - Spark context
path1 - First directory (passed to JavaSparkContext.binaryFiles(path1))
path2 - Second directory (passed to JavaSparkContext.binaryFiles(path2))
converter1 - Converter, to convert file paths in the first directory to a key (to allow files to be matched/paired by key)
converter2 - As above, for the second directory
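To make the key-matching behaviour concrete, the sketch below shows the kind of path-to-key mapping a filename-based converter (such as PathToKeyConverterFilename) could perform: strip the directory and extension so that "/dir1/file_0.csv" and "/dir2/file_0.txt" produce the same key and are paired. The standalone class and the helper name `pathToKey` are hypothetical illustrations, not part of the DataVec API.

```java
import java.nio.file.Paths;

public class FilenameKeyDemo {
    // Hypothetical key extraction: take the file name, drop the extension.
    // Two files in different directories with the same base name then map
    // to the same key, which is how pairwise matching can be achieved.
    static String pathToKey(String path) {
        String name = Paths.get(path).getFileName().toString();
        int dot = name.lastIndexOf('.');
        return dot >= 0 ? name.substring(0, dot) : name;
    }

    public static void main(String[] args) {
        System.out.println(pathToKey("/dir1/file_0.csv")); // file_0
        System.out.println(pathToKey("/dir2/file_0.txt")); // file_0
    }
}
```

With such a converter, every key produced from path1 that also appears among the keys produced from path2 yields one BytesPairWritable containing both files' contents.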