public class DataFrames
extends java.lang.Object
| Modifier and Type | Field and Description |
|---|---|
| static java.lang.String | SEQUENCE_INDEX_COLUMN |
| static java.lang.String | SEQUENCE_UUID_COLUMN |
| Modifier and Type | Method and Description |
|---|---|
| static org.apache.spark.sql.types.StructType | fromSchema(Schema schema) — Convert a DataVec schema to a Spark StructType. |
| static org.apache.spark.sql.types.StructType | fromSchemaSequence(Schema schema) — Convert a DataVec sequence schema to a StructType for Spark, for example for use in toDataFrameSequence(Schema, JavaRDD). Note: as per toDataFrameSequence(Schema, JavaRDD), the StructType has two additional columns added to it: column 0 is a sequence UUID (name: SEQUENCE_UUID_COLUMN), a UUID for the original sequence; column 1 is a sequence index (name: SEQUENCE_INDEX_COLUMN), an integer index, starting at 0, for the position of this record in the original time series. These two columns are required if the data is to be converted back into a sequence at a later point, for example using toRecordsSequence(DataRowsFacade). |
| static Schema | fromStructType(org.apache.spark.sql.types.StructType structType) — Create a DataVec schema from a Spark StructType. |
| static org.apache.spark.sql.Column | max(DataRowsFacade dataFrame, java.lang.String columnName) — Maximum of a column. |
| static org.apache.spark.sql.Column | mean(DataRowsFacade dataFrame, java.lang.String columnName) — Mean of a column. |
| static org.apache.spark.sql.Column | min(DataRowsFacade dataFrame, java.lang.String columnName) — Minimum of a column. |
| static java.util.List<Writable> | rowToWritables(Schema schema, org.apache.spark.sql.Row row) — Convert a given Row to a list of writables, given the specified Schema. |
| static org.apache.spark.sql.Column | std(DataRowsFacade dataFrame, java.lang.String columnName) — Standard deviation of a column. |
| static java.lang.String[] | toArray(java.util.List<java.lang.String> list) — Convert a string list into an array. |
| static java.util.List<org.apache.spark.sql.Column> | toColumn(java.util.List<java.lang.String> columns) — Convert a list of string names to Columns. |
| static org.apache.spark.sql.Column[] | toColumns(java.lang.String... columns) — Convert an array of string names to Columns. |
| static DataRowsFacade | toDataFrame(Schema schema, org.apache.spark.api.java.JavaRDD<java.util.List<Writable>> data) — Create a DataFrame from an RDD of writables, given a schema. |
| static DataRowsFacade | toDataFrameSequence(Schema schema, org.apache.spark.api.java.JavaRDD<java.util.List<java.util.List<Writable>>> data) — Convert the given sequence data set to a DataFrame. Note: the resulting DataFrame has two additional columns added to it: column 0 is a sequence UUID (name: SEQUENCE_UUID_COLUMN), a UUID for the original sequence; column 1 is a sequence index (name: SEQUENCE_INDEX_COLUMN), an integer index, starting at 0, for the position of this record in the original time series. These two columns are required if the data is to be converted back into a sequence at a later point, for example using toRecordsSequence(DataRowsFacade). |
| static java.util.List<java.lang.String> | toList(java.lang.String[] input) — Convert a string array into a list. |
| static org.nd4j.linalg.api.ndarray.INDArray | toMatrix(java.util.List<org.apache.spark.sql.Row> rows) — Convert a list of rows to an INDArray matrix. |
| static Pair<Schema,org.apache.spark.api.java.JavaRDD<java.util.List<Writable>>> | toRecords(DataRowsFacade dataFrame) — Create a compatible Schema and RDD for DataVec from a DataFrame. |
| static Pair<Schema,org.apache.spark.api.java.JavaRDD<java.util.List<java.util.List<Writable>>>> | toRecordsSequence(DataRowsFacade dataFrame) — Convert the given DataFrame to a sequence. Note: it is assumed here that the DataFrame has been created by toDataFrameSequence(Schema, JavaRDD). |
| static org.apache.spark.sql.Column | var(DataRowsFacade dataFrame, java.lang.String columnName) — Variance of a column. |
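The sequence-column convention used by toDataFrameSequence(Schema, JavaRDD) and toRecordsSequence(DataRowsFacade) can be illustrated without Spark. The sketch below is plain Java with hypothetical names (it is not the library's implementation): it regroups flattened rows back into ordered sequences using a UUID column (column 0) and an integer index column (column 1), which is conceptually what toRecordsSequence relies on those two extra columns for.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

// Illustration only: regroup flattened rows into per-sequence, time-ordered
// lists using a sequence-UUID column and a sequence-index column, mirroring
// the two metadata columns toDataFrameSequence adds. Names are hypothetical.
public class SequenceRegroup {

    // Each row: [sequenceUuid (String), sequenceIndex (Integer), value...]
    public static Map<String, List<Object[]>> regroup(List<Object[]> rows) {
        Map<String, List<Object[]>> bySequence = new LinkedHashMap<>();
        for (Object[] row : rows) {
            String uuid = (String) row[0];
            bySequence.computeIfAbsent(uuid, k -> new ArrayList<>()).add(row);
        }
        // Restore time-step order within each sequence via the index column
        for (List<Object[]> seq : bySequence.values()) {
            seq.sort(Comparator.comparingInt(r -> (Integer) r[1]));
        }
        return bySequence;
    }

    public static void main(String[] args) {
        String a = UUID.randomUUID().toString();
        String b = UUID.randomUUID().toString();
        List<Object[]> rows = Arrays.asList(
                new Object[]{a, 1, 10.0},
                new Object[]{b, 0, 99.0},
                new Object[]{a, 0, 5.0});
        Map<String, List<Object[]>> seqs = regroup(rows);
        System.out.println(seqs.get(a).get(0)[2]); // prints 5.0 (step 0 of a)
    }
}
```

Without the UUID column the rows of different sequences could not be separated after flattening, and without the index column their original order could not be restored, which is why both are mandatory for the round trip.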
public static final java.lang.String SEQUENCE_UUID_COLUMN
public static final java.lang.String SEQUENCE_INDEX_COLUMN
public static org.apache.spark.sql.Column std(DataRowsFacade dataFrame, java.lang.String columnName)
Standard deviation of a column.
dataFrame - the DataFrame to get the column from
columnName - the name of the column to get the standard deviation for

public static org.apache.spark.sql.Column var(DataRowsFacade dataFrame, java.lang.String columnName)
Variance of a column.
dataFrame - the DataFrame to get the column from
columnName - the name of the column to get the variance for

public static org.apache.spark.sql.Column min(DataRowsFacade dataFrame, java.lang.String columnName)
Minimum of a column.
dataFrame - the DataFrame to get the column from
columnName - the name of the column to get the min for

public static org.apache.spark.sql.Column max(DataRowsFacade dataFrame, java.lang.String columnName)
Maximum of a column.
dataFrame - the DataFrame to get the column from
columnName - the name of the column to get the max for

public static org.apache.spark.sql.Column mean(DataRowsFacade dataFrame, java.lang.String columnName)
Mean of a column.
dataFrame - the DataFrame to get the column from
columnName - the name of the column to get the mean for

public static org.apache.spark.sql.types.StructType fromSchema(Schema schema)
Convert a DataVec schema to a Spark StructType.
schema - the schema to convert

public static org.apache.spark.sql.types.StructType fromSchemaSequence(Schema schema)
Convert the DataVec sequence schema to a StructType for Spark, for example for use in toDataFrameSequence(Schema, JavaRDD). Note: as per toDataFrameSequence(Schema, JavaRDD), the StructType has two additional columns added to it: column 0 is a sequence UUID (name: SEQUENCE_UUID_COLUMN), a UUID for the original sequence; column 1 is a sequence index (name: SEQUENCE_INDEX_COLUMN), an integer index, starting at 0, for the position of this record in the original time series. These two columns are required if the data is to be converted back into a sequence at a later point, for example using toRecordsSequence(DataRowsFacade).
schema - Schema to convert

public static Schema fromStructType(org.apache.spark.sql.types.StructType structType)
Create a DataVec schema from a Spark StructType.
structType - the struct type to create the schema from

public static Pair<Schema,org.apache.spark.api.java.JavaRDD<java.util.List<Writable>>> toRecords(DataRowsFacade dataFrame)
Create a compatible Schema and RDD for DataVec.
dataFrame - the DataFrame to convert

public static Pair<Schema,org.apache.spark.api.java.JavaRDD<java.util.List<java.util.List<Writable>>>> toRecordsSequence(DataRowsFacade dataFrame)
Convert the given DataFrame to a sequence. Note: it is assumed here that the DataFrame has been created by toDataFrameSequence(Schema, JavaRDD); in particular, that the sequence UUID and sequence index columns are present. Typical use: normalization via the Normalization static methods.
dataFrame - Data frame to convert, returning the sequence data in List<List<Writable>> form

public static DataRowsFacade toDataFrame(Schema schema, org.apache.spark.api.java.JavaRDD<java.util.List<Writable>> data)
Creates a DataFrame from an RDD of writables, given a schema.
schema - the schema to use
data - the data to convert

public static DataRowsFacade toDataFrameSequence(Schema schema, org.apache.spark.api.java.JavaRDD<java.util.List<java.util.List<Writable>>> data)
Convert the given sequence data set to a DataFrame. Note: the resulting DataFrame has two additional columns added to it: column 0 is a sequence UUID (name: SEQUENCE_UUID_COLUMN), a UUID for the original sequence; column 1 is a sequence index (name: SEQUENCE_INDEX_COLUMN), an integer index, starting at 0, for the position of this record in the original time series. These two columns are required if the data is to be converted back into a sequence at a later point, for example using toRecordsSequence(DataRowsFacade).
schema - Schema for the data
data - Sequence data to convert to a DataFrame

public static java.util.List<Writable> rowToWritables(Schema schema, org.apache.spark.sql.Row row)
Convert a given Row to a list of writables, given the specified Schema.
schema - Schema for the data
row - Row of data

public static java.util.List<java.lang.String> toList(java.lang.String[] input)
Convert a string array into a list.
input - the input to create the list from

public static java.lang.String[] toArray(java.util.List<java.lang.String> list)
Convert a string list into an array.
list - the input to create the array from

public static org.nd4j.linalg.api.ndarray.INDArray toMatrix(java.util.List<org.apache.spark.sql.Row> rows)
Convert a list of rows to an INDArray matrix.
rows - the list of rows to convert

public static java.util.List<org.apache.spark.sql.Column> toColumn(java.util.List<java.lang.String> columns)
Convert a list of string names to Columns.
columns - the columns to convert

public static org.apache.spark.sql.Column[] toColumns(java.lang.String... columns)
Convert an array of string names to Columns.
columns - the columns to convert
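The aggregation helpers (min, max, mean, std, var) return Spark Column expressions to be evaluated by Spark, not computed values. As a plain-Java sketch of the statistics they correspond to (this is an illustration, not the library's code; sample (n-1) variance is assumed here, matching Spark SQL's default stddev/variance semantics):

```java
// Plain-Java illustration of the per-column statistics that min/max/mean/
// std/var expose as Spark Column expressions. Sample (n-1) variance is
// assumed, as in Spark SQL's default stddev/variance.
public class ColumnStats {

    public static double min(double[] v) {
        double m = v[0];
        for (double x : v) m = Math.min(m, x);
        return m;
    }

    public static double max(double[] v) {
        double m = v[0];
        for (double x : v) m = Math.max(m, x);
        return m;
    }

    public static double mean(double[] v) {
        double s = 0;
        for (double x : v) s += x;
        return s / v.length;
    }

    public static double var(double[] v) {
        double mu = mean(v), s = 0;
        for (double x : v) s += (x - mu) * (x - mu);
        return s / (v.length - 1);   // sample variance
    }

    public static double std(double[] v) {
        return Math.sqrt(var(v));
    }

    public static void main(String[] args) {
        double[] col = {1.0, 2.0, 3.0, 4.0};
        System.out.println(mean(col)); // prints 2.5
        System.out.println(std(col));
    }
}
```

In actual use the returned Columns would be passed to a Spark aggregation (e.g. a select over the DataFrame), so the computation runs distributed rather than on a local array as above.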