public class DataFrames
extends java.lang.Object
Modifier and Type | Field and Description |
---|---|
`static java.lang.String` | `SEQUENCE_INDEX_COLUMN` |
`static java.lang.String` | `SEQUENCE_UUID_COLUMN` |
Modifier and Type | Method and Description |
---|---|
`static org.apache.spark.sql.types.StructType` | `fromSchema(Schema schema)`: Convert a DataVec schema to a Spark StructType. |
`static org.apache.spark.sql.types.StructType` | `fromSchemaSequence(Schema schema)`: Convert the DataVec sequence schema to a StructType for Spark, for example for use in `toDataFrameSequence(Schema, JavaRDD)`. Note: as per `toDataFrameSequence(Schema, JavaRDD)`, the StructType has two additional columns added to it. Column 0 is the sequence UUID (name: `SEQUENCE_UUID_COLUMN`), a UUID for the original sequence; column 1 is the sequence index (name: `SEQUENCE_INDEX_COLUMN`), an index (integer, starting at 0) for the position of this record in the original time series. These two columns are required if the data is to be converted back into a sequence at a later point, for example using `toRecordsSequence(DataRowsFacade)`. |
`static Schema` | `fromStructType(org.apache.spark.sql.types.StructType structType)`: Create a DataVec schema from a Spark StructType. |
`static org.apache.spark.sql.Column` | `max(DataRowsFacade dataFrame, java.lang.String columnName)`: Max for a column. |
`static org.apache.spark.sql.Column` | `mean(DataRowsFacade dataFrame, java.lang.String columnName)`: Mean for a column. |
`static org.apache.spark.sql.Column` | `min(DataRowsFacade dataFrame, java.lang.String columnName)`: Min for a column. |
`static java.util.List<Writable>` | `rowToWritables(Schema schema, org.apache.spark.sql.Row row)`: Convert a given Row to a list of writables, given the specified Schema. |
`static org.apache.spark.sql.Column` | `std(DataRowsFacade dataFrame, java.lang.String columnName)`: Standard deviation for a column. |
`static java.lang.String[]` | `toArray(java.util.List<java.lang.String> list)`: Convert a string list into an array. |
`static java.util.List<org.apache.spark.sql.Column>` | `toColumn(java.util.List<java.lang.String> columns)`: Convert a list of string column names to Columns. |
`static org.apache.spark.sql.Column[]` | `toColumns(java.lang.String... columns)`: Convert an array of string column names to Columns. |
`static DataRowsFacade` | `toDataFrame(Schema schema, org.apache.spark.api.java.JavaRDD<java.util.List<Writable>> data)`: Create a DataFrame from an RDD of writables, given a schema. |
`static DataRowsFacade` | `toDataFrameSequence(Schema schema, org.apache.spark.api.java.JavaRDD<java.util.List<java.util.List<Writable>>> data)`: Convert the given sequence data set to a DataFrame. Note: the resulting DataFrame has two additional columns added to it. Column 0 is the sequence UUID (name: `SEQUENCE_UUID_COLUMN`), a UUID for the original sequence; column 1 is the sequence index (name: `SEQUENCE_INDEX_COLUMN`), an index (integer, starting at 0) for the position of this record in the original time series. These two columns are required if the data is to be converted back into a sequence at a later point, for example using `toRecordsSequence(DataRowsFacade)`. |
`static java.util.List<java.lang.String>` | `toList(java.lang.String[] input)`: Convert a string array into a list. |
`static org.nd4j.linalg.api.ndarray.INDArray` | `toMatrix(java.util.List<org.apache.spark.sql.Row> rows)`: Convert a list of rows to a matrix. |
`static Pair<Schema,org.apache.spark.api.java.JavaRDD<java.util.List<Writable>>>` | `toRecords(DataRowsFacade dataFrame)`: Create a compatible schema and RDD for DataVec. |
`static Pair<Schema,org.apache.spark.api.java.JavaRDD<java.util.List<java.util.List<Writable>>>>` | `toRecordsSequence(DataRowsFacade dataFrame)`: Convert the given DataFrame to a sequence. Note: it is assumed here that the DataFrame has been created by `toDataFrameSequence(Schema, JavaRDD)`. |
`static org.apache.spark.sql.Column` | `var(DataRowsFacade dataFrame, java.lang.String columnName)`: Variance for a column. |
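The two sequence columns described above follow a simple convention: every record of a sequence is prefixed with the sequence's UUID and its 0-based position. A minimal plain-Java sketch of that convention (the library builds these columns internally; the column names used here are placeholders for illustration only, not the actual values of `SEQUENCE_UUID_COLUMN` / `SEQUENCE_INDEX_COLUMN`):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

public class SequenceColumnsSketch {
    // Placeholder names standing in for DataFrames.SEQUENCE_UUID_COLUMN /
    // SEQUENCE_INDEX_COLUMN; the real constant values are library-defined.
    static final String SEQUENCE_UUID_COLUMN = "sequence_uuid";
    static final String SEQUENCE_INDEX_COLUMN = "sequence_index";

    // Prepend a per-sequence UUID and a 0-based step index to each record,
    // as the toDataFrameSequence note describes for columns 0 and 1.
    static List<List<Object>> flattenSequence(List<List<Object>> sequence) {
        String uuid = UUID.randomUUID().toString(); // one UUID per original sequence
        List<List<Object>> rows = new ArrayList<>();
        for (int i = 0; i < sequence.size(); i++) {
            List<Object> row = new ArrayList<>();
            row.add(uuid); // column 0: sequence UUID
            row.add(i);    // column 1: sequence index, starting at 0
            row.addAll(sequence.get(i));
            rows.add(row);
        }
        return rows;
    }
}
```

Grouping rows by column 0 and sorting within each group by column 1 is what makes the reverse conversion (`toRecordsSequence(DataRowsFacade)`) possible after the rows have been shuffled across a cluster.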
`public static final java.lang.String SEQUENCE_UUID_COLUMN`

`public static final java.lang.String SEQUENCE_INDEX_COLUMN`
`public static org.apache.spark.sql.Column std(DataRowsFacade dataFrame, java.lang.String columnName)`

Parameters:
- `dataFrame` - the dataframe to get the column from
- `columnName` - the name of the column to get the standard deviation for

`public static org.apache.spark.sql.Column var(DataRowsFacade dataFrame, java.lang.String columnName)`

Parameters:
- `dataFrame` - the dataframe to get the column from
- `columnName` - the name of the column to get the variance for

`public static org.apache.spark.sql.Column min(DataRowsFacade dataFrame, java.lang.String columnName)`

Parameters:
- `dataFrame` - the dataframe to get the column from
- `columnName` - the name of the column to get the min for

`public static org.apache.spark.sql.Column max(DataRowsFacade dataFrame, java.lang.String columnName)`

Parameters:
- `dataFrame` - the dataframe to get the column from
- `columnName` - the name of the column to get the max for

`public static org.apache.spark.sql.Column mean(DataRowsFacade dataFrame, java.lang.String columnName)`

Parameters:
- `dataFrame` - the dataframe to get the column from
- `columnName` - the name of the column to get the mean for

`public static org.apache.spark.sql.types.StructType fromSchema(Schema schema)`

Parameters:
- `schema` - the schema to convert

`public static org.apache.spark.sql.types.StructType fromSchemaSequence(Schema schema)`

Convert the DataVec sequence schema to a StructType for Spark, for example for use in `toDataFrameSequence(Schema, JavaRDD)`. Note: as per `toDataFrameSequence(Schema, JavaRDD)`, the StructType has two additional columns added to it:
- Column 0: sequence UUID (name: `SEQUENCE_UUID_COLUMN`) - a UUID for the original sequence
- Column 1: sequence index (name: `SEQUENCE_INDEX_COLUMN`) - an index (integer, starting at 0) for the position of this record in the original time series

These two columns are required if the data is to be converted back into a sequence at a later point, for example using `toRecordsSequence(DataRowsFacade)`.

Parameters:
- `schema` - Schema to convert

`public static Schema fromStructType(org.apache.spark.sql.types.StructType structType)`

Parameters:
- `structType` - the struct type to create the schema from

`public static Pair<Schema,org.apache.spark.api.java.JavaRDD<java.util.List<Writable>>> toRecords(DataRowsFacade dataFrame)`

Parameters:
- `dataFrame` - the dataframe to convert

`public static Pair<Schema,org.apache.spark.api.java.JavaRDD<java.util.List<java.util.List<Writable>>>> toRecordsSequence(DataRowsFacade dataFrame)`

Convert the given DataFrame to a sequence. It is assumed here that the DataFrame has been created by `toDataFrameSequence(Schema, JavaRDD)`; in particular, it is expected to contain the sequence UUID and sequence index columns described there. Typical use: normalization via the `Normalization` static methods.

Parameters:
- `dataFrame` - Data frame to convert back to `List<List<Writable>>` form

`public static DataRowsFacade toDataFrame(Schema schema, org.apache.spark.api.java.JavaRDD<java.util.List<Writable>> data)`

Parameters:
- `schema` - the schema to use
- `data` - the data to convert

`public static DataRowsFacade toDataFrameSequence(Schema schema, org.apache.spark.api.java.JavaRDD<java.util.List<java.util.List<Writable>>> data)`

Convert the given sequence data set to a DataFrame. Note: the resulting DataFrame has two additional columns added to it:
- Column 0: sequence UUID (name: `SEQUENCE_UUID_COLUMN`) - a UUID for the original sequence
- Column 1: sequence index (name: `SEQUENCE_INDEX_COLUMN`) - an index (integer, starting at 0) for the position of this record in the original time series

These two columns are required if the data is to be converted back into a sequence at a later point, for example using `toRecordsSequence(DataRowsFacade)`.

Parameters:
- `schema` - Schema for the data
- `data` - Sequence data to convert to a DataFrame

`public static java.util.List<Writable> rowToWritables(Schema schema, org.apache.spark.sql.Row row)`

Parameters:
- `schema` - Schema for the data
- `row` - Row of data

`public static java.util.List<java.lang.String> toList(java.lang.String[] input)`

Parameters:
- `input` - the input to create the list from

`public static java.lang.String[] toArray(java.util.List<java.lang.String> list)`

Parameters:
- `list` - the input to create the array from

`public static org.nd4j.linalg.api.ndarray.INDArray toMatrix(java.util.List<org.apache.spark.sql.Row> rows)`

Parameters:
- `rows` - the list of rows to convert

`public static java.util.List<org.apache.spark.sql.Column> toColumn(java.util.List<java.lang.String> columns)`

Parameters:
- `columns` - the columns to convert

`public static org.apache.spark.sql.Column[] toColumns(java.lang.String... columns)`

Parameters:
- `columns` - the columns to convert
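`toMatrix` stacks a list of Spark Rows into a single INDArray. A plain-Java analogue using `double[][]` illustrates the row-stacking involved (an illustrative sketch only, not the library implementation, which returns an ND4J `INDArray` from `org.apache.spark.sql.Row` values):

```java
import java.util.List;

public class ToMatrixSketch {
    // Stack equally-sized numeric rows into a 2-D array, analogous to
    // DataFrames.toMatrix stacking a List<Row> into an INDArray.
    static double[][] toMatrix(List<double[]> rows) {
        double[][] matrix = new double[rows.size()][];
        for (int i = 0; i < rows.size(); i++) {
            matrix[i] = rows.get(i).clone(); // defensive copy of each row
        }
        return matrix;
    }
}
```

Each input row becomes one row of the matrix, so all rows are expected to have the same length (one value per schema column).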