public class Normalization
extends java.lang.Object
Constructor and Description |
---|
Normalization() |
Modifier and Type | Method and Description |
---|---|
static java.util.List<org.apache.spark.sql.Row> | aggregate(DataRowsFacade data, java.lang.String[] columns, java.lang.String[] functions). Aggregate based on an arbitrary list of aggregation and grouping functions |
static java.util.List<org.apache.spark.sql.Row> | minMaxColumns(DataRowsFacade data, java.util.List<java.lang.String> columns). Returns the min and max of the given columns |
static java.util.List<org.apache.spark.sql.Row> | minMaxColumns(DataRowsFacade data, java.lang.String... columns). Returns the min and max of the given columns |
static DataRowsFacade | normalize(DataRowsFacade dataFrame). Scale based on min/max |
static DataRowsFacade | normalize(DataRowsFacade dataFrame, double min, double max). Scale based on min/max |
static DataRowsFacade | normalize(DataRowsFacade dataFrame, double min, double max, java.util.List<java.lang.String> skipColumns). Scale based on min/max |
static DataRowsFacade | normalize(DataRowsFacade dataFrame, java.util.List<java.lang.String> skipColumns). Scale based on min/max |
static org.apache.spark.api.java.JavaRDD<java.util.List<Writable>> | normalize(Schema schema, org.apache.spark.api.java.JavaRDD<java.util.List<Writable>> data). Scale all data 0 to 1 |
static org.apache.spark.api.java.JavaRDD<java.util.List<Writable>> | normalize(Schema schema, org.apache.spark.api.java.JavaRDD<java.util.List<Writable>> data, double min, double max). Scale based on min/max |
static org.apache.spark.api.java.JavaRDD<java.util.List<Writable>> | normalize(Schema schema, org.apache.spark.api.java.JavaRDD<java.util.List<Writable>> data, double min, double max, java.util.List<java.lang.String> skipColumns). Scale based on min/max |
static org.apache.spark.api.java.JavaRDD<java.util.List<Writable>> | normalize(Schema schema, org.apache.spark.api.java.JavaRDD<java.util.List<Writable>> data, java.util.List<java.lang.String> skipColumns). Scale all data 0 to 1 |
static org.apache.spark.api.java.JavaRDD<java.util.List<java.util.List<Writable>>> | normalizeSequence(Schema schema, org.apache.spark.api.java.JavaRDD<java.util.List<java.util.List<Writable>>> data) |
static org.apache.spark.api.java.JavaRDD<java.util.List<java.util.List<Writable>>> | normalizeSequence(Schema schema, org.apache.spark.api.java.JavaRDD<java.util.List<java.util.List<Writable>>> data, double min, double max). Normalize each column of a sequence, based on min/max |
static org.apache.spark.api.java.JavaRDD<java.util.List<java.util.List<Writable>>> | normalizeSequence(Schema schema, org.apache.spark.api.java.JavaRDD<java.util.List<java.util.List<Writable>>> data, double min, double max, java.util.List<java.lang.String> excludeColumns). Normalize each column of a sequence, based on min/max |
static java.util.List<org.apache.spark.sql.Row> | stdDevMeanColumns(DataRowsFacade data, java.util.List<java.lang.String> columns). Returns the standard deviation and mean of the given columns |
static java.util.List<org.apache.spark.sql.Row> | stdDevMeanColumns(DataRowsFacade data, java.lang.String... columns). Returns the standard deviation and mean of the given columns; the list returned is a list of size 2 where each row represents the standard deviation of each column and the mean of each column |
static DataRowsFacade | zeromeanUnitVariance(DataRowsFacade frame). Normalize by zero mean unit variance |
static DataRowsFacade | zeromeanUnitVariance(DataRowsFacade frame, java.util.List<java.lang.String> skipColumns). Normalize by zero mean unit variance |
static org.apache.spark.api.java.JavaRDD<java.util.List<Writable>> | zeromeanUnitVariance(Schema schema, org.apache.spark.api.java.JavaRDD<java.util.List<Writable>> data). Normalize by zero mean unit variance |
static org.apache.spark.api.java.JavaRDD<java.util.List<Writable>> | zeromeanUnitVariance(Schema schema, org.apache.spark.api.java.JavaRDD<java.util.List<Writable>> data, java.util.List<java.lang.String> skipColumns). Normalize by zero mean unit variance |
static org.apache.spark.api.java.JavaRDD<java.util.List<java.util.List<Writable>>> | zeroMeanUnitVarianceSequence(Schema schema, org.apache.spark.api.java.JavaRDD<java.util.List<java.util.List<Writable>>> sequence). Normalize the sequence by zero mean unit variance |
static org.apache.spark.api.java.JavaRDD<java.util.List<java.util.List<Writable>>> | zeroMeanUnitVarianceSequence(Schema schema, org.apache.spark.api.java.JavaRDD<java.util.List<java.util.List<Writable>>> sequence, java.util.List<java.lang.String> excludeColumns). Normalize the sequence by zero mean unit variance |
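The workflow these methods imply is: describe the columns with a Schema, load each record as a java.util.List<Writable> in a JavaRDD, then call one of the static methods above. The sketch below shows the 0-to-1 scaling path; it is a hedged illustration, assuming the usual DataVec package locations (org.datavec.api.transform.schema.Schema, org.datavec.api.writable.DoubleWritable, org.datavec.spark.transform.Normalization), which are not stated on this page.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.datavec.api.transform.schema.Schema;
import org.datavec.api.writable.DoubleWritable;
import org.datavec.api.writable.Writable;
import org.datavec.spark.transform.Normalization;

public class NormalizationExample {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setMaster("local[*]").setAppName("normalization-example"));

        // Schema with two numeric columns; each record holds one Writable per column
        Schema schema = new Schema.Builder()
                .addColumnDouble("height")
                .addColumnDouble("weight")
                .build();

        JavaRDD<List<Writable>> data = sc.parallelize(Arrays.asList(
                Arrays.<Writable>asList(new DoubleWritable(150), new DoubleWritable(50)),
                Arrays.<Writable>asList(new DoubleWritable(175), new DoubleWritable(75)),
                Arrays.<Writable>asList(new DoubleWritable(200), new DoubleWritable(100))));

        // Scale every column into the range [0, 1]
        JavaRDD<List<Writable>> scaled = Normalization.normalize(schema, data);
        System.out.println(scaled.collect());

        sc.stop();
    }
}
```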
public static DataRowsFacade zeromeanUnitVariance(DataRowsFacade frame)

Normalize by zero mean unit variance

Parameters:
frame - the data to normalize

public static org.apache.spark.api.java.JavaRDD<java.util.List<Writable>> zeromeanUnitVariance(Schema schema, org.apache.spark.api.java.JavaRDD<java.util.List<Writable>> data)

Normalize by zero mean unit variance

Parameters:
schema - the schema to use to create the data frame
data - the data to normalize
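Reusing the schema and data variables from the sketch after the summary table, the RDD overload above standardizes every column; a minimal, hedged one-liner:

```java
// Subtract each column's mean and divide by its standard deviation
JavaRDD<List<Writable>> standardized = Normalization.zeromeanUnitVariance(schema, data);
```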
public static DataRowsFacade normalize(DataRowsFacade dataFrame, double min, double max)

Scale based on min/max

Parameters:
dataFrame - the dataframe to scale
min - the minimum value
max - the maximum value

public static org.apache.spark.api.java.JavaRDD<java.util.List<Writable>> normalize(Schema schema, org.apache.spark.api.java.JavaRDD<java.util.List<Writable>> data, double min, double max)

Scale based on min/max

Parameters:
schema - the schema of the data to scale
data - the data to scale
min - the minimum value
max - the maximum value

public static DataRowsFacade normalize(DataRowsFacade dataFrame)

Scale based on min/max

Parameters:
dataFrame - the dataframe to scale

public static org.apache.spark.api.java.JavaRDD<java.util.List<Writable>> normalize(Schema schema, org.apache.spark.api.java.JavaRDD<java.util.List<Writable>> data)

Scale all data 0 to 1

Parameters:
schema - the schema of the data to scale
data - the data to scale
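With the same schema and data as before, the min/max overload rescales into an arbitrary target range, for example [-1, 1]:

```java
// Rescale every column into [-1, 1] instead of the default [0, 1]
JavaRDD<List<Writable>> rescaled = Normalization.normalize(schema, data, -1.0, 1.0);
```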
public static DataRowsFacade zeromeanUnitVariance(DataRowsFacade frame, java.util.List<java.lang.String> skipColumns)

Normalize by zero mean unit variance

Parameters:
frame - the data to normalize
skipColumns - the columns to skip

public static org.apache.spark.api.java.JavaRDD<java.util.List<Writable>> zeromeanUnitVariance(Schema schema, org.apache.spark.api.java.JavaRDD<java.util.List<Writable>> data, java.util.List<java.lang.String> skipColumns)

Normalize by zero mean unit variance

Parameters:
schema - the schema to use to create the data frame
data - the data to normalize
skipColumns - the columns to skip

public static org.apache.spark.api.java.JavaRDD<java.util.List<java.util.List<Writable>>> zeroMeanUnitVarianceSequence(Schema schema, org.apache.spark.api.java.JavaRDD<java.util.List<java.util.List<Writable>>> sequence)

Normalize the sequence by zero mean unit variance

Parameters:
schema - Schema of the data to normalize
sequence - Sequence data

public static org.apache.spark.api.java.JavaRDD<java.util.List<java.util.List<Writable>>> zeroMeanUnitVarianceSequence(Schema schema, org.apache.spark.api.java.JavaRDD<java.util.List<java.util.List<Writable>>> sequence, java.util.List<java.lang.String> excludeColumns)

Normalize the sequence by zero mean unit variance

Parameters:
schema - Schema of the data to normalize
sequence - Sequence data
excludeColumns - List of columns to exclude from the normalization
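The sequence variants operate on JavaRDD<java.util.List<java.util.List<Writable>>>: each element is one sequence of time steps, and each time step is one record. A sketch under the same package assumptions as the earlier examples, excluding the "weight" column from standardization:

```java
// Each sequence is a List of time steps; each time step is a List<Writable>
JavaRDD<List<List<Writable>>> sequences = sc.parallelize(Arrays.asList(
        Arrays.asList(
                Arrays.<Writable>asList(new DoubleWritable(150), new DoubleWritable(50)),
                Arrays.<Writable>asList(new DoubleWritable(160), new DoubleWritable(60)))));

// Standardize each column over all time steps, leaving "weight" untouched
JavaRDD<List<List<Writable>>> standardizedSeq =
        Normalization.zeroMeanUnitVarianceSequence(schema, sequences, Arrays.asList("weight"));
```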
public static java.util.List<org.apache.spark.sql.Row> minMaxColumns(DataRowsFacade data, java.util.List<java.lang.String> columns)

Returns the min and max of the given columns

Parameters:
data - the data to get the min and max for
columns - the columns to get the min and max for

public static java.util.List<org.apache.spark.sql.Row> minMaxColumns(DataRowsFacade data, java.lang.String... columns)

Returns the min and max of the given columns

Parameters:
data - the data to get the min and max for
columns - the columns to get the min and max for

public static java.util.List<org.apache.spark.sql.Row> stdDevMeanColumns(DataRowsFacade data, java.util.List<java.lang.String> columns)

Returns the standard deviation and mean of the given columns

Parameters:
data - the data to get the standard deviation and mean for
columns - the columns to get the standard deviation and mean for

public static java.util.List<org.apache.spark.sql.Row> stdDevMeanColumns(DataRowsFacade data, java.lang.String... columns)

Returns the standard deviation and mean of the given columns. The list returned is a list of size 2 where each row represents the standard deviation of each column and the mean of each column

Parameters:
data - the data to get the standard deviation and mean for
columns - the columns to get the standard deviation and mean for

public static java.util.List<org.apache.spark.sql.Row> aggregate(DataRowsFacade data, java.lang.String[] columns, java.lang.String[] functions)

Aggregate based on an arbitrary list of aggregation and grouping functions

Parameters:
data - the dataframe to aggregate
columns - the columns to aggregate
functions - the functions to use
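The statistics and aggregation methods take the data in DataFrame form, wrapped in a DataRowsFacade. Assuming the companion DataFrames.toDataFrame(schema, data) helper from the same package produces that facade (an assumption worth verifying against your DataVec version), usage looks roughly like:

```java
// Assumption: DataFrames.toDataFrame wraps the RDD + schema as a DataRowsFacade
DataRowsFacade facade = DataFrames.toDataFrame(schema, data);

// Per-column min/max and stddev/mean, returned as lists of Spark SQL Rows
List<org.apache.spark.sql.Row> minMax = Normalization.minMaxColumns(facade, "height", "weight");
List<org.apache.spark.sql.Row> stats = Normalization.stdDevMeanColumns(facade, "height", "weight");

// Arbitrary aggregation: min of "height", max of "weight"
List<org.apache.spark.sql.Row> agg = Normalization.aggregate(
        facade,
        new String[] {"height", "weight"},
        new String[] {"min", "max"});
```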
public static DataRowsFacade normalize(DataRowsFacade dataFrame, double min, double max, java.util.List<java.lang.String> skipColumns)

Scale based on min/max

Parameters:
dataFrame - the dataframe to scale
min - the minimum value
max - the maximum value
skipColumns - the columns to skip

public static org.apache.spark.api.java.JavaRDD<java.util.List<Writable>> normalize(Schema schema, org.apache.spark.api.java.JavaRDD<java.util.List<Writable>> data, double min, double max, java.util.List<java.lang.String> skipColumns)

Scale based on min/max

Parameters:
schema - the schema of the data to scale
data - the data to scale
min - the minimum value
max - the maximum value
skipColumns - the columns to skip

public static org.apache.spark.api.java.JavaRDD<java.util.List<java.util.List<Writable>>> normalizeSequence(Schema schema, org.apache.spark.api.java.JavaRDD<java.util.List<java.util.List<Writable>>> data)

Parameters:
schema - Schema of the data
data - Data to normalize

public static org.apache.spark.api.java.JavaRDD<java.util.List<java.util.List<Writable>>> normalizeSequence(Schema schema, org.apache.spark.api.java.JavaRDD<java.util.List<java.util.List<Writable>>> data, double min, double max)

Normalize each column of a sequence, based on min/max

Parameters:
schema - Schema of the data
data - Data to normalize
min - New minimum value
max - New maximum value
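The skipColumns overloads let label or categorical columns pass through unchanged while the rest are rescaled; reusing the earlier schema and data:

```java
// Scale "height" into [0, 1] while leaving "weight" as-is
JavaRDD<List<Writable>> partiallyScaled = Normalization.normalize(
        schema, data, 0.0, 1.0, Arrays.asList("weight"));
```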
public static org.apache.spark.api.java.JavaRDD<java.util.List<java.util.List<Writable>>> normalizeSequence(Schema schema, org.apache.spark.api.java.JavaRDD<java.util.List<java.util.List<Writable>>> data, double min, double max, java.util.List<java.lang.String> excludeColumns)

Normalize each column of a sequence, based on min/max

Parameters:
schema - Schema of the data
data - Data to normalize
min - New minimum value
max - New maximum value
excludeColumns - List of columns to exclude

public static DataRowsFacade normalize(DataRowsFacade dataFrame, java.util.List<java.lang.String> skipColumns)

Scale based on min/max

Parameters:
dataFrame - the dataframe to scale
skipColumns - the columns to skip

public static org.apache.spark.api.java.JavaRDD<java.util.List<Writable>> normalize(Schema schema, org.apache.spark.api.java.JavaRDD<java.util.List<Writable>> data, java.util.List<java.lang.String> skipColumns)

Scale all data 0 to 1

Parameters:
schema - the schema of the data to scale
data - the data to scale
skipColumns - the columns to skip
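Finally, the sequence min/max overload with excludeColumns combines both ideas; with the sequences RDD from the earlier sequence sketch:

```java
// Scale each sequence column into [0, 1], excluding "weight" from the rescale
JavaRDD<List<List<Writable>>> seqScaled = Normalization.normalizeSequence(
        schema, sequences, 0.0, 1.0, Arrays.asList("weight"));
```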