preprocessing module

A data-processing package for data preprocessing and feature engineering.

This library contains preprocessing methods for data processing and feature engineering used during data analysis and the machine learning process.

Classes

DropColumns([drop_columns])

Delete the columns specified from the input dataframe.

DropNoVariance()

Delete columns that contain only a single unique value.

DropHighCardinality([max_categories])

Delete columns with high cardinality.

DropLowAUC([threshold])

Delete columns that carry little information for predicting the target variable.

DropHighCorrelation([threshold])

Delete features that are highly correlated to each other.

ImputeNaN([cat_strategy, num_strategy])

Look for NaN values in the dataframe and impute them by a strategy such as mean, median, or mode.

OneHotEncoding([drop_first])

One Hot Encoding of categorical variables.

BinarizeNaN()

Find a column with missing values, and create a new column indicating whether a value was missing (0) or not (1).

CountRowNaN()

Calculate the total number of NaN values in each row and create a new column to store the total.

ClipData([threshold])

Clip data by replacing values above the upper bound with the upper bound and values below the lower bound with the lower bound.

GroupRareCategory([threshold])

Replace rare categories that appear in categorical columns with a dummy string.

TargetMeanEncoding([k, f, smoothing])

Target Mean Encoding of categorical variables.

StandardScaling()

Standardize datasets to have mean = 0 and std = 1.

MinMaxScaling()

Rescale data into the range between 0 and 1.

CountEncoding()

Encode categorical variables by the count of each category within the categorical column.

RankedCountEncoding()

First encode categorical variables by the count of each category within the categorical column.

FrequencyEncoding()

Encode categorical variables by the frequency of each category within the categorical column.

RankedTargetMeanEncoding([k, f, smoothing])

Ranking with Target Mean Encoding of categorical variables.

AppendAnomalyScore([n_estimators, random_state])

Append anomaly score calculated from isolation forest.

AppendCluster([n_clusters, random_state])

Append cluster number obtained from kmeans++ clustering.

AppendClusterDistance([n_clusters, random_state])

Append cluster distance obtained from kmeans++ clustering.

AppendPrincipalComponent([n_components, …])

Append principal components obtained from PCA.

AppendArithmeticFeatures([max_features, …])

A transformer that recognizes all numerical features and creates new features by arithmetic operations.

RankedEvaluationMetricEncoding([metric])

Encode categorical columns by first creating dummy variables, then fitting a LogisticRegression against the target variable for each dummy variable.

AppendClassificationModel([model, probability])

Append prediction from model as a new feature.

AppendEncoder([encoder])

Append encoders in the DataLiner module.

AppendClusterTargetMean([n_clusters, …])

Append cluster number obtained from kmeans++ clustering, with each cluster number replaced by the cluster's target mean.

PermutationImportanceTest([threshold])

Conduct permutation importance tests on features and drop features that are not effective.

UnionAppend([append_list])

Concatenate features extracted from the original input data by AppendXXX transformers in the DataLiner package.

Functions

load_titanic()

Load train and test data for the Titanic dataset.

class preprocessing.DropColumns(drop_columns=None)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Delete the columns specified from the input dataframe.

Parameters

drop_columns (list) – List of feature names that will be dropped from the input dataframe. For a single column, a string can also be used. (default=None)

Methods

fit(X[, y])

Fit transformer by checking that X is a pandas DataFrame.

transform(X)

Transform X by dropping columns specified in drop_columns

fit(X, y=None)[source]

Fit transformer by checking that X is a pandas DataFrame.

Parameters
  • X (pandas.DataFrame) – Input dataframe

  • y (pandas.Series) – Ignored. (default=None)

Returns

fitted object (self)

Return type

object

transform(X)[source]

Transform X by dropping columns specified in drop_columns

Parameters

X (pandas.DataFrame) – Input dataframe

Returns

Transformed input DataFrame

Return type

pandas.DataFrame
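As a quick illustration, the transform is equivalent to a plain pandas column drop. A minimal sketch of that equivalent operation (plain pandas, not the class itself):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# Equivalent of what DropColumns(drop_columns=["b", "c"]) would transform to
dropped = df.drop(columns=["b", "c"])
print(list(dropped.columns))  # ['a']
```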

class preprocessing.DropNoVariance[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Delete columns that contain only a single unique value.

Methods

fit(X[, y])

Fit transformer by identifying columns with a single unique value.

transform(X)

Transform X by dropping the columns identified during fit.

fit(X, y=None)[source]

Fit transformer by identifying columns with a single unique value.

Parameters
  • X (pandas.DataFrame) – Input dataframe

  • y (pandas.Series) – Ignored. (default=None)

Returns

fitted object (self)

Return type

object

transform(X)[source]

Transform X by dropping the columns identified during fit.

Parameters

X (pandas.DataFrame) – Input dataframe

Returns

Transformed input DataFrame

Return type

pandas.DataFrame
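The underlying idea can be sketched in plain pandas (an illustrative equivalent, not the class implementation):

```python
import pandas as pd

df = pd.DataFrame({"const": ["x", "x", "x"], "varied": [1, 2, 3]})

# Columns with a single unique value carry no information
no_variance = [col for col in df.columns if df[col].nunique() == 1]
result = df.drop(columns=no_variance)
```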

class preprocessing.DropHighCardinality(max_categories=50)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Delete columns with high cardinality, i.e. columns with too many categories.

Parameters

max_categories (int) – Maximum number of categories permitted in a column. If the number of categories in a column exceeds this value, that column will be deleted. (default=50)

Methods

fit(X[, y])

Fit transformer by identifying columns with high cardinality.

transform(X)

Transform X by dropping the columns identified during fit.

fit(X, y=None)[source]

Fit transformer by identifying columns with high cardinality.

Parameters
  • X (pandas.DataFrame) – Input dataframe

  • y (pandas.Series) – Ignored. (default=None)

Returns

fitted object (self)

Return type

object

transform(X)[source]

Transform X by dropping the columns identified during fit.

Parameters

X (pandas.DataFrame) – Input dataframe

Returns

Transformed input DataFrame

Return type

pandas.DataFrame
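A plain-pandas sketch of the cardinality check (illustrative only; the class's exact column-type detection is an assumption):

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [f"user_{i}" for i in range(100)],  # 100 unique categories
    "grade": ["a", "b"] * 50,                      # 2 categories
})

max_categories = 50
high_card = [col for col in df.columns
             if df[col].dtype == object and df[col].nunique() > max_categories]
result = df.drop(columns=high_card)
```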

class preprocessing.DropLowAUC(threshold=0.51)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Delete columns that carry little information for predicting the target variable. This class calculates roc_auc by fitting each feature in the input dataframe one by one against the target variable using Logistic Regression, and drops features whose roc_auc falls below the specified threshold. Missing values are replaced by the mean, and categorical features are converted to dummy variables by one hot encoding with missing values filled with the mode.

Parameters

threshold (float) – Threshold value for roc_auc. Features with roc_auc below this value will be deleted. (default=0.51)

Methods

fit(X[, y])

Fit transformer by fitting each feature with Logistic Regression and storing the features whose roc_auc is below the threshold.

transform(X)

Transform X by dropping the columns identified during fit.

fit(X, y=None)[source]

Fit transformer by fitting each feature with Logistic Regression and storing the features whose roc_auc is below the threshold.

Parameters
  • X (pandas.DataFrame) – Input dataframe

  • y (pandas.Series) – Input Series for target variable

Returns

fitted object (self)

Return type

object

transform(X)[source]

Transform X by dropping the columns identified during fit.

Parameters

X (pandas.DataFrame) – Input dataframe

Returns

Transformed input DataFrame

Return type

pandas.DataFrame
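The per-feature scoring step can be sketched with scikit-learn. This is a simplified illustration of the idea (in-sample AUC, no cross-validation or imputation; the class's exact procedure is an assumption):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 200)
informative = (y + rng.normal(0, 0.3, 200)).reshape(-1, 1)  # tracks the target
noise = rng.normal(0, 1, 200).reshape(-1, 1)                # unrelated to the target

def single_feature_auc(feature, target):
    # Fit one feature at a time against the target and score it with roc_auc
    model = LogisticRegression().fit(feature, target)
    return roc_auc_score(target, model.predict_proba(feature)[:, 1])

auc_informative = single_feature_auc(informative, y)
auc_noise = single_feature_auc(noise, y)
```

Features like `noise`, scoring near 0.5, would fall below the 0.51 threshold and be dropped.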

class preprocessing.DropHighCorrelation(threshold=0.95)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Delete features that are highly correlated to each other. From each highly correlated pair within X, the feature better correlated with the target variable is kept.

Parameters

threshold (float) – Threshold value for Pearson’s correlation coefficient. (default=0.95)

Methods

fit(X[, y])

Fit transformer by identifying highly correlated variable pairs and dropping the one that is less correlated to the target variable.

transform(X)

Transform X by dropping the columns identified during fit.

fit(X, y=None)[source]

Fit transformer by identifying highly correlated variable pairs and dropping the one that is less correlated to the target variable. Missing values will be imputed by the mean.

Parameters
  • X (pandas.DataFrame) – Input dataframe

  • y (pandas.Series) – Input Series for target variable

Returns

fitted object (self)

Return type

object

transform(X)[source]

Transform X by dropping the columns identified during fit.

Parameters

X (pandas.DataFrame) – Input dataframe

Returns

Transformed input DataFrame

Return type

pandas.DataFrame
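Identifying the highly correlated pairs can be sketched with a pandas correlation matrix (illustrative only; tie-breaking against the target is omitted):

```python
import pandas as pd

df = pd.DataFrame({
    "x":     [1.0, 2.0, 3.0, 4.0],
    "x_dup": [2.0, 4.0, 6.0, 8.0],  # perfectly correlated with x
    "z":     [4.0, 1.0, 3.0, 2.0],
})

corr = df.corr().abs()
cols = list(corr.columns)
# Pairs above the threshold, scanning the upper triangle only
pairs = [(a, b) for i, a in enumerate(cols)
         for b in cols[i + 1:] if corr.loc[a, b] > 0.95]
```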

class preprocessing.ImputeNaN(cat_strategy='mode', num_strategy='mean')[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Look for NaN values in the dataframe and impute them by a strategy such as mean, median, or mode.

Parameters
  • cat_strategy (string) – Strategy for imputing NaN values in categorical columns. If any string other than mode is specified, NaN will be imputed with the fixed string ImputedNaN. (default=’mode’)

  • num_strategy (string) – Strategy for imputing NaN values in numerical columns. Either mean, median or mode can be specified; if any other string is given, mean imputation will be used. (default=’mean’)

Methods

fit(X[, y])

Fit transformer by identifying numerical and categorical columns.

transform(X)

Transform X by imputing with values obtained from fitting stage.

fit(X, y=None)[source]

Fit transformer by identifying numerical and categorical columns. Then, based on the strategy, it stores the values to be used for the NaN values in each column.

Parameters
  • X (pandas.DataFrame) – Input dataframe

  • y (pandas.Series) – Ignored. (default=None)

Returns

fitted object (self)

Return type

object

transform(X)[source]

Transform X by imputing with values obtained from fitting stage.

Parameters

X (pandas.DataFrame) – Input dataframe

Returns

Transformed input DataFrame

Return type

pandas.DataFrame
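The default strategies correspond to straightforward pandas fills. A sketch of the equivalent operations (not the class itself):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"num": [1.0, np.nan, 3.0],
                   "cat": ["a", "a", np.nan]})

filled = df.copy()
filled["num"] = filled["num"].fillna(filled["num"].mean())     # num_strategy='mean'
filled["cat"] = filled["cat"].fillna(filled["cat"].mode()[0])  # cat_strategy='mode'
```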

class preprocessing.OneHotEncoding(drop_first=True)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

One Hot Encoding of categorical variables.

Parameters

drop_first (bool) – Whether to drop the first column after one hot encoding in order to avoid multicollinearity. (default=True)

Methods

fit(X[, y])

Fit transformer by getting column names after one hot encoding.

transform(X)

Transform X by one hot encoding.

fit(X, y=None)[source]

Fit transformer by getting column names after one hot encoding.

Parameters
  • X (pandas.DataFrame) – Input dataframe

  • y (pandas.Series) – Ignored. (default=None)

Returns

fitted object (self)

Return type

object

transform(X)[source]

Transform X by one hot encoding. Dummy columns that are new in the input (i.e. categories not seen during the fitting stage) will be dropped, and dummy columns that existed during fitting but are missing from the input will be added with all values = 0.

Parameters

X (pandas.DataFrame) – Input dataframe

Returns

Transformed input DataFrame

Return type

pandas.DataFrame
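The core encoding corresponds to pandas get_dummies (a sketch of the equivalent operation; the class's column-alignment behavior between fit and transform is not shown):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red"]})

# drop_first=True drops one dummy per column to avoid multicollinearity
encoded = pd.get_dummies(df, drop_first=True)
```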

class preprocessing.BinarizeNaN[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Find a column with missing values, and create a new column indicating whether a value was missing (0) or not (1).

Methods

fit(X[, y])

Fit transformer by getting the names of columns that contain NaN.

transform(X)

Transform by checking for columns containing NaN values during both the fitting and transforming stages, then binarizing NaN into a new column.

fit(X, y=None)[source]

Fit transformer by getting the names of columns that contain NaN.

Parameters
  • X (pandas.DataFrame) – Input dataframe

  • y (pandas.Series) – Ignored. (default=None)

Returns

fitted object (self)

Return type

object

transform(X)[source]

Transform by checking for columns containing NaN values during both the fitting and transforming stages, then binarizing NaN into a new column.

Parameters

X (pandas.DataFrame) – Input dataframe

Returns

Transformed input DataFrame

Return type

pandas.DataFrame
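A sketch of the flag creation, following the description above (0 = missing, 1 = present); the new column's name here is illustrative, not the class's actual naming scheme:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [22.0, np.nan, 35.0]})

# 1 where a value was present, 0 where it was missing
df["age_NaNFlag"] = df["age"].notnull().astype(int)
```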

class preprocessing.CountRowNaN[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Calculate the total number of NaN values in each row and create a new column to store the total.

Methods

fit(X[, y])

Fit transformer by getting column names during fit.

transform(X)

Transform by checking for columns that exist in both the fitting and transforming stages.

fit(X, y=None)[source]

Fit transformer by getting column names during fit.

Parameters
  • X (pandas.DataFrame) – Input dataframe

  • y (pandas.Series) – Ignored. (default=None)

Returns

fitted object (self)

Return type

object

transform(X)[source]

Transform by checking for columns that exist in both the fitting and transforming stages, then adding up the number of missing values along the row direction.

Parameters

X (pandas.DataFrame) – Input dataframe

Returns

Transformed input DataFrame

Return type

pandas.DataFrame

class preprocessing.ClipData(threshold=0.99)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Clip data by replacing values above the upper bound with the upper bound and values below the lower bound with the lower bound. Missing values will be ignored.

Parameters

threshold (float) – Threshold value used to define the upper and lower bounds. For example, 0.99 implies an upper bound at the 99th percentile and a lower bound at the 1st percentile. (default=0.99)

Methods

fit(X[, y])

Fit transformer to get upper bound and lower bound for numerical columns.

transform(X)

Transform by clipping numerical data

fit(X, y=None)[source]

Fit transformer to get upper bound and lower bound for numerical columns.

Parameters
  • X (pandas.DataFrame) – Input dataframe

  • y (pandas.Series) – Ignored. (default=None)

Returns

fitted object (self)

Return type

object

transform(X)[source]

Transform by clipping numerical data

Parameters

X (pandas.DataFrame) – Input dataframe

Returns

Transformed input DataFrame

Return type

pandas.DataFrame
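The bounds-and-clip step corresponds to pandas quantile and clip (a sketch of the equivalent operation):

```python
import pandas as pd

s = pd.Series(range(1, 101), dtype=float)  # 1.0 .. 100.0

threshold = 0.99
upper = s.quantile(threshold)      # 99th percentile
lower = s.quantile(1 - threshold)  # 1st percentile
clipped = s.clip(lower=lower, upper=upper)
```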

class preprocessing.GroupRareCategory(threshold=0.01)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Replace rare categories that appear in categorical columns with a dummy string.

Parameters

threshold (float) – Threshold value for defining a “rare” category. For example, 0.01 treats categories appearing in 1% of the data or less as “rare”. (default=0.01)

Methods

fit(X[, y])

Fit transformer to define and store rare categories to be replaced.

transform(X)

Transform by replacing rare category with dummy string.

fit(X, y=None)[source]

Fit transformer to define and store rare categories to be replaced.

Parameters
  • X (pandas.DataFrame) – Input dataframe

  • y (pandas.Series) – Ignored. (default=None)

Returns

fitted object (self)

Return type

object

transform(X)[source]

Transform by replacing rare category with dummy string.

Parameters

X (pandas.DataFrame) – Input dataframe

Returns

Transformed input DataFrame

Return type

pandas.DataFrame
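A sketch of the grouping in plain pandas; the replacement string used here is illustrative, not necessarily the class's actual dummy string:

```python
import pandas as pd

s = pd.Series(["a"] * 98 + ["b", "c"])  # "b" and "c" are each 1% of the data

threshold = 0.01
freq = s.value_counts(normalize=True)
rare = freq[freq <= threshold].index
grouped = s.replace(dict.fromkeys(rare, "RareCategory"))
```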

class preprocessing.TargetMeanEncoding(k=0, f=1, smoothing=True)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Target Mean Encoding of categorical variables. Missing values will be treated as one of the categories.

Parameters
  • k (float) – hyperparameter for sigmoid function (default=0.0)

  • f (float) – hyperparameter for sigmoid function (default=1.0)

  • smoothing (bool) – Whether to smooth the target mean with the global mean using a sigmoid function. smoothing=False is not recommended. (default=True)

Methods

fit(X[, y])

Fit transformer to define and store the smoothed target mean for categorical variables.

transform(X)

Transform by replacing categories with smoothed target mean

fit(X, y=None)[source]

Fit transformer to define and store the smoothed target mean for categorical variables.

Parameters
  • X (pandas.DataFrame) – Input dataframe

  • y (pandas.Series) – Input Series for target variable

Returns

fitted object (self)

Return type

object

transform(X)[source]

Transform by replacing categories with smoothed target mean

Parameters

X (pandas.DataFrame) – Input dataframe

Returns

Transformed input DataFrame

Return type

pandas.DataFrame
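The role of k and f can be illustrated with one common form of sigmoid smoothing; the exact formula the class uses is an assumption here:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"cat": ["a", "a", "a", "b"],
                   "target": [1, 1, 0, 1]})

k, f = 0.0, 1.0
global_mean = df["target"].mean()
stats = df.groupby("cat")["target"].agg(["mean", "count"])

# Sigmoid weight: frequent categories keep their own mean,
# infrequent ones are pulled toward the global mean
weight = 1 / (1 + np.exp(-(stats["count"] - k) / f))
smoothed = global_mean * (1 - weight) + stats["mean"] * weight
encoded = df["cat"].map(smoothed)
```

The rare category "b" has a raw target mean of 1.0, but its smoothed value sits between the global mean and 1.0.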

class preprocessing.StandardScaling[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Standardize datasets to have mean = 0 and std = 1. Note this will only standardize numerical data and ignore missing values during computation.

Methods

fit(X[, y])

Fit transformer to get the mean and std of each numerical feature.

transform(X)

Transform by subtracting mean and dividing by std.

fit(X, y=None)[source]

Fit transformer to get the mean and std of each numerical feature.

Parameters
  • X (pandas.DataFrame) – Input dataframe

  • y (pandas.Series) – Ignored. (default=None)

Returns

fitted object (self)

Return type

object

transform(X)[source]

Transform by subtracting mean and dividing by std.

Parameters

X (pandas.DataFrame) – Input dataframe

Returns

Transformed input DataFrame

Return type

pandas.DataFrame

class preprocessing.MinMaxScaling[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Rescale data into the range between 0 and 1. Note this will only rescale numerical data and ignores missing values during computation. If the transform data contains values larger or smaller than those in the fit data, the scaled values will be greater than 1 or less than 0.

Methods

fit(X[, y])

Fit transformer to get the min and max values of each numerical feature.

transform(X)

Transform by subtracting min and dividing by max-min.

fit(X, y=None)[source]

Fit transformer to get the min and max values of each numerical feature.

Parameters
  • X (pandas.DataFrame) – Input dataframe

  • y (pandas.Series) – Ignored. (default=None)

Returns

fitted object (self)

Return type

object

transform(X)[source]

Transform by subtracting min and dividing by max-min.

Parameters

X (pandas.DataFrame) – Input dataframe

Returns

Transformed input DataFrame

Return type

pandas.DataFrame
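A sketch of the scaling arithmetic, including the out-of-range behavior noted above:

```python
import pandas as pd

s = pd.Series([10.0, 20.0, 30.0])

fit_min, fit_max = s.min(), s.max()
scaled = (s - fit_min) / (fit_max - fit_min)

# A transform-time value outside the fit range scales outside [0, 1]
out_of_range = (40.0 - fit_min) / (fit_max - fit_min)
```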

class preprocessing.CountEncoding[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Encode categorical variables by the count of category within the categorical column.

Methods

fit(X[, y])

Fit transformer to define categorical variables and obtain the occurrence count of each category.

transform(X)

Transform by replacing categories with counts

fit(X, y=None)[source]

Fit transformer to define categorical variables and obtain the occurrence count of each category.

Parameters
  • X (pandas.DataFrame) – Input dataframe

  • y (pandas.Series) – Ignored. (default=None)

Returns

fitted object (self)

Return type

object

transform(X)[source]

Transform by replacing categories with counts

Parameters

X (pandas.DataFrame) – Input dataframe

Returns

Transformed input DataFrame

Return type

pandas.DataFrame
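The encoding corresponds to a value-counts lookup (a sketch of the equivalent pandas operation):

```python
import pandas as pd

s = pd.Series(["a", "b", "a", "a", "b", "c"])

counts = s.value_counts()      # a: 3, b: 2, c: 1
count_encoded = s.map(counts)  # each category replaced by its count
```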

class preprocessing.RankedCountEncoding[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

First encode categorical variables by the count of each category within the categorical column. Then, counts are ranked in descending order and the ranks are used to encode the category columns. Even when categories have the same count, the ranking is based on the index, so such categories are still distinguished. RankedFrequencyEncoding is not provided, as its result would be identical to this class.

Methods

fit(X[, y])

Fit transformer to define categorical variables and obtain ranking of category counts.

transform(X)

Transform by replacing categories with ranks

fit(X, y=None)[source]

Fit transformer to define categorical variables and obtain ranking of category counts.

Parameters
  • X (pandas.DataFrame) – Input dataframe

  • y (pandas.Series) – Ignored. (default=None)

Returns

fitted object (self)

Return type

object

transform(X)[source]

Transform by replacing categories with ranks

Parameters

X (pandas.DataFrame) – Input dataframe

Returns

Transformed input DataFrame

Return type

pandas.DataFrame

class preprocessing.FrequencyEncoding[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Encode categorical variables by the frequency of category within the categorical column.

Methods

fit(X[, y])

Fit transformer to define categorical variables and obtain the frequency of each category.

transform(X)

Transform by replacing categories with frequency

fit(X, y=None)[source]

Fit transformer to define categorical variables and obtain the frequency of each category.

Parameters
  • X (pandas.DataFrame) – Input dataframe

  • y (pandas.Series) – Ignored. (default=None)

Returns

fitted object (self)

Return type

object

transform(X)[source]

Transform by replacing categories with frequency

Parameters

X (pandas.DataFrame) – Input dataframe

Returns

Transformed input DataFrame

Return type

pandas.DataFrame

class preprocessing.RankedTargetMeanEncoding(k=0, f=1, smoothing=True)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Ranking with Target Mean Encoding of categorical variables. Missing values will be treated as one of the categories. Categories with the same target mean are treated separately, as the rank is obtained from the index after sorting by target mean.

Parameters
  • k (float) – hyperparameter for sigmoid function (default=0.0)

  • f (float) – hyperparameter for sigmoid function (default=1.0)

  • smoothing (bool) – Whether to smooth the target mean with the global mean using a sigmoid function. smoothing=False is not recommended. (default=True)

Methods

fit(X[, y])

Fit transformer to define and store the smoothed target mean for categorical variables.

transform(X)

Transform by replacing categories with rank

fit(X, y=None)[source]

Fit transformer to define and store the smoothed target mean for categorical variables. Then, a ranking is created based on the target mean.

Parameters
  • X (pandas.DataFrame) – Input dataframe

  • y (pandas.Series) – Input Series for target variable

Returns

fitted object (self)

Return type

object

transform(X)[source]

Transform by replacing categories with rank

Parameters

X (pandas.DataFrame) – Input dataframe

Returns

Transformed input DataFrame

Return type

pandas.DataFrame

class preprocessing.AppendAnomalyScore(n_estimators=100, random_state=1234)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Append anomaly score calculated from isolation forest. Since IsolationForest needs to be fitted, category columns must first be encoded to numerical values.

Parameters
  • n_estimators (int) – Number of base estimators in the Isolation Forest ensemble. (default=100)

  • random_state (int) – random_state for Isolation Forest (default=1234)

Methods

fit(X[, y])

Fit Isolation Forest

transform(X)

Transform X by appending anomaly score

fit(X, y=None)[source]

Fit Isolation Forest

Parameters
  • X (pandas.DataFrame) – Input dataframe

  • y (pandas.Series) – Ignored. (default=None)

Returns

fitted object (self)

Return type

object

transform(X)[source]

Transform X by appending anomaly score

Parameters

X (pandas.DataFrame) – Input dataframe

Returns

Transformed input DataFrame

Return type

pandas.DataFrame
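The appended score can be sketched with scikit-learn's IsolationForest directly (an illustrative equivalent; the appended column's name and the exact score used by the class are assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1234)
df = pd.DataFrame({"x": rng.normal(0, 1, 100),
                   "y": rng.normal(0, 1, 100)})
df.loc[0, ["x", "y"]] = [8.0, 8.0]  # plant an obvious outlier

forest = IsolationForest(n_estimators=100, random_state=1234).fit(df)
scores = forest.decision_function(df)  # lower = more anomalous
df["Anomaly_Score"] = scores
```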

class preprocessing.AppendCluster(n_clusters=8, random_state=1234)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Append cluster number obtained from kmeans++ clustering. For clustering, categorical variables need to be converted to numerical data.

Parameters
  • n_clusters (int) – Number of clusters (default=8)

  • random_state (int) – random_state for KMeans (default=1234)

Methods

fit(X[, y])

Fit KMeans Clustering

transform(X)

Transform X by appending cluster number

fit(X, y=None)[source]

Fit KMeans Clustering

Parameters
  • X (pandas.DataFrame) – Input dataframe

  • y (pandas.Series) – Ignored. (default=None)

Returns

fitted object (self)

Return type

object

transform(X)[source]

Transform X by appending cluster number

Parameters

X (pandas.DataFrame) – Input dataframe

Returns

Transformed input DataFrame

Return type

pandas.DataFrame

class preprocessing.AppendClusterDistance(n_clusters=8, random_state=1234)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Append cluster distance obtained from kmeans++ clustering. For clustering, categorical variables need to be converted to numerical data.

Parameters
  • n_clusters (int) – Number of clusters (default=8)

  • random_state (int) – random_state for KMeans (default=1234)

Methods

fit(X[, y])

Fit KMeans Clustering

transform(X)

Transform X by appending cluster distance

fit(X, y=None)[source]

Fit KMeans Clustering

Parameters
  • X (pandas.DataFrame) – Input dataframe

  • y (pandas.Series) – Ignored. (default=None)

Returns

fitted object (self)

Return type

object

transform(X)[source]

Transform X by appending cluster distance

Parameters

X (pandas.DataFrame) – Input dataframe

Returns

Transformed input DataFrame

Return type

pandas.DataFrame

class preprocessing.AppendPrincipalComponent(n_components=5, random_state=1234)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Append principal components obtained from PCA. For PCA, categorical variables need to be converted to numerical data. Also, the data should be standardized beforehand.

Parameters
  • n_components (int) – Number of principal components (default=5)

  • random_state (int) – random_state for PCA (default=1234)

Methods

fit(X[, y])

Fit PCA

transform(X)

Transform X by appending principal components

fit(X, y=None)[source]

Fit PCA

Parameters
  • X (pandas.DataFrame) – Input dataframe

  • y (pandas.Series) – Ignored. (default=None)

Returns

fitted object (self)

Return type

object

transform(X)[source]

Transform X by appending principal components

Parameters

X (pandas.DataFrame) – Input dataframe

Returns

Transformed input DataFrame

Return type

pandas.DataFrame

class preprocessing.AppendArithmeticFeatures(max_features=50, metric='roc_auc', operation='multiply', replace_zero=0.001)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

A transformer that recognizes all numerical features and creates new features by arithmetic operations. Newly created features are evaluated individually by fitting Logistic Regression against the target variable, and only new features with a higher eval metric than the feature pairs they were created from will be added to the data. Missing values need to be imputed beforehand.

Parameters
  • max_features (int) – Number of numerical features whose combinations are tested. If the number of numerical features in the data exceeds this value, the transformer will raise an exception. (default=50)

  • metric (string) – Metric used to evaluate features. Sklearn default metrics can be used. (default=’roc_auc’)

  • operation (string) – Type of arithmetic operations. ‘add’, ‘subtract’, ‘multiply’, ‘divide’ can be used. (default=’multiply’)

  • replace_zero (float) – Value used to replace 0 when operation=’divide’. Do not use 0, as it may cause a ZeroDivisionError. (default=0.001)

Methods

fit(X[, y])

Fit transformer by fitting each feature with Logistic Regression and storing features with eval metrics higher than the max of existing features.

transform(X)

Transform X by creating new feature using pairs identified during fit.

fit(X, y=None)[source]

Fit transformer by fitting each feature with Logistic Regression and storing features with eval metrics higher than the max of existing features.

Parameters
  • X (pandas.DataFrame) – Input dataframe

  • y (pandas.Series) – Input Series for target variable

Returns

fitted object (self)

Return type

object

transform(X)[source]

Transform X by creating new feature using pairs identified during fit.

Parameters

X (pandas.DataFrame) – Input dataframe

Returns

Transformed input DataFrame

Return type

pandas.DataFrame

class preprocessing.RankedEvaluationMetricEncoding(metric='roc_auc')[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Encode categorical columns by first creating dummy variables, then fitting a LogisticRegression against the target variable for each dummy variable. An evaluation metric such as accuracy or roc_auc is calculated and ranked, and finally the categories are encoded with their rank. It is strongly recommended to apply DropHighCardinality or GroupRareCategory before using this encoding, as this encoder fits a 5-fold Logistic Regression for ALL categories.

Parameters

metric (string) – Metric used to evaluate features. Sklearn default metrics can be used. (default=’roc_auc’)

Methods

fit(X[, y])

Fit transformer by creating dummy variables and fitting LogisticRegression.

transform(X)

Transform X by replacing categories with their evaluation-metric rank

fit(X, y=None)[source]

Fit transformer by creating dummy variables and fitting LogisticRegression.

Parameters
  • X (pandas.DataFrame) – Input dataframe

  • y (pandas.Series) – Input Series for target variable

Returns

fitted object (self)

Return type

object

transform(X)[source]

Transform X by replacing categories with their evaluation-metric rank

Parameters

X (pandas.DataFrame) – Input dataframe

Returns

Transformed input DataFrame

Return type

pandas.DataFrame

class preprocessing.AppendClassificationModel(model=None, probability=False)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Append prediction from a model as a new feature. The model must implement fit and predict, and it should predict only a single label. If the model has a predict_proba method, the probability option can be used to append class probabilities instead of class labels. predict_proba must return the probability of class 0 in the first column and class 1 in the second column.

Parameters
  • model (object) – Any model in line with sklearn classification models, meaning it implements fit and predict. (default=None)

  • probability (bool) – Whether to append class probabilities instead of class labels. If True, the model must have predict_proba implemented. (default=False)

Methods

fit(X[, y])

Fit transformer by fitting model specified.

transform(X)

Transform X by predicting with fitted model.

fit(X, y=None)[source]

Fit transformer by fitting model specified.

Parameters
  • X (pandas.DataFrame) – Input dataframe

  • y (pandas.Series) – Input Series for target variable

Returns

fitted object (self)

Return type

object

transform(X)[source]

Transform X by predicting with fitted model.

Parameters

X (pandas.DataFrame) – Input dataframe

Returns

Transformed input DataFrame

Return type

pandas.DataFrame

class preprocessing.AppendEncoder(encoder=None)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Append encoders in the DataLiner module. Encoders in DataLiner normally replace categorical values in place, but by wrapping a DataLiner encoder with this class, the encoded result is appended as a new feature and the original categorical columns remain. Regardless of whether the wrapped encoder requires the target column, this class does.

Parameters

encoder (object) – A DataLiner encoder. (default=None)

Methods

fit(X[, y])

Fit transformer by fitting encoder specified

transform(X)

Transform X by appending encoded category

fit(X, y=None)[source]

Fit transformer by fitting encoder specified

Parameters
  • X (pandas.DataFrame) – Input dataframe

  • y (pandas.Series) – Input Series for target variable

Returns

fitted object (self)

Return type

object

transform(X)[source]

Transform X by appending encoded category

Parameters

X (pandas.DataFrame) – Input dataframe

Returns

Transformed input DataFrame

Return type

pandas.DataFrame

class preprocessing.AppendClusterTargetMean(n_clusters=8, random_state=1234)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Append cluster number obtained from kmeans++ clustering, then replace each cluster number with the cluster's target mean. For clustering, categorical variables need to be converted to numerical data.

Parameters
  • n_clusters (int) – Number of clusters (default=8)

  • random_state (int) – random_state for KMeans (default=1234)

Methods

fit(X[, y])

Fit KMeans Clustering and obtain target mean

transform(X)

Transform X by appending cluster mean

fit(X, y=None)[source]

Fit KMeans Clustering and obtain target mean

Parameters
  • X (pandas.DataFrame) – Input dataframe

  • y (pandas.Series) – Input Series for target variable

Returns

fitted object (self)

Return type

object

transform(X)[source]

Transform X by appending cluster mean

Parameters

X (pandas.DataFrame) – Input dataframe

Returns

Transformed input DataFrame

Return type

pandas.DataFrame

class preprocessing.PermutationImportanceTest(threshold=0.0001)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Conduct permutation importance tests on features and drop features that are not effective. It first fits the entire data, then randomly shuffles each feature's values and evaluates the metric for both cases. If shuffling a feature makes no difference to the evaluation, that feature is not effective for prediction.

Parameters

threshold (float) – Average difference in roc_auc between the original and shuffled datasets. The higher the value, the more features will be dropped. (default=0.0001)

Methods

fit(X[, y])

Conduct permutation importance test and store drop features.

transform(X)

Transform X by dropping the columns identified during fit.

fit(X, y=None)[source]

Conduct permutation importance test and store drop features.

Parameters
  • X (pandas.DataFrame) – Input dataframe

  • y (pandas.Series) – Input Series for target variable

Returns

fitted object (self)

Return type

object

transform(X)[source]

Transform X by dropping the columns identified during fit.

Parameters

X (pandas.DataFrame) – Input dataframe

Returns

Transformed input DataFrame

Return type

pandas.DataFrame

class preprocessing.UnionAppend(append_list=None)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Concatenate features extracted from the original input data by AppendXXX transformers in the DataLiner package. Normally, applying AppendXXX transformers in a pipeline processes the data in series, so a feature appended by one AppendXXX is used by the next. By wrapping a list of AppendXXX transformers with this class, they are processed in parallel, so each AppendXXX uses only the original input features.

Parameters

append_list (list) – List of AppendXXX transformers in the DataLiner package. (default=None)

Methods

fit(X[, y])

Fit transformer by verifying the AppendXXX transformers specified.

transform(X)

Transform X by transforming X in parallel and appending to original input data.

fit(X, y=None)[source]

Fit transformer by verifying the AppendXXX transformers specified.

Parameters
  • X (pandas.DataFrame) – Input dataframe

  • y (pandas.Series) – Ignored. (default=None)

Returns

fitted object (self)

Return type

object

transform(X)[source]

Transform X by transforming X in parallel and appending to original input data.

Parameters

X (pandas.DataFrame) – Input dataframe

Returns

Transformed input DataFrame

Return type

pandas.DataFrame

preprocessing.load_titanic()[source]

Load train and test data for the Titanic dataset.

Returns

train_features, train_target, test_features

Return type

pandas.DataFrame, pandas.Series, pandas.DataFrame