preprocessing module
A data preprocessing package for data preprocessing and feature engineering. This library contains preprocessing methods used during data analysis and the machine learning process.
Classes
DropColumns – Simply delete columns specified from the input dataframe.
DropNoVariance – Delete columns which only have a single unique value.
DropHighCardinality – Delete columns with high cardinality.
DropLowAUC – Delete columns that carry little information for predicting the target variable.
DropHighCorrelation – Delete features that are highly correlated to each other.
ImputeNaN – Look for NaN values in the dataframe and impute them by a strategy such as mean, median, or mode.
OneHotEncoding – One-hot encoding of categorical variables.
BinarizeNaN – Find columns with missing values and create a new column indicating whether a value was missing (0) or not (1).
CountRowNaN – Calculate the total number of NaN in each row and create a new column to store the total.
ClipData – Clip datasets by replacing values above the upper bound with the upper bound and values below the lower bound with the lower bound.
GroupRareCategory – Replace rare categories in categorical columns with a dummy string.
TargetMeanEncoding – Target mean encoding of categorical variables.
StandardScaling – Standardize datasets to have mean = 0 and std = 1.
MinMaxScaling – Rescale the fit data into the range between 0 and 1.
CountEncoding – Encode categorical variables by the count of each category within the categorical column.
RankedCountEncoding – Encode categorical variables by the rank of category counts within the categorical column.
FrequencyEncoding – Encode categorical variables by the frequency of each category within the categorical column.
RankedTargetMeanEncoding – Ranking with target mean encoding of categorical variables.
AppendAnomalyScore – Append the anomaly score calculated from an Isolation Forest.
AppendCluster – Append the cluster number obtained from k-means++ clustering.
AppendClusterDistance – Append the cluster distances obtained from k-means++ clustering.
AppendPrincipalComponent – Append principal components obtained from PCA.
AppendArithmeticFeatures – Recognize all numerical features and create new features by arithmetic operations.
RankedEvaluationMetricEncoding – Encode categorical columns by the rank of an evaluation metric from a Logistic Regression fitted on each dummy variable.
AppendClassificationModel – Append predictions from a model as a new feature.
AppendEncoder – Append the output of encoders in the DataLiner module as new features.
AppendClusterTargetMean – Append the cluster number obtained from k-means++ clustering, replaced by the target mean.
PermutationImportanceTest – Conduct permutation importance tests on features and drop features that are not effective.
UnionAppend – Concatenate features extracted from the original input data by AppendXXX classes in the DataLiner package.
Functions
Load train and test data for the Titanic dataset.
class preprocessing.DropColumns(drop_columns=None)
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
Simply delete columns specified from the input dataframe.
- Parameters
drop_columns (list) – List of feature names which will be dropped from the input dataframe. For a single column, a string can also be used. (default=None)
Methods
fit(X[, y]) – Fit transformer by checking X is a pandas DataFrame.
transform(X) – Transform X by dropping columns specified in drop_columns.
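Since every class on this page follows the sklearn BaseEstimator/TransformerMixin contract, the fit/transform behavior can be sketched without DataLiner installed. The class below is an illustrative re-implementation of the contract, not the DataLiner source:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

# Minimal sketch of the DropColumns contract: fit validates the input,
# transform drops the configured columns (a string is accepted for one column).
class DropColumnsSketch(BaseEstimator, TransformerMixin):
    def __init__(self, drop_columns=None):
        self.drop_columns = drop_columns

    def fit(self, X, y=None):
        assert isinstance(X, pd.DataFrame), "X must be a pandas DataFrame"
        return self

    def transform(self, X):
        cols = [self.drop_columns] if isinstance(self.drop_columns, str) else self.drop_columns
        return X.drop(columns=cols or [])

X = pd.DataFrame({"Age": [22, 38], "Cabin": ["C85", None]})
out = DropColumnsSketch(drop_columns="Cabin").fit_transform(X)
print(list(out.columns))  # ['Age']
```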
class preprocessing.DropNoVariance
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
Delete columns which only have a single unique value.
Methods
fit(X[, y]) – Fit transformer by identifying columns with a single unique value.
transform(X) – Transform X by dropping columns specified in drop_columns.
class preprocessing.DropHighCardinality(max_categories=50)
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
Delete columns with high cardinality, i.e. columns with too many categories.
- Parameters
max_categories (int) – Maximum number of categories permitted in a column. If the number of categories in a column exceeds this value, that column will be deleted. (default=50)
Methods
fit(X[, y]) – Fit transformer by identifying columns with high cardinality.
transform(X) – Transform X by dropping columns specified in drop_columns.
class preprocessing.DropLowAUC(threshold=0.51)
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
Delete columns that carry little information for predicting the target variable. This class calculates roc_auc by fitting each feature in the input array one by one against the target feature using Logistic Regression, and drops features with roc_auc below the specified threshold. Missing numerical values are replaced by the mean; categorical features are converted to dummy variables by one-hot encoding, with missing values filled with the mode.
- Parameters
threshold (float) – Threshold value for roc_auc. Features with roc_auc below this value will be deleted. (default=0.51)
Methods
fit(X[, y]) – Fit transformer by fitting each feature with Logistic Regression and storing features with roc_auc below the threshold.
transform(X) – Transform X by dropping columns specified in drop_columns.
class preprocessing.DropHighCorrelation(threshold=0.95)
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
Delete features that are highly correlated to each other. From each highly correlated pair within X, the feature better correlated with the target variable is kept.
- Parameters
threshold (float) – Threshold value for Pearson's correlation coefficient. (default=0.95)
Methods
fit(X[, y]) – Fit transformer by identifying highly correlated variable pairs and dropping the one less correlated to the target variable.
transform(X) – Transform X by dropping columns specified in drop_columns.

fit(X, y=None)
Fit transformer by identifying highly correlated variable pairs and dropping the one that is less correlated to the target variable. Missing values will be imputed by the mean.
- Parameters
X (pandas.DataFrame) – Input dataframe
y (pandas.Series) – Input Series for the target variable, used to choose which feature of a correlated pair to keep. (default=None)
- Returns
fitted object (self)
- Return type
object
class preprocessing.ImputeNaN(cat_strategy='mode', num_strategy='mean')
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
Look for NaN values in the dataframe and impute them by a strategy such as mean, median, or mode.
- Parameters
cat_strategy (string) – Strategy for imputing NaN in categorical columns. If any string other than mode is specified, NaN will be imputed with the fixed string ImputedNaN. (default='mode')
num_strategy (string) – Strategy for imputing NaN in numerical columns. Either mean, median, or mode can be specified; if any other string is specified, mean imputation will be employed. (default='mean')
Methods
fit(X[, y]) – Fit transformer by identifying numerical and categorical columns.
transform(X) – Transform X by imputing with values obtained during the fitting stage.

fit(X, y=None)
Fit transformer by identifying numerical and categorical columns. Then, based on the strategy, fit will store the values used to impute NaN in each column.
- Parameters
X (pandas.DataFrame) – Input dataframe
y (pandas.Series) – Ignored. (default=None)
- Returns
fitted object (self)
- Return type
object
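As a sketch of the default behavior (num_strategy='mean', cat_strategy='mode'), the equivalent imputation in plain pandas (DataLiner itself is not required):

```python
import pandas as pd

# Toy frame with one numerical and one categorical NaN.
X = pd.DataFrame({"Age": [20.0, None, 40.0], "Sex": ["m", None, "m"]})

num_cols = X.select_dtypes("number").columns
cat_cols = X.columns.difference(num_cols)

# num_strategy='mean': fill numerical NaN with the column mean.
X[num_cols] = X[num_cols].fillna(X[num_cols].mean())
# cat_strategy='mode': fill categorical NaN with the most frequent value.
X[cat_cols] = X[cat_cols].fillna(X[cat_cols].mode().iloc[0])

print(X["Age"].tolist())  # [20.0, 30.0, 40.0]
```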
class preprocessing.OneHotEncoding(drop_first=True)
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
One-hot encoding of categorical variables.
- Parameters
drop_first (bool) – Whether to drop the first column after one-hot encoding in order to avoid multicollinearity. (default=True)
Methods
fit(X[, y]) – Fit transformer by getting column names after one-hot encoding.
transform(X) – Transform X by one-hot encoding.

fit(X, y=None)
Fit transformer by getting column names after one-hot encoding.
- Parameters
X (pandas.DataFrame) – Input dataframe
y (pandas.Series) – Ignored. (default=None)
- Returns
fitted object (self)
- Return type
object

transform(X)
Transform X by one-hot encoding. Columns that did not appear during the fitting stage are dropped from the input, and columns that appeared during fitting but are missing from the input are imputed as dummy columns with all values = 0.
- Parameters
X (pandas.DataFrame) – Input dataframe
- Returns
Transformed input DataFrame
- Return type
pandas.DataFrame
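The transform contract above can be sketched with pd.get_dummies plus a reindex; the dummy column names here are what pandas generates, not necessarily DataLiner's:

```python
import pandas as pd

# fit stage: learn the dummy columns (drop_first=True drops the first category).
fit_X = pd.DataFrame({"Embarked": ["S", "C", "Q"]})
fit_cols = pd.get_dummies(fit_X, drop_first=True).columns

# transform stage: 'X' was never seen during fit, 'Q' and 'C' are absent now.
new_X = pd.DataFrame({"Embarked": ["S", "X"]})
# reindex enforces the fitted columns: unseen dummies are dropped,
# missing ones are re-added as all-zero columns.
out = pd.get_dummies(new_X).reindex(columns=fit_cols, fill_value=0)
print(list(out.columns))  # ['Embarked_Q', 'Embarked_S']
```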
class preprocessing.BinarizeNaN
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
Find columns with missing values, and create a new column indicating whether a value was missing (0) or not (1).
Methods
fit(X[, y]) – Fit transformer by getting the names of columns that contain NaN.
transform(X) – Transform by checking for columns containing NaN both during the fitting and transforming stages, then binarizing NaN into a new column.
class preprocessing.CountRowNaN
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
Calculate the total number of NaN in each row and create a new column to store the total.
Methods
fit(X[, y]) – Fit transformer by getting column names during fit.
transform(X) – Transform by counting NaN across the columns that exist in both the fitting and transforming stages.
class preprocessing.ClipData(threshold=0.99)
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
Clip datasets by replacing values larger than the upper bound with the upper bound and values lower than the lower bound with the lower bound. Missing values will be ignored.
- Parameters
threshold (float) – Threshold value used to define the upper and lower bounds. For example, 0.99 implies an upper bound at the 99th percentile and a lower bound at the 1st percentile. (default=0.99)
Methods
fit(X[, y]) – Fit transformer to get the upper and lower bounds for numerical columns.
transform(X) – Transform by clipping numerical data.
class preprocessing.GroupRareCategory(threshold=0.01)
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
Replace rare categories that appear in categorical columns with a dummy string.
- Parameters
threshold (float) – Threshold value for defining a "rare" category. For example, 0.01 implies that categories covering less than 1% of the total number of rows are treated as "rare". (default=0.01)
Methods
fit(X[, y]) – Fit transformer to define and store the rare categories to be replaced.
transform(X) – Transform by replacing rare categories with a dummy string.
class preprocessing.TargetMeanEncoding(k=0, f=1, smoothing=True)
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
Target mean encoding of categorical variables. Missing values will be treated as one of the categories.
- Parameters
k (float) – Hyperparameter for the sigmoid function. (default=0.0)
f (float) – Hyperparameter for the sigmoid function. (default=1.0)
smoothing (bool) – Whether to smooth the target mean with the global mean using a sigmoid function. smoothing=False is not recommended. (default=True)
Methods
fit(X[, y]) – Fit transformer to compute and store the smoothed target mean for each categorical variable.
transform(X) – Transform by replacing categories with the smoothed target mean.
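The k and f hyperparameters suggest the standard sigmoid-smoothing formulation, weight = 1 / (1 + exp(-(n - k) / f)) with n the category count; assuming that formulation (DataLiner's exact formula is not shown here), the encoding can be sketched as:

```python
import numpy as np
import pandas as pd

X = pd.DataFrame({"Sex": ["m", "m", "f", "m"]})
y = pd.Series([1, 0, 1, 1])

k, f = 0.0, 1.0
global_mean = y.mean()                              # 0.75
stats = y.groupby(X["Sex"]).agg(["mean", "count"])  # per-category mean and n

# Sigmoid weight: categories with more observations trust their own mean more.
weight = 1 / (1 + np.exp(-(stats["count"] - k) / f))
smoothed = weight * stats["mean"] + (1 - weight) * global_mean
encoded = X["Sex"].map(smoothed)
```

Rare categories are pulled toward the global mean, which is why smoothing=False (raw per-category means) is discouraged.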
class preprocessing.StandardScaling
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
Standardize datasets to have mean = 0 and std = 1. Note this will only standardize numerical data and will ignore missing values during computation.
Methods
fit(X[, y]) – Fit transformer to get the mean and std of each numerical feature.
transform(X) – Transform by subtracting the mean and dividing by the std.
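In plain pandas the same standardization looks like this (numerical columns only, NaN ignored); note that pandas' std() uses the sample estimator, and whether the library does likewise is an assumption:

```python
import pandas as pd

X = pd.DataFrame({"Fare": [10.0, 20.0, None, 30.0], "Name": ["a", "b", "c", "d"]})

num = X.select_dtypes("number").columns           # only numerical columns
X[num] = (X[num] - X[num].mean()) / X[num].std()  # NaN ignored by mean()/std()

print(X["Fare"].dropna().tolist())  # [-1.0, 0.0, 1.0]
```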
class preprocessing.MinMaxScaling
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
Rescale the fit data into the range between 0 and 1. Note this will only rescale numerical data and will ignore missing values during computation. If the transform data contains values larger or smaller than those in the fit data, the rescaled values will fall above 1 or below 0.
Methods
fit(X[, y]) – Fit transformer to get the min and max values of each numerical feature.
transform(X) – Transform by subtracting the min and dividing by (max - min).
class preprocessing.CountEncoding
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
Encode categorical variables by the count of each category within the categorical column.
Methods
fit(X[, y]) – Fit transformer to identify categorical variables and obtain the occurrence count of each category.
transform(X) – Transform by replacing categories with counts.
class preprocessing.RankedCountEncoding
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
First encode categorical variables by the count of each category within the categorical column. Counts are then ranked in descending order, and the ranks are used to encode the category columns. Even when categories share the same count, the ranking is based on the index, so the categories remain distinguished. RankedFrequencyEncoding is not provided, as its result would be identical to this class.
Methods
fit(X[, y]) – Fit transformer to identify categorical variables and obtain the ranking of category counts.
transform(X) – Transform by replacing categories with ranks.
class preprocessing.FrequencyEncoding
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
Encode categorical variables by the frequency of each category within the categorical column.
Methods
fit(X[, y]) – Fit transformer to identify categorical variables and obtain the frequency of each category.
transform(X) – Transform by replacing categories with frequencies.
class preprocessing.RankedTargetMeanEncoding(k=0, f=1, smoothing=True)
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
Ranking with target mean encoding of categorical variables. Missing values will be treated as one of the categories. Categories with the same target mean are treated separately, since the rank is obtained from the index once sorted by target mean.
- Parameters
k (float) – Hyperparameter for the sigmoid function. (default=0.0)
f (float) – Hyperparameter for the sigmoid function. (default=1.0)
smoothing (bool) – Whether to smooth the target mean with the global mean using a sigmoid function. smoothing=False is not recommended. (default=True)
Methods
fit(X[, y]) – Fit transformer to compute and store the smoothed target mean for each categorical variable.
transform(X) – Transform by replacing categories with ranks.

fit(X, y=None)
Fit transformer to compute and store the smoothed target mean for each categorical variable. A ranking is then created based on the target mean.
- Parameters
X (pandas.DataFrame) – Input dataframe
y (pandas.Series) – Input Series for target variable
- Returns
fitted object (self)
- Return type
object
class preprocessing.AppendAnomalyScore(n_estimators=100, random_state=1234)
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
Append the anomaly score calculated from an Isolation Forest. Since the Isolation Forest needs to be fitted, categorical columns must first be encoded to numerical values.
- Parameters
n_estimators (int) – Number of base estimators in the Isolation Forest ensemble. (default=100)
random_state (int) – random_state for the Isolation Forest. (default=1234)
Methods
fit(X[, y]) – Fit the Isolation Forest.
transform(X) – Transform X by appending the anomaly score.
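A sketch using sklearn's IsolationForest directly; the appended column name is illustrative, not necessarily the one DataLiner uses:

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

# All-numerical toy data (categorical columns would need encoding first).
X = pd.DataFrame({"Age": [22, 38, 26, 35, 300],
                  "Fare": [7.2, 71.3, 7.9, 53.1, 8.0]})

iso = IsolationForest(n_estimators=100, random_state=1234).fit(X)
# decision_function: lower scores mean more anomalous observations.
X["Anomaly_Score"] = iso.decision_function(X)
```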
class preprocessing.AppendCluster(n_clusters=8, random_state=1234)
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
Append the cluster number obtained from k-means++ clustering. For clustering, categorical variables need to be converted to numerical data.
- Parameters
n_clusters (int) – Number of clusters. (default=8)
random_state (int) – random_state for KMeans. (default=1234)
Methods
fit(X[, y]) – Fit KMeans clustering.
transform(X) – Transform X by appending the cluster number.
class preprocessing.AppendClusterDistance(n_clusters=8, random_state=1234)
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
Append the cluster distances obtained from k-means++ clustering. For clustering, categorical variables need to be converted to numerical data.
- Parameters
n_clusters (int) – Number of clusters. (default=8)
random_state (int) – random_state for KMeans. (default=1234)
Methods
fit(X[, y]) – Fit KMeans clustering.
transform(X) – Transform X by appending the cluster distances.
class preprocessing.AppendPrincipalComponent(n_components=5, random_state=1234)
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
Append principal components obtained from PCA. For PCA, categorical variables need to be converted to numerical data. Data should also be standardized beforehand.
- Parameters
n_components (int) – Number of principal components. (default=5)
random_state (int) – random_state for PCA. (default=1234)
Methods
fit(X[, y]) – Fit PCA.
transform(X) – Transform X by appending principal components.
class preprocessing.AppendArithmeticFeatures(max_features=50, metric='roc_auc', operation='multiply', replace_zero=0.001)
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
A transformer which recognizes all numerical features and creates new features by arithmetic operations. Newly created features are evaluated individually by fitting Logistic Regression against the target variable, and only new features with a higher evaluation metric than their source feature pairs are added to the data. Missing values need to be imputed beforehand.
- Parameters
max_features (int) – Number of numerical features for which combinations are tested. If the number of numerical features in the data exceeds this value, the transformer will raise an exception. (default=50)
metric (string) – Metric for evaluating features. Sklearn's built-in metrics can be used. (default='roc_auc')
operation (string) – Type of arithmetic operation. 'add', 'subtract', 'multiply', or 'divide' can be used. (default='multiply')
replace_zero (float) – Value to replace 0 with when operation='divide'. Do not use 0, as it may cause a ZeroDivisionError. (default=0.001)
Methods
fit(X[, y]) – Fit transformer by fitting each feature with Logistic Regression and storing feature pairs with evaluation metrics higher than the max of the existing features.
transform(X) – Transform X by creating new features using the pairs identified during fit.

fit(X, y=None)
Fit transformer by fitting each feature with Logistic Regression and storing feature pairs with evaluation metrics higher than the max of the existing features.
- Parameters
X (pandas.DataFrame) – Input dataframe
y (pandas.Series) – Input Series for target variable
- Returns
fitted object (self)
- Return type
object
class preprocessing.RankedEvaluationMetricEncoding(metric='roc_auc')
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
Encode categorical columns by first creating dummy variables, then fitting a LogisticRegression against the target variable for each dummy variable. An evaluation metric such as accuracy or roc_auc is calculated and ranked, and categories are finally encoded with their rank. It is strongly recommended to apply DropHighCardinality or GroupRareCategory before using this encoding, as this encoder fits a Logistic Regression with 5-fold cross-validation for ALL categories.
- Parameters
metric (string) – Metric for evaluating features. Sklearn's built-in metrics can be used. (default='roc_auc')
Methods
fit(X[, y]) – Fit transformer by creating dummy variables and fitting LogisticRegression.
transform(X) – Transform X by replacing categories with their evaluation metric rank.
class preprocessing.AppendClassificationModel(model=None, probability=False)
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
Append predictions from a model as a new feature. The model must have fit and predict methods and should predict only a single label. If the model has a predict_proba method, the probability option can be used to append the class probability instead of class labels. predict_proba must return the class probability for 0 as the first column and 1 as the second column.
- Parameters
model (object) – Any model in line with sklearn classification models, meaning it implements fit and predict. (default=None)
probability (bool) – Whether to append the class probability instead of class labels. If True, the model must have a predict_proba method implemented. (default=False)
Methods
fit(X[, y]) – Fit transformer by fitting the specified model.
transform(X) – Transform X by predicting with the fitted model.
class preprocessing.AppendEncoder(encoder=None)
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
Append the output of encoders in the DataLiner module. Encoders in DataLiner normally replace categorical values in place, but by wrapping a DataLiner encoder with this class, the encoded results are appended as new features and the original categorical columns remain. This class requires the target column regardless of whether the wrapped encoder does.
- Parameters
encoder (object) – A DataLiner encoder. (default=None)
Methods
fit(X[, y]) – Fit transformer by fitting the specified encoder.
transform(X) – Transform X by appending the encoded categories.
class preprocessing.AppendClusterTargetMean(n_clusters=8, random_state=1234)
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
Append the cluster number obtained from k-means++ clustering, with each cluster number replaced by the target mean of that cluster. For clustering, categorical variables need to be converted to numerical data.
- Parameters
n_clusters (int) – Number of clusters. (default=8)
random_state (int) – random_state for KMeans. (default=1234)
Methods
fit(X[, y]) – Fit KMeans clustering and obtain the target mean of each cluster.
transform(X) – Transform X by appending the cluster target mean.
class preprocessing.PermutationImportanceTest(threshold=0.0001)
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
Conduct permutation importance tests on features and drop features that are not effective. It first fits the entire data, then randomly shuffles each feature's data in turn and evaluates the metric for both cases. If the shuffled case shows no difference in the evaluation, the feature is not effective for prediction.
- Parameters
threshold (float) – Required average difference in roc_auc between the original and shuffled datasets. The higher the value, the more features will be dropped. (default=0.0001)
Methods
fit(X[, y]) – Conduct the permutation importance test and store the features to drop.
transform(X) – Transform X by dropping columns specified in drop_columns.
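The same idea can be sketched with sklearn's permutation_importance as a stand-in; DataLiner's internals may differ:

```python
import numpy as np
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = pd.DataFrame({"signal": rng.normal(size=200), "noise": rng.normal(size=200)})
y = (X["signal"] > 0).astype(int)  # target depends only on 'signal'

model = LogisticRegression().fit(X, y)
result = permutation_importance(model, X, y, scoring="roc_auc", random_state=0)

# Keep features whose shuffled-vs-original roc_auc drop exceeds the threshold.
threshold = 0.0001
keep = [c for c, imp in zip(X.columns, result.importances_mean) if imp > threshold]
print(keep)
```

Shuffling 'signal' destroys almost all predictive power, so it easily clears the threshold; a pure-noise feature hovers near zero importance.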
class preprocessing.UnionAppend(append_list=None)
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
Concatenate features extracted from the original input data by AppendXXX classes in the DataLiner package. Normally, applying AppendXXX steps in a pipeline processes the data in series, so each appended feature is also used by the next AppendXXX. By wrapping a list of AppendXXX transformers with this class, the append steps are processed in parallel, so each AppendXXX class uses only the original input features.
- Parameters
append_list (list) – List of AppendXXX transformers in the DataLiner package. (default=None)
Methods
fit(X[, y]) – Fit transformer by verifying the AppendXXX transformers specified.
transform(X) – Transform X by transforming X in parallel and appending the results to the original input data.
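The parallel-versus-series distinction can be sketched directly with pandas and sklearn; the appended column names are illustrative, not necessarily DataLiner's:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X = pd.DataFrame({"a": [0.0, 1.0, 2.0, 10.0], "b": [0.0, 1.0, 2.0, 10.0]})

# Each step sees only the ORIGINAL X (parallel semantics); in a plain series
# pipeline, the PCA step would also see the appended cluster column.
cluster = KMeans(n_clusters=2, n_init=10, random_state=1234).fit_predict(X)
pc1 = PCA(n_components=1, random_state=1234).fit_transform(X)[:, 0]

# Concatenate the original features with both appended results at the end.
out = pd.concat(
    [X, pd.DataFrame({"Cluster_Number": cluster, "Principal_Component_0": pc1})],
    axis=1,
)
print(list(out.columns))
```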