# Which statistical method can we use to replace missing values in a categorical feature?


## 6 Different Ways to Compensate for Missing Values In a Dataset (Data Imputation with examples)


Popular strategies to statistically impute missing values in a dataset.

Many real-world datasets contain missing values for various reasons. They are often encoded as NaNs, blanks or other placeholders. Training a model on a dataset with many missing values can drastically reduce the machine learning model’s quality. Some algorithms, such as scikit-learn estimators, assume that all values are numerical and hold meaningful values.

One way to handle this problem is to drop the observations that have missing data. However, you risk losing data points with valuable information. A better strategy is to impute the missing values, that is, to infer them from the existing part of the data. There are three main types of missing data:

- Missing completely at random (MCAR)
- Missing at random (MAR)
- Not missing at random (NMAR)

In this article, however, I will focus on 6 popular ways to impute data for cross-sectional datasets (time-series datasets are a different story).

## 1- Do Nothing:

That’s an easy one. You just let the algorithm handle the missing data. Some algorithms can factor in the missing values and learn the best imputation values for the missing data based on the training loss reduction (e.g. XGBoost). Others have the option to just ignore them (e.g. LightGBM with use_missing=false). However, other algorithms will refuse to run and throw an error complaining about the missing values (e.g. scikit-learn’s LinearRegression). In that case, you will need to handle the missing data and clean it before feeding it to the algorithm.

Let’s see some other ways to impute the missing values before training:

**Note: All the examples below use the California Housing Dataset from Scikit-learn.**

## 2- Imputation Using (Mean/Median) Values:

This works by calculating the mean/median of the non-missing values in a column and then replacing the missing values within each column separately and independently from the others. It can only be used with numeric data.

Mean Imputation

**Pros:**

- Easy and fast.
- Works well with small numerical datasets.

**Cons:**

- Doesn’t factor the correlations between features. It only works on the column level.
- Will give poor results on encoded categorical features (do NOT use it on categorical features).
- Not very accurate.
- Doesn’t account for the uncertainty in the imputations.

Mean/Median Imputation
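As a minimal sketch of this approach, the snippet below uses scikit-learn’s `SimpleImputer` on a tiny hand-made array rather than the California Housing data, so the imputed values are easy to check by eye. The array and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Two numeric columns, each with one missing entry.
X = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [np.nan, 30.0],
    [9.0, 50.0],
])

# Each column is imputed independently from its own non-missing values.
mean_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_mean = mean_imputer.fit_transform(X)

median_imputer = SimpleImputer(missing_values=np.nan, strategy="median")
X_median = median_imputer.fit_transform(X)

print(X_mean)
print(X_median)
```

In the first column the observed values are 1, 2 and 9, so the mean fill is 4 while the median fill is 2. The gap between the two shows how much a single large value can pull the mean.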

## 3- Imputation Using (Most Frequent) or (Zero/Constant) Values:

**Most Frequent** is another statistical strategy to impute missing values and YES!! It works with categorical features (strings or numerical representations) by replacing missing data with the most frequent value within each column.

**Pros:**

- Works well with categorical features.

**Cons:**

- It also doesn’t factor the correlations between features.
- It can introduce bias in the data.

Most Frequent Imputation
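A small sketch of most-frequent imputation on string categories, again with `SimpleImputer`. The two toy columns (a colour and a size) are invented for illustration and are not part of the California Housing data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Object array of string categories; np.nan marks the missing entries.
X = np.array([
    ["red", "small"],
    ["blue", np.nan],
    [np.nan, "small"],
    ["red", "large"],
    [np.nan, "small"],
], dtype=object)

imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
X_mode = imputer.fit_transform(X)

# The per-column fill values learned during fit.
print(imputer.statistics_)
print(X_mode)
```

Every gap in a column receives the same value, which is why this strategy can inflate the share of the dominant category.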

**Zero or Constant** imputation, as the name suggests, replaces the missing values with either zero or any constant value you specify.
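A brief sketch of constant imputation, assuming `SimpleImputer` with `strategy="constant"`. The arrays and the `"Unknown"` label are illustrative choices, not values from the article.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Numeric data: fill the gaps with zero.
numeric = np.array([[1.0, np.nan], [np.nan, 4.0]])
zero_imputer = SimpleImputer(strategy="constant", fill_value=0.0)
zero_filled = zero_imputer.fit_transform(numeric)

# Categorical data: fill the gaps with an explicit placeholder category.
categorical = np.array([["red"], [np.nan], ["blue"]], dtype=object)
label_imputer = SimpleImputer(strategy="constant", fill_value="Unknown")
labelled = label_imputer.fit_transform(categorical)

print(zero_filled)
print(labelled.ravel())
```

For a categorical feature, a dedicated placeholder keeps "was missing" visible as its own category instead of blending those rows into an existing one.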

## 4- Imputation Using k-NN:

k nearest neighbours (k-NN) is a simple algorithm commonly used for classification. It uses ‘**feature similarity**’ to predict the values of any new data points. This means that the new point is assigned a value based on how closely it resembles the points in the training set. This can be very useful for predicting missing values: find the k closest neighbours of the observation with missing data, then impute the missing values from the non-missing values in the neighbourhood. Let’s see some example code using the Impyute library, which provides a simple and easy way to use KNN for imputation:

KNN Imputation for California Housing Dataset

## How does it work?

It first fills the gaps with a basic mean imputation, then uses the resulting complete data to construct a KDTree. Then, it uses the resulting KDTree to compute nearest neighbours (NN). After it finds the k-NNs, it takes the weighted average of them.
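The Impyute call itself is not shown here, so the sketch below uses scikit-learn’s `KNNImputer` as a substitute. Note that this is a different implementation from the one just described: as far as I know, `KNNImputer` measures distances with a NaN-aware Euclidean metric instead of mean-filling and building a KDTree. The toy array is an assumption chosen so the result can be verified by hand.

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each missing entry becomes the average of that feature over the
# 2 nearest rows that have the feature observed.
imputer = KNNImputer(n_neighbors=2, weights="uniform")
X_knn = imputer.fit_transform(X)

print(X_knn)
```

Passing `weights="distance"` instead makes closer neighbours count more, which is nearer in spirit to the weighted average described above.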

**Pros:**

- Can be much more accurate than the mean, median or most frequent imputation methods (It depends on the dataset).

**Cons:**

- Computationally expensive. KNN works by storing the whole training dataset in memory.
- K-NN is quite sensitive to outliers in the data (**unlike SVM**)

## 5- Imputation Using Multivariate Imputation by Chained Equation (MICE)

Main steps used in multiple imputations [1]

Source: **towardsdatascience.com**

## Statistical Imputation for Missing Values in Machine Learning


by Jason Brownlee on May 15, 2020 in Data Preparation

Last Updated on August 18, 2020

Datasets may have missing values, and this can cause problems for many machine learning algorithms.

As such, it is good practice to identify and replace missing values for each column in your input data prior to modeling your prediction task. This is called missing data imputation, or imputing for short.

A popular approach for data imputation is to calculate a statistical value for each column (such as a mean) and replace all missing values for that column with the statistic. It is a popular approach because the statistic is easy to calculate using the training dataset and because it often results in good performance.

In this tutorial, you will discover how to use statistical imputation strategies for missing data in machine learning.

After completing this tutorial, you will know:

- Missing values must be marked with NaN values and can be replaced with statistical measures calculated from the column of values.
- How to load a CSV file with missing values, mark the missing values with NaN values, and report the number and percentage of missing values for each column.
- How to impute missing values with statistics as a data preparation method when evaluating models and when fitting a final model to make predictions on new data.


Let’s get started.

**Updated Jun/2020**: Changed the column used for prediction in examples.


## Tutorial Overview

This tutorial is divided into three parts; they are:

1. Statistical Imputation
2. Horse Colic Dataset
3. Statistical Imputation With SimpleImputer
   1. SimpleImputer Data Transform
   2. SimpleImputer and Model Evaluation
   3. Comparing Different Imputed Statistics
   4. SimpleImputer Transform When Making a Prediction

## Statistical Imputation

A dataset may have missing values.

These are rows of data where one or more values or columns in that row are not present. The values may be missing completely or they may be marked with a special character or value, such as a question mark “?”.

These values can be expressed in many ways. I’ve seen them show up as nothing at all […], an empty string […], the explicit string NULL or undefined or N/A or NaN, and the number 0, among others. No matter how they appear in your dataset, knowing what to expect and checking to make sure the data matches that expectation will reduce problems as you start to use the data.

— Page 10, Bad Data Handbook, 2012.

Values could be missing for many reasons, often specific to the problem domain, and might include reasons such as corrupt measurements or data unavailability.

They may occur for a number of reasons, such as malfunctioning measurement equipment, changes in experimental design during data collection, and collation of several similar but not identical datasets.

— Page 63, Data Mining: Practical Machine Learning Tools and Techniques, 2016.

Most machine learning algorithms require numeric input values, and a value to be present for each row and column in a dataset. As such, missing values can cause problems for machine learning algorithms.

It is therefore common to identify missing values in a dataset and replace them with a numeric value. This is called data imputing, or missing data imputation.

A simple and popular approach to data imputation involves using statistical methods to estimate a value for a column from those values that are present, then replace all missing values in the column with the calculated statistic.

It is simple because statistics are fast to calculate and it is popular because it often proves very effective.

Common statistics calculated include:

- The column mean value.
- The column median value.
- The column mode value.
- A constant value.
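To make the statistics listed above concrete, here is a small pandas sketch that computes a mean, a median and a mode and uses each to fill one column. The frame, its column names and its values are invented for illustration; in a real project the statistics would be computed on the training split only and then reused on new data.

```python
import numpy as np
import pandas as pd

# Toy frame: two numeric columns and one categorical column, each with a gap.
df = pd.DataFrame({
    "age": [2.0, np.nan, 4.0, 9.0],
    "pulse": [60.0, 80.0, np.nan, 100.0],
    "pain": ["mild", np.nan, "mild", "severe"],
})

# One statistic per column, computed from the values that are present.
statistics = {
    "age": df["age"].mean(),            # mean of 2, 4, 9
    "pulse": df["pulse"].median(),      # median of 60, 80, 100
    "pain": df["pain"].mode().iloc[0],  # most frequent category
}

# Replace every missing entry in a column with that column's statistic.
filled = df.assign(**{
    column: df[column].fillna(value) for column, value in statistics.items()
})

print(filled)
```

The mode is the only one of the three statistics that is defined for a string column, which is why it is the usual choice for categorical features.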

Now that we are familiar with statistical methods for missing value imputation, let’s take a look at a dataset with missing values.


## Horse Colic Dataset

The horse colic dataset describes medical characteristics of horses with colic and whether they lived or died.

There are 300 rows and 26 input variables with one output variable. It is a binary classification prediction task that involves predicting 1 if the horse lived and 2 if the horse died.

There are many fields we could select to predict in this dataset. In this case, we will predict whether the problem was surgical or not (column index 23), making it a binary classification problem.

The dataset has numerous missing values for many of the columns where each missing value is marked with a question mark character (“?”).

Below is an example of rows from the dataset with marked missing values.
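The actual rows are not included here. As a stand-in, the following sketch loads a few made-up, headerless rows that use the same "?" convention, converts the markers to NaN while reading, and reports the number and percentage of missing values per column. The values are hypothetical and are not taken from the horse colic file.

```python
import io

import pandas as pd

# Hypothetical rows in the dataset's style: no header, "?" marks a missing value.
raw = io.StringIO(
    "1,?,38.5,66\n"
    "2,1,?,88\n"
    "1,1,39.1,?\n"
    "2,?,37.3,104\n"
)

# na_values tells pandas to treat "?" as NaN during parsing.
df = pd.read_csv(raw, header=None, na_values="?")

missing_count = df.isna().sum()
missing_percent = missing_count / len(df) * 100

for column in df.columns:
    print(f"column {column}: {missing_count[column]} missing "
          f"({missing_percent[column]:.1f}%)")
```

With a real file, `raw` would simply be replaced by the path to the CSV.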

Source: **machinelearningmastery.com**

## Handle missing values in Categorical Features


A useful guide to properly dealing with missing categorical data, with use cases.

This post shows how to deal with missing values in categorical features, comparing several approaches on the same data. We will use the Classified Ads for Cars dataset to predict the price of ads with a simple Linear Regression model.

To show the various strategies and their pros and cons, we will focus on one particular categorical feature of this dataset, the **maker**, which is the car brand name (Toyota, Kia, Ford, BMW, …).

## Post Steps:

1. **Show Raw Data**: let’s see what our dataset looks like.
2. **Deal with missing values in Categorical Features**: we will deal with missing values by comparing different techniques.
   1. **Delete** the entire column **maker**.
   2. **Replace** missing values with the most frequent values.
   3. **Delete** rows with null values.
   4. **Predict** values using a Classifier Algorithm (supervised or unsupervised).
3. **Conclusions!**
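Before turning to the real data, the four techniques listed above can be sketched on a toy frame. Everything in this example is a hypothetical miniature: the eight rows, the choice of `mileage` and `engine_power` as predictors, and the classifier settings are assumptions for illustration, not the author’s actual pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Invented miniature of the ads table; two rows have no maker.
df = pd.DataFrame({
    "maker": ["ford", "ford", "kia", np.nan, "kia", "ford", np.nan, "ford"],
    "mileage": [120.0, 110.0, 40.0, 115.0, 45.0, 125.0, 42.0, 118.0],
    "engine_power": [90.0, 95.0, 60.0, 92.0, 62.0, 88.0, 61.0, 91.0],
    "price_eur": [5000.0, 5200.0, 9000.0, 5100.0, 8800.0, 4900.0, 9100.0, 5050.0],
})

# 1. Delete the entire column.
df_no_maker = df.drop(columns="maker")

# 2. Replace missing values with the most frequent value.
most_frequent = df["maker"].mode().iloc[0]
df_mode = df.assign(maker=df["maker"].fillna(most_frequent))

# 3. Delete rows with null values.
df_dropped = df.dropna(subset=["maker"])

# 4. Predict the missing values with a classifier trained on the known rows.
#    The price is left out of the predictors here (an assumption), since it
#    is the target of the later regression.
features = ["mileage", "engine_power"]
known = df[df["maker"].notna()]
unknown = df[df["maker"].isna()]

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(known[features], known["maker"])

df_predicted = df.copy()
df_predicted.loc[unknown.index, "maker"] = clf.predict(unknown[features])

print(df_predicted["maker"].tolist())
```

Options 1 and 3 throw information away (a feature or some rows), option 2 keeps everything but labels every gap with the dominant brand, and option 4 tries to recover the brand from the other columns at the cost of training an extra model.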

## Show Raw Data

Let’s start by importing some libraries:

```python
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns

# Jupyter magic: render plots inline in the notebook
%matplotlib inline
```

First of all, let’s see what our dataset looks like:

```python
filename = "cars.csv"

dtypes = {
    "maker": str,  # brand name
    "model": str,
    "mileage": float,  # km
    "manufacture_year": float,
    "engine_displacement": float,
    "engine_power": float,
    "body_type": str,  # almost never present
    "color_slug": str,  # also almost never present
    "stk_year": str,
    "transmission": str,  # automatic or manual
    "door_count": str,
    "seat_count": str,
    "fuel_type": str,  # gasoline or diesel
    "date_created": str,  # when the ad was scraped
    "date_last_seen": str,  # when the ad was last seen
    "price_eur": float,  # list price converted to EUR
}

df_cleaned = pd.read_csv(filename, dtype=dtypes)
print(f"Raw data has {df_cleaned.shape[0]} rows, and {df_cleaned.shape[1]} columns")
```

```
Raw data has 3552912 rows, and 16 columns
```

After removing missing data and unhelpful features from every column except **maker** (the whole procedure is shown on my GitHub), we find ourselves in this situation:

```python
# Missing values
print(df_cleaned.isna().sum())
```

```
maker                  212897
mileage                     0
manufacture_year            0
engine_displacement         0
engine_power                0
price_eur                   0
fuel_type_diesel            0
fuel_type_gasoline          0
ad_duration                 0
seat_str_large              0
seat_str_medium             0
seat_str_small              0
transmission_auto           0
transmission_man            0
dtype: int64
```

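## Correlation Matrix

```python
# numeric_only=True skips the string column "maker"; recent pandas versions
# raise an error if non-numeric columns are passed to corr().
corr = df_cleaned.corr(numeric_only=True)

plt.subplots(figsize=(15, 10))
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns, annot=True)
```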

## Deal with missing values in Categorical Features

Now we just have to handle the **maker** feature, and we will do it in four different ways. Then we will create a simple Linear Regression model for each way to predict the price.

- **1st Model**: Delete the entire column **maker**.
- **2nd Model**: Replace missing values with the most frequent values.
- **3rd Model**: Delete rows with null values.
- **4th Model**: Predict the missing values with the RandomForestClassifier.

```python
mse_list = []
r2_score_list = []


def remove_outliers(dataframe):
    '''
    return a dataframe without rows that are outliers in any column
    '''
    return dataframe\
        .loc[:, lambda df: df.std() > 0.04]\
        .loc[lambda df: (np.abs(stats.zscore(df)) < 3).all(axis=1)]


def plot_regression(Y_test, Y_pred):
    '''
    method that plot a linear regression line on a scatter plot
    '''
    x = Y_test
    y = Y_pred
    plt.xlabel("True label")
    plt.ylabel("Predicted label")
    plt.plot(x, y, 'o')
    m, b = np.polyfit(x, y, 1)
    plt.plot(x, m*x + b)


def train_and_score_regression(df):
    df_new = remove_outliers(df)

    # split the df
    X = df_new.drop("price_eur", axis=1).values
    Y = np.log1p(df_new["price_eur"].values)
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y,
                                                        test_size=0.1, random_state=0)

    # train and test of the model
    ll = LinearRegression()
    ll.fit(X_train, Y_train)
    Y_pred = ll.predict(X_test)
    mse_list.append(mean_squared_error(Y_test, Y_pred))
    r2_score_list.append(r2_score(Y_test, Y_pred))

    # print the metrics
    print("MSE: "+str(mean_squared_error(Y_test, Y_pred)))
```
