Tackling Missing Value in Dataset
Learn about the various types of missing values and how to treat them using different approaches to increase the efficacy of your model.
Nasima Tamboli — Published On October 29, 2021 and Last Modified On July 25th, 2022
This article was published as a part of the Data Science Blogathon
Missing values are common in many real-life datasets. They can bias the results of machine learning models and/or reduce the accuracy of the model. This article describes what missing data is, how it is represented, and the different reasons data goes missing. Along with the different categories of missing data, it also details different ways of handling missing values, with examples.
The following topics are covered in this guide:
What Is Missing Data (Missing Values)?
How Missing Data/Values Are Represented In The Dataset?
Why Is Data Missing From The Dataset?
Types Of Missing Values
Missing Completely At Random (MCAR)
Missing At Random (MAR)
Missing Not At Random (MNAR)
Why Do We Need To Care About Handling Missing Data?
How To Handle Missing Values?
Checking for missing values
Figure Out How To Handle The Missing Data
Deleting the Missing values
Deleting the Entire Row
Deleting the Entire Column
Imputing the Missing Value
Replacing With Arbitrary Value
Replacing With Mean
Replacing With Mode
Replacing With Median
Replacing with Previous Value – Forward Fill
Replacing with Next Value – Backward Fill
Imputing Missing Values For Categorical Features
Impute the Most Frequent Value
Impute the Value “missing”, which treats it as a Separate Category
Imputation of Missing Values using the scikit-learn library
Nearest Neighbors Imputations (KNNImputer)
Adding missing indicator to encode “missingness” as a feature
What is a Missing Value?
Missing data is defined as values that are not stored (or not present) for one or more variables in the given dataset.
Below is a sample of the missing data from the Titanic dataset. You can see the columns ‘Age’ and ‘Cabin’ have some missing values.
How is Missing Value Represented In The Dataset?
In the raw dataset, missing values appear as blanks.
In Pandas, missing values are usually represented by NaN, which stands for Not a Number.
The above image shows the first few records of the Titanic dataset extracted and displayed using Pandas.
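As a small illustration of how Pandas represents and detects missing values, here is a minimal sketch. The three-row sample is hypothetical (it is not the Titanic data), and the variable names are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row sample; None and np.nan are both treated as missing.
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Cabin": ["C85", None, None],
})

# NaN is a float that never compares equal to itself,
# so missing values are detected with isna() rather than ==.
nan_equals_itself = (np.nan == np.nan)

missing_mask = sample.isna()             # True where a value is missing
missing_per_column = missing_mask.sum()  # count of missing values per column
print(missing_per_column)
```

Because NaN does not compare equal to anything, a filter such as `sample["Age"] == np.nan` never matches; `isna()` (or its alias `isnull()`) is the usual check.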
Why Is Data Missing From The Dataset
There can be multiple reasons why certain values are missing from the data.
The reason the data is missing affects how it should be handled, so it is necessary to understand why the data could be missing.
Some of the reasons are listed below:
Past data might get corrupted due to improper maintenance.
Observations were not recorded for certain fields, for example because of a failure in recording the values due to human error.
The user has not provided the values intentionally.
Types Of Missing Value
Formally the missing values are categorized as follows:
Missing Completely At Random (MCAR)
In MCAR, the probability of data being missing is the same for all the observations.
In this case, there is no relationship between the missing data and any other values observed or unobserved (the data which is not recorded) within the given dataset.
That is, missing values are completely independent of other data. There is no pattern.
In the case of MCAR, the data could be missing due to human error, some system/equipment failure, loss of sample, or some unsatisfactory technicalities while recording the values.
For Example, suppose in a library there are some overdue books. Some values of overdue books in the computer system are missing. The reason might be a human error like the librarian forgot to type in the values. So, the missing values of overdue books are not related to any other variable/data in the system.
MCAR should not be assumed by default, as it is a rare case. The advantage of such data is that the statistical analysis remains unbiased.
Missing At Random (MAR)
Missing at random (MAR) means that the reason for missing values can be explained by variables on which you have complete information as there is some relationship between the missing data and other values/data.
In this case, the data is not missing for all the observations. It is missing only within sub-samples of the data and there is some pattern in the missing values.
For example, if you check the survey data, you may find that all the people have answered their ‘Gender’ but ‘Age’ values are mostly missing for people who have answered their ‘Gender’ as ‘female’. (The reason being most of the females don’t want to reveal their age.)
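One way to look for the kind of pattern described above is to compare the share of missing values across the groups of a fully observed column. The sketch below uses a hypothetical six-row survey with made-up values; a clear difference between groups is a hint (not a proof) that the data is not missing completely at random.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data, made up for illustration only.
survey = pd.DataFrame({
    "Gender": ["female", "female", "female", "male", "male", "male"],
    "Age": [np.nan, np.nan, 31.0, 45.0, 28.0, np.nan],
})

# Share of missing 'Age' values within each 'Gender' group.
missing_rate = survey["Age"].isna().groupby(survey["Gender"]).mean()
print(missing_rate)
```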
Python – Replace Missing Values with Mean, Median & Mode
October 3, 2021 by Ajitesh Kumar
Missing values are common in real-world problems where the data is aggregated over long time stretches from disparate sources, and reliable machine learning modeling demands careful handling of missing data. One strategy is imputing the missing values, and a wide variety of algorithms exist, spanning simple interpolation (mean, median, mode), matrix factorization methods like SVD, statistical models like Kalman filters, and deep learning methods. Missing value imputation techniques help machine learning models learn from incomplete data. Three basic imputation techniques use the mean, the median and the mode. The mean is the average of all values in a set, the median is the middle value when the values are sorted by size, and the mode is the value that occurs most often in the set.
In this blog post, you will learn how to impute or replace missing values with the mean, median and mode in one or more numeric feature columns of a Pandas DataFrame while building machine learning (ML) models with Python. You will also learn how to decide which of these central tendency measures to use for imputing the missing values of a feature column. Understanding this technique is important for data scientists, as handling missing values is one of the key aspects of data preprocessing when training ML models.
The dataset used for illustration is related to campus recruitment and is taken from the Kaggle page on Campus Recruitment. As a first step, the dataset is loaded. Here is the Python code for loading the dataset once you have downloaded it to your system.
import pandas as pd
import numpy as np

df = pd.read_csv("/Users/ajitesh/Downloads/Placement_Data_Full_Class.csv")
Here is what the data looks like. Make a note of the NaN values under the salary column.
Fig 1. Placement dataset for handling missing values using mean, median or mode
Missing values are handled using different imputation techniques, which estimate the missing values from the other training examples. In the above dataset, the missing values are found in the salary column. A command such as df.isnull().sum() prints the number of missing values in each column. The missing values in the salary column in the above example can be replaced using the following techniques:
Mean value of other salary values
Median value of other salary values
Mode (most frequent) value of other salary values.
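The check and the three replacements listed above can be sketched as follows. This is a minimal example on a hypothetical stand-in for the placement data: the salary values are made up, and only the column name `salary` comes from the text.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the placement data; the values are made up.
df = pd.DataFrame({"salary": [200000.0, 250000.0, np.nan, 250000.0, 940000.0, np.nan]})

# Count the missing values in each column.
print(df.isnull().sum())

# Candidate fill values, each computed from the observed salaries only.
mean_value = df["salary"].mean()
median_value = df["salary"].median()
mode_value = df["salary"].mode()[0]  # mode() returns a Series; take the first mode

# fillna() returns a new Series and leaves df unchanged.
filled_mean = df["salary"].fillna(mean_value)
filled_median = df["salary"].fillna(median_value)
filled_mode = df["salary"].fillna(mode_value)
```

Note that the single large salary pulls the mean well above the median and the mode, which matters when choosing between the three.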
In this post, fillna() method on the data frame is used for imputing missing values with mean, median, mode or constant value. However, you may also want to check out the related post titled imputing missing data using Sklearn SimpleImputer wherein sklearn.impute.SimpleImputer is used for missing values imputation using mean, median, mode, or constant value. The SimpleImputer class provides basic strategies for imputing missing values. Missing values can be imputed with a provided constant value, or using the statistics (mean, median, or most frequent) of each column in which the missing values are located. You may also want to check out the Scikit-learn article – Imputation of missing values.
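For comparison, here is a minimal sketch of the SimpleImputer approach mentioned above. It assumes scikit-learn is installed, and the single-column array is hypothetical data made up for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-column data; the values are made up.
X = np.array([[200000.0], [250000.0], [np.nan], [250000.0], [940000.0]])

# strategy can be "mean", "median", "most_frequent" or "constant".
imputer = SimpleImputer(missing_values=np.nan, strategy="median")
X_imputed = imputer.fit_transform(X)

# The value learned for each column is stored in statistics_.
print(imputer.statistics_)
print(X_imputed.ravel())
```

An imputer fitted this way can later be applied to new data with `transform()`, which reuses the statistic learned during `fit`.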
How to decide which imputation technique to use?
One of the key points is to decide which technique out of the above-mentioned imputation techniques to use to get the most effective value for the missing values. In this post, the central tendency measure such as mean, median, or mode is considered for imputation. The goal is to find out which is a better measure of the central tendency of data and use that value for replacing missing values appropriately.
Plots such as box plots and distribution plots come in very handy when deciding which technique to use. You can use the following code to draw box and distribution plots.
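import seaborn as sns
import matplotlib.pyplot as plt

# Box plot of the salary column (assumes df is the DataFrame loaded above)
sns.boxplot(x=df["salary"])
plt.show()

# Distribution plot of the salary column (histogram with a density curve)
sns.histplot(df["salary"], kde=True)
plt.show()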
Here is what the box plot would look like. You may note that the data is skewed. A large number of data points act as outliers. Outliers have a significant impact on the mean, so in such cases it is not recommended to use the mean for replacing the missing values, as doing so may not produce a good model. For a symmetric data distribution, one can use the mean value for imputing missing values.
Thus, one may want to use either median or mode. Here is a great page on understanding boxplots.
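The effect of outliers on the mean can also be checked numerically. The sketch below uses hypothetical salaries with one large outlier; the skewness threshold of 1 is an assumption chosen for illustration, not a fixed rule.

```python
import pandas as pd

# Hypothetical salaries with one large outlier; the values are made up.
salary = pd.Series([200000, 220000, 240000, 250000, 260000, 940000])

mean_value = salary.mean()      # pulled upwards by the outlier
median_value = salary.median()  # stays near the bulk of the data
skewness = salary.skew()        # positive for a right-skewed column

# Rule of thumb (an assumption for this sketch): prefer the median
# when the column is noticeably skewed, otherwise use the mean.
fill_value = median_value if abs(skewness) > 1 else mean_value
print(mean_value, median_value, fill_value)
```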
7 Ways to Handle Missing Values in Machine Learning
Popular strategies to handle missing values in the dataset
The real-world data often has a lot of missing values. The cause of missing values can be data corruption or failure to record data. The handling of missing data is very important during the preprocessing of the dataset as many machine learning algorithms do not support missing values.
This article covers 7 ways to handle missing values in the dataset:
Deleting Rows with missing values
Impute missing values for continuous variable
Impute missing values for categorical variable
Other Imputation Methods
Using Algorithms that support missing values
Prediction of missing values
Imputation using Deep Learning Library — Datawig
Data used is Titanic Dataset from Kaggle
import pandas as pd
data = pd.read_csv("train.csv")
(Image by Author), Visualization of Missing Values: white lines denote the presence of missing value
Delete Rows with Missing Values:
Missing values can be handled by deleting the rows or columns that have null values. If more than half of the rows in a column are null, the entire column can be dropped. Rows that have a null value in one or more columns can also be dropped.
(Image by Author) Left: Data with Null values, Right: Data after removal of Null values
Pros:
Training on data from which all missing values have been removed can give a robust model.
Cons:
Loss of a lot of information.
Works poorly if the percentage of missing values is excessive in comparison to the complete dataset.
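The deletion strategy described above can be sketched as follows. The sample is a hypothetical Titanic-like table with made-up values; the 50% threshold follows the text.

```python
import numpy as np
import pandas as pd

# Hypothetical Titanic-like sample; the values are made up.
data = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [None, None, None, "C123"],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# Drop columns in which more than half of the rows are null.
null_share = data.isnull().mean()
reduced = data.drop(columns=null_share[null_share > 0.5].index)

# Then drop any remaining row that still contains a null value.
cleaned = reduced.dropna(axis=0, how="any")
print(cleaned)
```

Dropping the mostly empty column first keeps more rows than calling `dropna()` on the original table, which here would leave only one row.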
Impute missing values with Mean/Median:
Missing values in columns that hold numeric continuous values can be replaced with the mean, median, or mode of the remaining values in the column. This method prevents the loss of data that the earlier method causes. Replacing missing values with these approximations (mean, median) is a statistical approach to handling them.
(Image by Author) Left: Age column before Imputation, Right: Age column after imputation by the mean value
The missing values are replaced by the mean value in the above example; in the same way, they can be replaced by the median value.
Pros:
Prevents the data loss that results from deleting rows or columns.
Works well with a small dataset and is easy to implement.
Cons:
Works only with numerical continuous variables.
Can cause data leakage if the statistic is computed on the full dataset before the train/test split.
Does not account for the covariance between features.
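One common way to limit the data leakage listed above is to compute the fill value on the training rows only and reuse it for the test rows. This is a minimal sketch with a hypothetical train/test split of an Age column; the values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical train/test split of an 'Age' column; the values are made up.
train = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 40.0]})
test = pd.DataFrame({"Age": [np.nan, 60.0]})

# Compute the statistic on the training rows only.
train_mean = train["Age"].mean()

# Reuse it for both sets, so no information from the test set is used.
train_filled = train.assign(Age=train["Age"].fillna(train_mean))
test_filled = test.assign(Age=test["Age"].fillna(train_mean))
print(test_filled)
```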
Imputation method for categorical columns:
When the missing values are in a categorical column (string or numerical), they can be replaced with the most frequent category. If the number of missing values is very large, they can instead be replaced with a new category.
(Image by Author) Left: Data before Imputation, Right: Cabin column after imputation by ‘U’
Pros:
Prevents the data loss that results from deleting rows or columns.
Works well with a small dataset and is easy to implement.
Negates the loss of data by adding a unique category.
Cons:
Works only with categorical variables.
Adds new features to the model while encoding, which may result in poor performance.
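Both categorical options can be sketched in a few lines. The mean and median are not defined for non-numeric categories, which is why the mode (most frequent value) is the statistic used here. The sample below is hypothetical; the column names Embarked and Cabin and the values are assumptions chosen to resemble the Titanic data.

```python
import pandas as pd

# Hypothetical categorical columns; the values are made up.
data = pd.DataFrame({
    "Embarked": ["S", "C", None, "S", "Q"],
    "Cabin": ["C85", None, None, None, "E46"],
})

# Few missing values: fill with the most frequent category (the mode).
most_frequent = data["Embarked"].mode()[0]
data["Embarked"] = data["Embarked"].fillna(most_frequent)

# Many missing values: fill with a new category, here 'U' as in the Cabin example.
data["Cabin"] = data["Cabin"].fillna("U")
print(data)
```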
Other Imputation Methods:
Depending on the nature of the data or data type, some other imputation methods may be more appropriate to impute missing values.
For example, for the data variable having longitudinal behavior, it might make sense to use the last valid observation to fill the missing value. This is known as the Last observation carried forward (LOCF) method.
For the time-series dataset variable, it makes sense to use the interpolation of the variable before and after a timestamp for a missing value.
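The two methods above can be sketched with Pandas as follows. The daily readings are hypothetical values made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings; the values are made up.
readings = pd.Series(
    [10.0, np.nan, np.nan, 16.0, 18.0],
    index=pd.date_range("2021-01-01", periods=5, freq="D"),
)

# Last observation carried forward: repeat the last valid value.
locf = readings.ffill()

# Linear interpolation between the values before and after each gap.
interpolated = readings.interpolate(method="linear")

print(locf.tolist())          # [10.0, 10.0, 10.0, 16.0, 18.0]
print(interpolated.tolist())  # [10.0, 12.0, 14.0, 16.0, 18.0]
```

With `method="linear"` the points are treated as equally spaced; for irregular timestamps, `method="time"` weights by the actual time gaps instead.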
Using Algorithms that support missing values:
Not all machine learning algorithms support missing values, but some ML algorithms are robust to missing values in the dataset. The k-NN algorithm can ignore a column in the distance measure when a value is missing. Naive Bayes can also support missing values when making a prediction. These algorithms can be used when the dataset contains null or missing values.
The sklearn implementations of naive Bayes and k-Nearest Neighbors in Python do not support the presence of the missing values.
Another algorithm that can be used here is RandomForest that works well on non-linear and categorical data. It adapts to the data structure taking into consideration the high variance or the bias, producing better results on large datasets.