# Which statistical method can we use for replacing missing values for a categorical feature? (1 point) Mean, median, mode, or all of the above

### Mohammed

Guys, does anyone know the answer?


## A Complete Tutorial which teaches Data Exploration in detail

Tutorial on data exploration that comprises missing value imputation, outliers, feature engineering, variable creation in data science and machine learning

**Overview**

A complete tutorial on data exploration (EDA)

We cover several data exploration aspects, including missing value imputation, outlier removal and the art of feature engineering

## Introduction

There are no shortcuts in data exploration. If you believe that machine learning can sail you away from every data storm, trust me, it won’t. At some point, you’ll realize that you are struggling to improve your model’s accuracy. In such situations, data exploration techniques will come to your rescue.

I can confidently say this, because I’ve been through such situations, a lot.

I have been a Business Analytics professional for close to three years now. In my initial days, one of my mentors suggested that I spend significant time on exploring and analyzing data. Following his advice has served me well.

I’ve created this tutorial to help you understand the underlying techniques of data exploration. As always, I’ve tried my best to explain these concepts in the simplest manner. For better understanding, I’ve taken up a few examples to demonstrate the complicated concepts.

## Table of Contents

**Steps of Data Exploration and Preparation**

**Missing Value Treatment**

Why is missing value treatment required?

Why does data have missing values?

What are the methods to treat missing values?

**Techniques of Outlier Detection and Treatment**

What is an outlier?

What are the types of outliers?

What are the causes of outliers?

What is the impact of outliers on a dataset?

How to detect outliers?

How to remove outliers?

**The Art of Feature Engineering**

What is feature engineering?

What is the process of feature engineering?

What is variable transformation?

When should we use variable transformation?

What are the common methods of variable transformation?

What is feature variable creation and what are its benefits?

### Let’s get started.

## 1. Steps of Data Exploration and Preparation

Remember that the quality of your inputs decides the quality of your output. So, once you have your business hypothesis ready, it makes sense to spend a lot of time and effort here. By my personal estimate, data exploration, cleaning and preparation can take up to 70% of your total project time.

Below are the steps involved to understand, clean and prepare your data for building your predictive model:

1. Variable Identification
2. Univariate Analysis
3. Bi-variate Analysis
4. Missing values treatment
5. Outlier treatment
6. Variable transformation
7. Variable creation

Finally, we will need to iterate over steps 4 – 7 multiple times before we come up with our refined model.

Let’s now study each stage in detail:

### Variable Identification

First, identify **Predictor** (Input) and **Target** (output) variables. Next, identify the data type and category of the variables.

Let’s understand this step more clearly by taking an example.

Example: Suppose we want to predict whether students will play cricket or not (refer to the data set below). Here you need to identify the predictor variables, the target variable, the data type of each variable and the category of each variable.

Below, the variables have been grouped into different categories:

### Univariate Analysis

At this stage, we explore variables one by one. The method used for univariate analysis depends on whether the variable is categorical or continuous. Let’s look at the methods and statistical measures for categorical and continuous variables individually:

**Continuous Variables:** In the case of continuous variables, we need to understand the central tendency and spread of the variable. These are measured using various statistical metrics and visualization methods, as shown below:

**Note:** Univariate analysis is also used to highlight missing and outlier values. In the upcoming part of this series, we will look at methods to handle missing and outlier values. To learn more about these methods, you can refer to the Descriptive Statistics course from Udacity.

**Categorical Variables:** For categorical variables, we’ll use a frequency table to understand the distribution of each category. We can also read it as the percentage of values under each category. It can be measured using two metrics, **Count** and **Count%**, against each category. A bar chart can be used for visualization.
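As a rough sketch of both cases, here is how the summary statistics and the Count / Count% table could be produced with pandas. The toy data and the column names `height_cm` and `plays_cricket` are made up for illustration and are not the data set from the tutorial.

```python
import pandas as pd

# Hypothetical toy data; values and column names are illustrative only.
df = pd.DataFrame({
    "height_cm": [150.0, 160.0, 165.0, 170.0, 180.0],
    "plays_cricket": ["yes", "no", "yes", "yes", "no"],
})

# Continuous variable: central tendency (mean, median) and spread (std, min, max).
summary = df["height_cm"].describe()
print(summary[["mean", "50%", "std", "min", "max"]])

# Categorical variable: Count and Count% for each category.
counts = df["plays_cricket"].value_counts()
freq_table = pd.DataFrame({
    "Count": counts,
    "Count%": (counts / counts.sum() * 100).round(1),
})
print(freq_table)
```

The `50%` row of `describe()` is the median, so a large gap between it and the mean is an early hint of skew or outliers.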

### Bi-variate Analysis

Bi-variate analysis finds out the relationship between two variables. Here, we look for association and disassociation between variables at a pre-defined significance level. We can perform bi-variate analysis for any combination of categorical and continuous variables. The combination can be: Categorical & Categorical, Categorical & Continuous and Continuous & Continuous. Different methods are used to tackle these combinations during the analysis process.
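To make the three combinations concrete, here is a small pandas sketch with one simple descriptive summary per combination. The data and column names are invented for illustration, and the sketch stops short of the significance tests mentioned above.

```python
import pandas as pd

# Hypothetical toy data; values and column names are illustrative only.
df = pd.DataFrame({
    "gender": ["M", "F", "M", "F", "M", "F"],
    "plays_cricket": ["yes", "no", "yes", "yes", "no", "no"],
    "height_cm": [170.0, 155.0, 175.0, 160.0, 180.0, 158.0],
    "weight_kg": [68.0, 50.0, 72.0, 55.0, 80.0, 53.0],
})

# Categorical & Categorical: two-way table of counts.
two_way = pd.crosstab(df["gender"], df["plays_cricket"])
print(two_way)

# Categorical & Continuous: compare the mean of the continuous variable per category.
mean_by_group = df.groupby("gender")["height_cm"].mean()
print(mean_by_group)

# Continuous & Continuous: Pearson correlation coefficient.
corr = df["height_cm"].corr(df["weight_kg"])
print(corr)
```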

Source: **www.analyticsvidhya.com**

## 6 Different Ways to Compensate for Missing Values In a Dataset (Data Imputation with examples)


Popular strategies to statistically impute missing values in a dataset.

Many real-world datasets may contain missing values for various reasons. They are often encoded as NaNs, blanks or any other placeholders. Training a model with a dataset that has a lot of missing values can drastically impact the machine learning model’s quality. Some algorithms, such as scikit-learn estimators, assume that all values are numerical and hold meaningful values.

One way to handle this problem is to get rid of the observations that have missing data. However, you will risk losing data points with valuable information. A better strategy would be to impute the missing values. In other words, we need to infer those missing values from the existing part of the data. There are three main types of missing data:

Missing completely at random (MCAR)

Missing at random (MAR)

Not missing at random (NMAR)

However, in this article, I will focus on 6 popular ways to impute data in cross-sectional datasets (time-series datasets are a different story).

## 1- Do Nothing:

That’s an easy one. You just let the algorithm handle the missing data. Some algorithms can factor in the missing values and learn the best imputation values for the missing data based on the training loss reduction (e.g. XGBoost). Some others have the option to just ignore them (e.g. LightGBM with use_missing=false). However, other algorithms will panic and throw an error complaining about the missing values (e.g. scikit-learn’s LinearRegression). In that case, you will need to handle the missing data and clean it before feeding it to the algorithm.

Let’s see some other ways to impute the missing values before training:

**Note: All the examples below use the California Housing Dataset from Scikit-learn.**

## 2- Imputation Using (Mean/Median) Values:

This works by calculating the mean/median of the non-missing values in a column and then replacing the missing values within each column separately and independently from the others. It can only be used with numeric data.

Mean Imputation

**Pros:**

Easy and fast.

Works well with small numerical datasets.

**Cons**:

Doesn’t factor in the correlations between features. It only works on the column level.

Will give poor results on encoded categorical features (do NOT use it on categorical features).

Not very accurate.

Doesn’t account for the uncertainty in the imputations.

Mean/Median Imputation
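As a minimal sketch of the idea (not the original example, which used the California Housing Dataset), column-wise mean or median imputation can be written with NumPy alone. The function name `impute_columns` and the toy array are made up for illustration.

```python
import numpy as np

def impute_columns(X, strategy="mean"):
    """Replace NaNs in each column with that column's mean or median."""
    X = np.asarray(X, dtype=float)
    if strategy == "mean":
        fill = np.nanmean(X, axis=0)    # per-column mean, ignoring NaNs
    elif strategy == "median":
        fill = np.nanmedian(X, axis=0)  # per-column median, ignoring NaNs
    else:
        raise ValueError("strategy must be 'mean' or 'median'")
    # Each column is filled independently of the others.
    return np.where(np.isnan(X), fill, X)

# Toy numeric data with one outlier (100.0) in the first column.
X = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
    [100.0, 30.0],
])

mean_filled = impute_columns(X, "mean")
median_filled = impute_columns(X, "median")
print(mean_filled)
print(median_filled)
```

In the first column, the outlier pulls the mean far away from the typical values while the median stays put, which is why the two strategies can give very different fills.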

## 3- Imputation Using (Most Frequent) or (Zero/Constant) Values:

**Most Frequent** is another statistical strategy to impute missing values and YES!! It works with categorical features (strings or numerical representations) by replacing missing data with the most frequent value within each column.

**Pros:**

Works well with categorical features.

**Cons:**

It also doesn’t factor in the correlations between features.

It can introduce bias in the data.

Most Frequent Imputation

**Zero or Constant** imputation, as the name suggests, replaces the missing values with either zero or any constant value you specify.
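For a categorical column, a mean or median is not defined for string labels, so the mode is the statistic that applies. Here is a short pandas sketch of both most-frequent and constant imputation; the `city` column and the `"Unknown"` placeholder are hypothetical choices for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with two missing entries.
df = pd.DataFrame({"city": ["Delhi", "Mumbai", np.nan, "Delhi", np.nan, "Pune"]})

# Most frequent: mode() ignores NaN and returns a Series, so take its first entry.
most_frequent = df["city"].mode()[0]
df["city_mode"] = df["city"].fillna(most_frequent)

# Constant: fill with an explicit placeholder category instead.
df["city_const"] = df["city"].fillna("Unknown")

print(df)
```

A placeholder category keeps the fact that the value was missing visible to the model, whereas the mode makes the most common category even more common, which is the bias mentioned in the cons above.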

## 4- Imputation Using k-NN:

The k-nearest neighbours (k-NN) algorithm is used for simple classification. It uses ‘**feature similarity**’ to predict the values of new data points: a new point is assigned a value based on how closely it resembles the points in the training set. This is very useful for predicting missing values: find the k closest neighbours of the observation with missing data, then impute the missing entries from the non-missing values in that neighbourhood. Let’s see some example code using the Impyute library, which provides a simple and easy way to use KNN for imputation:

KNN Imputation for California Housing Dataset

## How does it work?

It first creates a basic mean imputation, then uses the resulting complete list to construct a KDTree. It then uses the KDTree to compute the nearest neighbours (NN). After it finds the k nearest neighbours, it takes their weighted average.
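The steps above can be sketched roughly as follows. This is not Impyute’s code: it is a simplified illustration that uses a brute-force distance search instead of a KDTree, and the function name `knn_impute`, the inverse-distance weighting and the toy array are assumptions made for the example.

```python
import numpy as np

def knn_impute(X, k=3):
    """Fill NaNs using a distance-weighted average of the k nearest rows."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)

    # Step 1: basic mean impute so that distances can be computed on complete rows.
    col_means = np.nanmean(X, axis=0)
    filled = np.where(missing, col_means, X)
    result = filled.copy()

    for i in np.flatnonzero(missing.any(axis=1)):
        # Step 2: Euclidean distance from row i to every row (brute force, no KDTree).
        dist = np.linalg.norm(filled - filled[i], axis=1)
        dist[i] = np.inf  # a row is not its own neighbour
        neighbours = np.argsort(dist)[:k]

        # Step 3: inverse-distance weights; the small constant avoids division by zero.
        weights = 1.0 / (dist[neighbours] + 1e-8)
        for j in np.flatnonzero(missing[i]):
            result[i, j] = np.average(filled[neighbours, j], weights=weights)
    return result

# Toy data: row 1 has a missing second feature and sits close to rows 0 and 3.
X = np.array([
    [1.0, 2.0],
    [1.1, np.nan],
    [10.0, 20.0],
    [1.2, 2.2],
    [9.0, 19.0],
])
imputed = knn_impute(X, k=2)
print(imputed)
```

With `k=2`, the missing entry is filled from the two small-valued rows rather than from the column mean, which would have been dragged upwards by the two large rows.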

**Pros:**

Can be much more accurate than the mean, median or most frequent imputation methods (It depends on the dataset).

**Cons:**

Computationally expensive. KNN works by storing the whole training dataset in memory.

K-NN is quite sensitive to outliers in the data (**unlike SVM**)

## 5- Imputation Using Multivariate Imputation by Chained Equation (MICE)

Main steps used in multiple imputations [1]

Source: **towardsdatascience.com**


## Python – Replace Missing Values with Mean, Median & Mode

October 3, 2021 by Ajitesh Kumar

Missing values are common in real-world problems where the data is aggregated over long time stretches from disparate sources, and reliable machine learning modeling demands careful handling of missing data. One strategy is imputing the missing values, and a wide variety of algorithms exist, spanning simple interpolation (mean, median, mode), matrix factorization methods like SVD, statistical models like Kalman filters, and deep learning methods. Missing value imputation or replacement techniques help machine learning models learn from incomplete data. There are three main missing value imputation techniques: mean, median and mode. The mean is the average of all values in a set, the median is the middle number in a set of numbers sorted by size, and the mode is the value that occurs most often in a set.

In this blog post, you will learn about **how to impute or replace missing values** with the **mean, median** and **mode** in one or more numeric feature columns of a **Pandas DataFrame** while building machine learning (ML) models with Python. You will also learn how to decide **which technique to use** for **imputing missing values** with **central tendency measures** of a feature column such as the **mean, median** or **mode**. It is important for data scientists to understand this technique, as **handling missing values** is one of the key aspects of **data preprocessing** when training ML models.

The dataset used for illustration is related to campus recruitment and is taken from the Kaggle page on Campus Recruitment. As a first step, the data set is loaded. Here is the Python code for loading the dataset once you have downloaded it to your system.

```python
import pandas as pd
import numpy as np

# Load the campus recruitment dataset downloaded from Kaggle
df = pd.read_csv("/Users/ajitesh/Downloads/Placement_Data_Full_Class.csv")

# Preview the first five rows
df.head()
```

Here is what the data looks like. Make a note of **NaN** value under the salary column.

**Fig 1. Placement dataset for handling missing values using mean, median or mode**

Missing values are handled using different **interpolation techniques**, which estimate the missing values from the other training examples. In the above dataset, the missing values are found in the salary column. A command such as **df.isnull().sum()** prints the number of missing values in each column. The missing values in the salary column in the above example can be replaced using the following techniques:

Mean value of other salary values

Median value of other salary values

Mode (most frequent) value of other salary values.

Constant value

In this post, **fillna()** method on the data frame is used for imputing missing values with mean, median, mode or constant value. However, you may also want to check out the related post titled imputing missing data using Sklearn SimpleImputer wherein **sklearn.impute.SimpleImputer** is used for missing values imputation using mean, median, mode, or constant value. The SimpleImputer class provides basic strategies for imputing missing values. Missing values can be imputed with a provided constant value, or using the statistics (mean, median, or most frequent) of each column in which the missing values are located. You may also want to check out the Scikit-learn article – Imputation of missing values.
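Here is a minimal sketch of the `fillna()` approach on a toy stand-in for the placement data. The values are made up; only the `salary` column name follows the dataset described above, and the `status` column is included just to show a column without missing values.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the placement data; the values are made up.
df = pd.DataFrame({
    "status": ["Placed", "Placed", "Placed", "Not Placed", "Placed"],
    "salary": [200000.0, 250000.0, 250000.0, np.nan, 900000.0],
})

# Number of missing values per column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Each option returns a new Series; df itself is left unchanged.
salary_mean = df["salary"].fillna(df["salary"].mean())
salary_median = df["salary"].fillna(df["salary"].median())
salary_mode = df["salary"].fillna(df["salary"].mode()[0])  # mode() returns a Series
salary_const = df["salary"].fillna(0)

print(pd.DataFrame({
    "mean": salary_mean,
    "median": salary_median,
    "mode": salary_mode,
    "constant": salary_const,
}))
```

To keep one of the results, assign it back to the column, for example `df["salary"] = salary_median`.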


## How to decide which imputation technique to use?

One of the key decisions is which of the above-mentioned imputation techniques to use to get the most effective value for the missing values. In this post, the central tendency measures (mean, median and mode) are considered for imputation. The goal is to find out which is a **better measure of the central tendency of the data** and use that value for replacing missing values appropriately.

Plots such as **box plots** and **distribution plots** come in very handy when deciding which technique to use. You can use the following code to draw box and distribution plots.

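```python
import seaborn as sns
import matplotlib.pyplot as plt

# Box plot of the salary column
sns.boxplot(x=df["salary"])
plt.show()

# Distribution plot; histplot replaces the deprecated distplot
sns.histplot(df["salary"], kde=True)
plt.show()
```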

Here is what the box plot would look like. **You may note that the data is skewed.** A large number of data points act as outliers. Outliers have a significant impact on the mean, so in such cases it is not recommended to use the mean for replacing the missing values. Replacing missing values with the mean may not produce a great model here, so the mean is ruled out. For a symmetric data distribution, one can use the **mean value** for imputing missing values.

Thus, one may want to use either median or mode. Here is a great page on understanding boxplots.
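The same decision can be roughed out in code by checking the sample skewness instead of reading it off a plot. This is only a sketch: the helper name `choose_fill_value` and the 0.5 cut-off are assumptions for illustration, not a rule from the post, and the salary values are made up.

```python
import numpy as np
import pandas as pd

def choose_fill_value(series, skew_threshold=0.5):
    """Return (strategy, value): the mean for roughly symmetric data, else the median."""
    skew = series.skew()  # NaNs are skipped by default
    if abs(skew) <= skew_threshold:
        return "mean", series.mean()
    return "median", series.median()

# Made-up salaries with one large outlier and one missing value.
salary = pd.Series([200000.0, 220000.0, 240000.0, 250000.0, np.nan, 940000.0])

strategy, value = choose_fill_value(salary)
salary_filled = salary.fillna(value)
print(strategy, value)
```

The threshold is a judgment call, so it is still worth looking at the box plot before trusting the result.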
