# Which method is used for encoding categorical variables? (a) One-Hot Encoder (b) Category Encoder (c) Label Encoder (d) Both a & c

### Mohammed

Guys, does anyone know the answer?


## sklearn.preprocessing.OneHotEncoder — scikit-learn


class sklearn.preprocessing.OneHotEncoder(*, categories='auto', drop=None, sparse=True, dtype=numpy.float64, handle_unknown='error', min_frequency=None, max_categories=None)

Encode categorical features as a one-hot numeric array.

The input to this transformer should be an array-like of integers or strings, denoting the values taken on by categorical (discrete) features. The features are encoded using a one-hot (aka ‘one-of-K’ or ‘dummy’) encoding scheme. This creates a binary column for each category and returns a sparse matrix or dense array (depending on the sparse parameter).

By default, the encoder derives the categories based on the unique values in each feature. Alternatively, you can also specify the categories manually.

This encoding is needed for feeding categorical data to many scikit-learn estimators, notably linear models and SVMs with the standard kernels.

Note: a one-hot encoding of y labels should use a LabelBinarizer instead.

Read more in the User Guide.
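As a quick illustration of the behavior described above (a minimal sketch; the toy color data is invented and scikit-learn is assumed to be installed):

```python
from sklearn.preprocessing import OneHotEncoder

# Toy single-feature input: one categorical column of color names
X = [['red'], ['green'], ['blue'], ['green']]

enc = OneHotEncoder()  # sparse output by default
enc.fit(X)

# Categories are derived from the unique values, sorted alphabetically
print(enc.categories_)  # [array(['blue', 'green', 'red'], dtype=object)]

# transform() returns a sparse matrix; toarray() densifies it
dense = enc.transform(X).toarray()
print(dense)
```

Each row contains exactly one `1.0`, in the column of that row's category, e.g. `'red'` maps to `[0., 0., 1.]` because `'red'` is the third category.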

Parameters:

**categories** : ‘auto’ or a list of array-like, default=’auto’

Categories (unique values) per feature:

‘auto’ : Determine categories automatically from the training data.

list : categories[i] holds the categories expected in the ith column. The passed categories should not mix strings and numeric values within a single feature, and should be sorted in case of numeric values.

The used categories can be found in the categories_ attribute.

New in version 0.20.

**drop** : {‘first’, ‘if_binary’} or an array-like of shape (n_features,), default=None

Specifies a methodology to use to drop one of the categories per feature. This is useful in situations where perfectly collinear features cause problems, such as when feeding the resulting data into an unregularized linear regression model.

However, dropping one category breaks the symmetry of the original representation and can therefore induce a bias in downstream models, for instance for penalized linear classification or regression models.

None : retain all features (the default).

‘first’ : drop the first category in each feature. If only one category is present, the feature will be dropped entirely.

‘if_binary’ : drop the first category in each feature with two categories. Features with 1 or more than 2 categories are left intact.

array : drop[i] is the category in feature X[:, i] that should be dropped.

New in version 0.21: The parameter drop was added in 0.21.

Changed in version 0.23: The option drop='if_binary' was added in 0.23.

Changed in version 1.1: Support for dropping infrequent categories.
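A short sketch of `drop='if_binary'` on made-up data (the sex/color columns are hypothetical):

```python
from sklearn.preprocessing import OneHotEncoder

# First column is binary, second has three categories
X = [['male', 'red'], ['female', 'blue'], ['female', 'green']]

# drop='if_binary' collapses only the binary column to a single 0/1 column;
# the three-category color column keeps all three columns
enc = OneHotEncoder(drop='if_binary')
out = enc.fit_transform(X).toarray()
print(out.shape)  # (3, 4): 1 column for the binary feature + 3 for color
```

With `drop='first'` instead, the color column would also lose its first category, giving 3 output columns in total.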

**sparse** : bool, default=True

Will return sparse matrix if set True else will return an array.

**dtype** : number type, default=float

Desired dtype of output.

**handle_unknown** : {‘error’, ‘ignore’, ‘infrequent_if_exist’}, default=’error’

Specifies the way unknown categories are handled during transform.

‘error’ : Raise an error if an unknown category is present during transform.

‘ignore’ : When an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None.

‘infrequent_if_exist’ : When an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will map to the infrequent category if it exists. The infrequent category will be mapped to the last position in the encoding. During inverse transform, an unknown category will be mapped to the category denoted 'infrequent' if it exists. If the 'infrequent' category does not exist, then transform and inverse_transform will handle an unknown category as with handle_unknown='ignore'. Infrequent categories exist based on min_frequency and max_categories. Read more in the User Guide.

Changed in version 1.1: 'infrequent_if_exist' was added to automatically handle unknown categories and infrequent categories.
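The `'ignore'` behavior can be seen on a tiny made-up example (a sketch, not the library's documentation):

```python
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit([['cat'], ['dog']])

# 'fish' was never seen during fit -> encoded as an all-zero row
row = enc.transform([['fish']]).toarray()
print(row)  # [[0. 0.]]

# inverse_transform maps the all-zero row back to None
print(enc.inverse_transform(row))  # [[None]]
```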

**min_frequency** : int or float, default=None

Specifies the minimum frequency below which a category will be considered infrequent.

If int, categories with a smaller cardinality will be considered infrequent.

If float, categories with a smaller cardinality than min_frequency * n_samples will be considered infrequent.

New in version 1.1: Read more in the User Guide.

**max_categories** : int, default=None

Specifies an upper limit to the number of output features for each input feature when considering infrequent categories. If there are infrequent categories, max_categories includes the category representing the infrequent categories along with the frequent categories. If None, there is no limit to the number of output features.

New in version 1.1: Read more in the User Guide.

Attributes:

**categories_** : list of arrays

The categories of each feature determined during fitting (in order of the features in X and corresponding with the output of transform). This includes the category specified in drop (if any).

**drop_idx_** : array of shape (n_features,)

drop_idx_[i] is the index in categories_[i] of the category to be dropped for each feature.

drop_idx_[i] = None if no category is to be dropped from the feature with index i, e.g. when drop='if_binary' and the feature isn’t binary.

## Categorical encoding using Label-Encoding and One-Hot-Encoder

In Machine Learning, convert categorical data into numerical data using Label-Encoder and One-Hot-Encoder


**Categorical encoding using Label-Encoding and One-Hot-Encoder**

In many Machine-learning or Data Science activities, the data set might contain text or categorical values (basically non-numerical values) — for example, a color feature with values like red, orange, blue, white, etc., or a meal plan with values like breakfast, lunch, snacks, dinner, tea, etc. A few algorithms, such as CatBoost and decision trees, can handle categorical values very well, but most algorithms expect numerical values to achieve state-of-the-art results.

Over your learning curve in AI and Machine Learning, one thing you will notice is that most algorithms work better with numerical inputs. Therefore, the main challenge faced by an analyst is to convert text/categorical data into numerical data while still allowing an algorithm/model to make sense of it. Neural networks, which are the basis of deep learning, expect input values to be numerical.

There are many ways to convert categorical values into numerical values. Each approach has its own trade-offs and impact on the feature set. Here, I will focus on two main methods: One-Hot-Encoding and Label-Encoder. Both of these encoders are part of the SciKit-learn library (one of the most widely used Python libraries) and are used to convert text or categorical data into the numerical data that models expect and perform better with.

Code snippets in this article are in Python, since I am more comfortable with Python. If you need them for R (another widely used machine-learning language), say so in the comments.

**Label Encoding**

This approach is very simple: it involves converting each value in a column to a number. Consider a dataset of bridges with a column named bridge-type containing the values below. Though there will be many more columns in the dataset, to understand label encoding we will focus on this one categorical column only.

**BRIDGE-TYPE**

- Arch
- Beam
- Truss
- Cantilever
- Tied Arch
- Suspension
- Cable

We choose to encode the text values by assigning a running sequence number to each value: Arch = 0, Beam = 1, Truss = 2, Cantilever = 3, Tied Arch = 4, Suspension = 5, Cable = 6.

With this, we have completed the label encoding of the variable bridge-type. That’s all label encoding is about. But depending upon the data values and type of data, label encoding induces a new problem, since it uses number sequencing. The problem with using numbers is that they introduce a relation/comparison between values. Apparently there is no relation between the various bridge types, but when looking at the numbers one might think that the ‘Cable’ bridge type has higher precedence than the ‘Arch’ bridge type. The algorithm might misunderstand the data as having some kind of hierarchy/order 0 < 1 < 2 … < 6 and might give ‘Cable’ six times more weight than ‘Arch’ in its calculations.

Let’s consider another column named ‘Safety Level’. Performing label encoding on this column also induces an order/precedence among the numbers, but in the right way. Here the numerical order does not look out of place, and it makes sense if the algorithm interprets the safety order as 0 < 1 < 2 < 3 < 4, i.e. none < low < medium < high < very high.
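For an ordered column like this, one simple way to keep the intended order (a sketch; the ‘Safety Level’ data here is invented) is to spell out the mapping explicitly, so the integers follow the real-world ranking rather than alphabetical order:

```python
import pandas as pd

# Hypothetical 'Safety Level' column with a natural rank ordering
safety_df = pd.DataFrame(
    {'Safety_Level': ['none', 'low', 'medium', 'high', 'very high', 'low']})

# Explicit mapping preserves the intended order none < low < ... < very high
order = {'none': 0, 'low': 1, 'medium': 2, 'high': 3, 'very high': 4}
safety_df['Safety_Level_Cat'] = safety_df['Safety_Level'].map(order)
print(safety_df)
```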

**Label Encoding in Python**

**Using category codes approach:**

This approach requires the category column to be of ‘category’ datatype. By default, a non-numerical column is of ‘object’ type, so you might have to change the type to ‘category’ before using this approach.

```python
# import required libraries
import pandas as pd
import numpy as np

# creating initial dataframe
bridge_types = ('Arch', 'Beam', 'Truss', 'Cantilever', 'Tied Arch', 'Suspension', 'Cable')
bridge_df = pd.DataFrame(bridge_types, columns=['Bridge_Types'])

# converting type of column to 'category'
bridge_df['Bridge_Types'] = bridge_df['Bridge_Types'].astype('category')

# Assigning numerical values and storing in another column
bridge_df['Bridge_Types_Cat'] = bridge_df['Bridge_Types'].cat.codes

bridge_df
```

**Using sci-kit learn library approach:**

Another common approach by which many data analysts perform label encoding is to use the SciKit-learn library.

```python
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder

# creating initial dataframe
bridge_types = ('Arch', 'Beam', 'Truss', 'Cantilever', 'Tied Arch', 'Suspension', 'Cable')
bridge_df = pd.DataFrame(bridge_types, columns=['Bridge_Types'])

# creating instance of labelencoder
labelencoder = LabelEncoder()

# Assigning numerical values and storing in another column
bridge_df['Bridge_Types_Cat'] = labelencoder.fit_transform(bridge_df['Bridge_Types'])

bridge_df
```

bridge_df with categorical column and label-encoded column values

Source: **towardsdatascience.com**


## Ordinal and One-Hot Encodings for Categorical Data

by Jason Brownlee on June 12, 2020 in Data Preparation

Last Updated on August 17, 2020

Machine learning models require all input and output variables to be numeric.

This means that if your data contains categorical data, you must encode it to numbers before you can fit and evaluate a model.

The two most popular techniques are an **Ordinal Encoding** and a **One-Hot Encoding**.

In this tutorial, you will discover how to use encoding schemes for categorical machine learning data.

After completing this tutorial, you will know:

Encoding is a required pre-processing step when working with categorical data for machine learning algorithms.

How to use ordinal encoding for categorical variables that have a natural rank ordering.

How to use one-hot encoding for categorical variables that do not have a natural rank ordering.

**Kick-start your project**with my new book Data Preparation for Machine Learning, including step-by-step tutorials and the Python source code files for all examples.

Let’s get started.

*Ordinal and One-Hot Encoding Transforms for Machine Learning. Photo by Felipe Valduga, some rights reserved.*

## Tutorial Overview

This tutorial is divided into six parts; they are:

1. Nominal and Ordinal Variables
2. Encoding Categorical Data
   1. Ordinal Encoding
   2. One-Hot Encoding
   3. Dummy Variable Encoding
3. Breast Cancer Dataset
4. OrdinalEncoder Transform
5. OneHotEncoder Transform
6. Common Questions

## Nominal and Ordinal Variables

Numerical data, as its name suggests, involves features that are only composed of numbers, such as integers or floating-point values.

Categorical data are variables that contain label values rather than numeric values.

The number of possible values is often limited to a fixed set.

Categorical variables are often called nominal.

Some examples include:

A “pet” variable with the values: “dog” and “cat“.

A “color” variable with the values: “red“, “green“, and “blue“.

A “place” variable with the values: “first“, “second“, and “third“.

Each value represents a different category.

Some categories may have a natural relationship to each other, such as a natural ordering.

The “place” variable above does have a natural ordering of values. This type of categorical variable is called an ordinal variable because the values can be ordered or ranked.

A numerical variable can be converted to an ordinal variable by dividing the range of the numerical variable into bins and assigning values to each bin. For example, a numerical variable between 1 and 10 can be divided into an ordinal variable with 5 labels with an ordinal relationship: 1-2, 3-4, 5-6, 7-8, 9-10. This is called discretization.
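The discretization example above can be sketched with pandas (a minimal sketch; the six sample values are made up):

```python
import pandas as pd

# Numerical values between 1 and 10
values = pd.Series([1, 3, 5, 7, 9, 10])

# Five bins with an ordinal relationship; right edges are inclusive,
# so (0, 2] -> '1-2', (2, 4] -> '3-4', and so on
labels = ['1-2', '3-4', '5-6', '7-8', '9-10']
binned = pd.cut(values, bins=[0, 2, 4, 6, 8, 10], labels=labels)
print(list(binned))  # ['1-2', '3-4', '5-6', '7-8', '9-10', '9-10']
```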

**Nominal Variable**(Categorical). Variable comprises a finite set of discrete values with no relationship between values.

**Ordinal Variable**. Variable comprises a finite set of discrete values with a ranked ordering between values.

Some algorithms can work with categorical data directly.

For example, a decision tree can be learned directly from categorical data with no data transform required (this depends on the specific implementation).

Many machine learning algorithms cannot operate on label data directly. They require all input variables and output variables to be numeric.

In general, this is mostly a constraint of the efficient implementation of machine learning algorithms rather than hard limitations on the algorithms themselves.

Some implementations of machine learning algorithms require all data to be numerical. For example, scikit-learn has this requirement.

This means that categorical data must be converted to a numerical form. If the categorical variable is an output variable, you may also want to convert predictions by the model back into a categorical form in order to present them or use them in some application.
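Converting an output variable to numbers and mapping predictions back can be sketched with scikit-learn's LabelEncoder (the spam/ham labels here are hypothetical):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical target labels
y = ['spam', 'ham', 'spam', 'ham']

le = LabelEncoder()
y_enc = le.fit_transform(y)  # ham -> 0, spam -> 1 (alphabetical)
print(list(y_enc))           # [1, 0, 1, 0]

# Map integer predictions from a model back to the original labels
preds = le.inverse_transform([1, 0, 1])
print(list(preds))           # ['spam', 'ham', 'spam']
```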


## Encoding Categorical Data

There are three common approaches for converting ordinal and categorical variables to numerical values. They are:

1. Ordinal Encoding
2. One-Hot Encoding
3. Dummy Variable Encoding

Let’s take a closer look at each in turn.

### Ordinal Encoding

In ordinal encoding, each unique category value is assigned an integer value.

For example, “red” is 1, “green” is 2, and “blue” is 3.

This is called an ordinal encoding or an integer encoding and is easily reversible. Often, integer values starting at zero are used.

For some variables, an ordinal encoding may be enough. The integer values have a natural ordered relationship between each other and machine learning algorithms may be able to understand and harness this relationship.

It is a natural encoding for ordinal variables. For categorical variables, it imposes an ordinal relationship where no such relationship may exist. This can cause problems and a one-hot encoding may be used instead.
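A sketch of ordinal encoding with scikit-learn's OrdinalEncoder, using the color example above (note that by default the integers follow alphabetical order, not the order you might intend):

```python
from sklearn.preprocessing import OrdinalEncoder

X = [['red'], ['green'], ['blue']]

# Default: categories ordered alphabetically (blue=0, green=1, red=2)
out_default = OrdinalEncoder().fit_transform(X)
print(out_default.tolist())   # [[2.0], [1.0], [0.0]]

# Passing categories explicitly imposes the order you actually want
out_explicit = OrdinalEncoder(
    categories=[['red', 'green', 'blue']]).fit_transform(X)
print(out_explicit.tolist())  # [[0.0], [1.0], [2.0]]
```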
