data-science

We often use algorithms in computers and machines, and these algorithms appear almost everywhere. A machine learning algorithm generally needs every input and output to be numerical, whereas categorical data is a variable whose observations fall into a fixed set of categories or groups, i.e. label values rather than numbers. Hence, categorical data needs to be encoded into numbers before it can be fitted and used by such an algorithm. For example, a colour variable with the values "Red", "Blue" and "Green" might need to be mapped to "Red = 1", "Blue = 2" and "Green = 3". Two techniques are most commonly used to convert categorical data into numerical data: i) Ordinal Encoding and ii) One-Hot Encoding.
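As a rough sketch of the idea, before reaching for any library, the colour example above can be encoded by hand with a plain dictionary (the mapping is just the illustration used in the text, not a fixed convention):

# A minimal hand-rolled sketch of the colour example above; the mapping
# {"Red": 1, "Blue": 2, "Green": 3} is only the illustration from the text.
colour_map = {"Red": 1, "Blue": 2, "Green": 3}
colours = ["Red", "Green", "Blue", "Red"]
encoded = [colour_map[c] for c in colours]
print(encoded)  # [1, 3, 2, 1]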

Ordinal Encoding is a technique where each categorical value is mapped to an integer value, for example "Dog" to 1 and "Cat" to 2. Ordinal encoding is reversible and easy to do. It should be used when the values have an ordered relationship between them. Ordinal encoding can be done in Python with scikit-learn, which assigns an integer to each category, by default in sorted (alphabetical) order, or in any specific order you choose. A small example of the coding is:

from numpy import asarray
from sklearn.preprocessing import OrdinalEncoder

# Input must be a 2-D array: one row per observation, one column per feature.
data = asarray([["Kolkata"], ["Mumbai"], ["Delhi"]])
print(data)

encoder = OrdinalEncoder()
result = encoder.fit_transform(data)
print(result)

After the program is run, the expected output is:

[['Kolkata']
 ['Mumbai']
 ['Delhi']]
[[1.]
 [2.]
 [0.]]

(The encoder sorts the categories alphabetically, so Delhi = 0, Kolkata = 1 and Mumbai = 2.)

Note that OrdinalEncoder expects its input as a matrix, i.e. the input variable organised into rows and columns, with one row per observation. The advantage that ordinal encoding provides is ease of collation, categorisation and processing. The interesting part is that its main advantage is also its biggest disadvantage: the integers impose an order and a spacing on the values, so the model cannot tell when that order is not actually meaningful.
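When the values really do carry an order, it is usually better to state that order explicitly rather than rely on the default alphabetical one. A minimal sketch, using made-up temperature labels (the labels and their ordering are assumptions for illustration only):

from numpy import asarray
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical labels with a natural order: cold < warm < hot.
data = asarray([["cold"], ["hot"], ["warm"]])

# Passing `categories` fixes the integer order instead of the default sorted one.
encoder = OrdinalEncoder(categories=[["cold", "warm", "hot"]])
print(encoder.fit_transform(data))  # [[0.] [2.] [1.]]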

Now, for categorical data where no ordered relationship between the values exists, One-Hot Encoding is used instead, since applying ordinal encoding there would give poor performance. In One-Hot Encoding each bit represents one possible value. The integer-encoded variable is removed and one new binary variable is added for each unique value. If a variable cannot belong to multiple categories at once, then only one bit of the group is ever set. This can also be done in Python. That is the whole idea of One-Hot Encoding, and a small example of it is:

from numpy import asarray

from sklearn.preprocessing import OneHotEncoder

data = asarray([["Kolkata"], ["Mumbai"], ["Delhi"]])

print(data)

encoder = OneHotEncoder(sparse_output=False)  # use sparse=False on scikit-learn older than 1.2

onehot = encoder.fit_transform(data)

print(onehot)

After the program is run, the result will be shown as:

[['Kolkata']
 ['Mumbai']
 ['Delhi']]
[[0. 1. 0.]
 [0. 0. 1.]
 [1. 0. 0.]]

(Again the categories are sorted alphabetically, so the columns correspond to Delhi, Kolkata and Mumbai in that order.)

One-Hot Encoding has the advantage that its outputs are binary and only one bit is set per value. Its disadvantage is that it is not as simple as ordinal encoding, and the feature space can blow up quickly for high-cardinality variables. There are also many other methods by which categorical data can be converted, such as Label Encoding, Mean Encoding, Weight of Evidence Encoding, Probability Ratio Encoding, Hashing Encoding, Helmert Encoding, Backward Difference Encoding, Leave-One-Out Encoding, James-Stein Encoding, M-Estimator Encoding and Thermometer Encoding. All of these can be used, but Ordinal Encoding and One-Hot Encoding remain the most widely used and the best starting points.
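Of the alternatives listed above, Label Encoding is the one shipped directly with scikit-learn. A minimal sketch of how it differs from OrdinalEncoder (it is meant for a 1-D array of target labels rather than a 2-D feature matrix):

from sklearn.preprocessing import LabelEncoder

# LabelEncoder works on a 1-D array of labels, typically the target column.
labels = ["Dog", "Cat", "Dog", "Bird"]
encoder = LabelEncoder()
print(encoder.fit_transform(labels))  # [2 1 2 0] -- sorted classes: Bird=0, Cat=1, Dog=2
print(encoder.classes_)               # ['Bird' 'Cat' 'Dog']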