Feature selection is the process of identifying and selecting a subset of the input features that are most relevant to the target variable.
It is often straightforward when working with real-valued input and output data, but it can be very challenging when working with numerical input data and a categorical target variable.
The most commonly used feature selection methods for numerical input data with a categorical target variable are:
a) The ANOVA F-test statistic
b) The mutual information statistic
Here, we will learn how to perform feature selection with numerical input data for classification.
Diabetes Numerical Dataset
We will use the "diabetes" dataset as the basis, which has been widely studied as a machine learning dataset since the 1990s.
This dataset classifies patient data according to whether or not the onset of diabetes occurred within five years.
A naive model can achieve an accuracy of around 65% on this dataset; a score of about 77% (+/- 5%) is considered good.
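As a starting point, here is a minimal sketch of loading the dataset with pandas; the filename "pima-indians-diabetes.csv" and the column layout (eight numerical input columns followed by a binary class label) are assumptions about how the file is stored locally.

```python
from pandas import read_csv

def load_dataset(filename):
    # load the dataset as a pandas DataFrame (no header row assumed)
    data = read_csv(filename, header=None)
    # split into input (X) and output (y) arrays
    array = data.values
    X = array[:, :-1]
    y = array[:, -1]
    return X, y

# assumed local filename; the standard dataset has 768 rows and 8 inputs
X, y = load_dataset('pima-indians-diabetes.csv')
print(X.shape, y.shape)
```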
Numerical Feature Selection
The two popular feature selection techniques that can be used for numerical input data are:
a) The ANOVA F-test statistic
b) The mutual information statistic
ANOVA F-test Feature Selection
Analysis of variance, known as ANOVA, is a parametric statistical hypothesis test used to determine whether the means of two or more samples of data come from the same distribution.
An F-statistic, or F-test, is a class of statistical tests that calculate the ratio between variance values, such as the variance from two different samples or the explained and unexplained variance, as determined by a statistical test like ANOVA. The ANOVA method is a type of F-statistic, referred to here as the ANOVA F-test.
ANOVA is generally used when one variable is numerical and the other is categorical, such as numerical input variables and a classification target variable in a classification task.
The results of this test can be used for feature selection, where the features that are independent of the target variable are removed from the dataset.
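As a sketch of how this might look in scikit-learn, SelectKBest can be configured with the f_classif score function; the choice of k=4 features is illustrative, and X and y are assumed to be the arrays from the loading sketch above.

```python
from sklearn.feature_selection import SelectKBest, f_classif

# keep the 4 features with the highest ANOVA F-statistic (k is illustrative)
fs = SelectKBest(score_func=f_classif, k=4)
X_selected = fs.fit_transform(X, y)

# report the F-statistic computed for each input feature
for i, score in enumerate(fs.scores_):
    print('Feature %d: F=%.2f' % (i, score))
print(X_selected.shape)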
Mutual Information Feature Selection
Mutual information, from the field of information theory, is the application of information gain to feature selection.
It is calculated between two variables and measures the reduction in uncertainty for one variable given a known value of the other variable.
Mutual information is straightforward to calculate between two discrete variables, such as categorical input and categorical output data, but it can be adapted for use with numerical input and categorical output data.
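The same SelectKBest wrapper can be reused with the mutual_info_classif score function, which estimates mutual information between numerical inputs and a categorical target; again, k=4 and the X and y arrays are assumptions carried over from the earlier sketches.

```python
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# keep the 4 features with the highest estimated mutual information
fs = SelectKBest(score_func=mutual_info_classif, k=4)
X_selected = fs.fit_transform(X, y)

# report the mutual information estimated for each input feature
for i, score in enumerate(fs.scores_):
    print('Feature %d: MI=%.4f' % (i, score))
```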
Modeling with Selected Features
There are various techniques for scoring features and selecting them. How do you know which one to use?
A robust approach is to evaluate models using different feature selection methods and to select the method that results in the best-performing model.
Also, logistic regression is a good model for testing feature selection methods, as it can perform better when irrelevant features are removed from the model.
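One way this evaluation might be sketched: fit the selector on the training set only (to avoid data leakage), then fit and score a logistic regression model on the held-out test set. The split size, k=4, and the solver choice are illustrative assumptions, not a definitive recipe.

```python
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# split into train and test sets (sizes are an assumption)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1)

# fit the feature selector on the training data only
fs = SelectKBest(score_func=f_classif, k=4)
X_train_fs = fs.fit_transform(X_train, y_train)
X_test_fs = fs.transform(X_test)

# fit and evaluate a logistic regression model on the selected features
model = LogisticRegression(solver='liblinear')
model.fit(X_train_fs, y_train)
yhat = model.predict(X_test_fs)
print('Accuracy: %.2f' % (accuracy_score(y_test, yhat) * 100))
```

Swapping f_classif for mutual_info_classif in this sketch is enough to compare the two selection methods on the same split.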