Feature selection, also known as feature pruning, is a very important step in the pipeline of building a good prediction model. It helps us understand the connections between the features and the target.
The goals of feature selection are:
a) To identify and remove features with little or no predictive power, which helps prevent overfitting.
b) To identify highly correlated features and suppress their negative impact on the model.
We will review the following approaches to feature selection in the context of linear and logistic regression:
1) Statistical Inference:
This approach estimates the standard errors of the regression model's coefficients and then constructs confidence intervals and p-values to test whether the coefficients are different from 0. When the null hypothesis of a coefficient being zero is rejected with a small p-value, it means the corresponding feature has some genuine effect on the target.
The Central Limit Theorem states that, for a large sample size, the sampling distribution of a coefficient estimate is approximately normal.
The key to defining this distribution is estimating the standard deviation of the coefficient estimate. This value can be taken as a measure of the precision of the coefficient.
The standard errors of the model coefficients can be calculated as the square roots of the diagonal entries of the covariance matrix of the coefficient estimates. In Python, this calculation is available in the statsmodels library.
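For ordinary least squares this can be written explicitly as follows (a sketch in our own notation, with X the design matrix and σ̂² the estimated residual variance):

$$\widehat{\mathrm{SE}}(\hat{\beta}_j) = \sqrt{\left[\hat{\sigma}^2 (X^\top X)^{-1}\right]_{jj}}$$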
The standard error of the regression coefficient thus serves as a direct metric for evaluating the connection between a feature and the target.
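As a minimal sketch of this approach (the data here is synthetic and the 0.05 threshold is only illustrative), the p-values reported by statsmodels can be used to flag features whose coefficients are not significantly different from zero:

```python
import numpy as np
import statsmodels.api as sm

# Synthetic data: only the first two of four features actually drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Add an intercept column and fit ordinary least squares.
results = sm.OLS(y, sm.add_constant(X)).fit()

# Standard errors are the square roots of the diagonal of the covariance
# matrix of the coefficient estimates; statsmodels exposes them directly,
# along with p-values for the null hypothesis that each coefficient is zero.
print(results.bse)      # standard errors of the coefficients
print(results.pvalues)  # p-values

# Keep features whose null hypothesis is rejected at the 5% level
# (the first entry corresponds to the intercept, so skip it).
selected = [i for i, p in enumerate(results.pvalues[1:]) if p < 0.05]
print("Selected feature indices:", selected)
```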
2) Greedy Search:
In comparison with the approach stated above, this method is more practical and leans more towards machine learning engineering. The general idea is to train models on various combinations of features and narrow down to the feature subset with the optimal model performance. There are many variations of the greedy search strategy. Here, we will discuss two of them:
i) Univariate Selection:
It is the simplest of all the search methods. It evaluates how good a feature is by estimating its predictive value with respect to the response when taken alone, and it removes the features that perform poorly in this test.
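A minimal sketch with scikit-learn, assuming a synthetic regression dataset and an F-test as the per-feature score (both choices are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data with 10 features, only 3 of which are informative.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3, random_state=0)

# Score each feature on its own against the response, then keep the
# k best-scoring features and drop the rest.
selector = SelectKBest(score_func=f_regression, k=3)
X_selected = selector.fit_transform(X, y)

print("Per-feature scores:", selector.scores_.round(1))
print("Kept feature indices:", selector.get_support(indices=True))
```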
ii) Recursive Elimination:
It starts the feature selection process backward from the full feature space. At every iteration, the least important feature (for example, the one with the smallest coefficient) is removed, and the performance of the model is re-evaluated.
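A minimal sketch with scikit-learn's recursive feature elimination, again on an illustrative synthetic dataset and with a linear model as the estimator:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data with 10 features, only 3 of which are informative.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3, random_state=0)

# Start from the full feature space and repeatedly drop the feature whose
# coefficient contributes least, until 3 features remain.
rfe = RFE(estimator=LinearRegression(), n_features_to_select=3, step=1)
rfe.fit(X, y)

print("Kept feature indices:", rfe.get_support(indices=True))
print("Feature ranking (1 = kept):", rfe.ranking_)
```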
3) Regularization:
It is another way of identifying the important features, without removing any features from the original data set. Regularization shrinks the coefficients of the unimportant features so that they contribute little or nothing to the prediction results.
This approach is divided into two separate branches based on how the coefficients are penalized.
If the penalty is on the sum of the absolute values of the coefficients, i.e. the L1 norm, the algorithm is called Lasso Regression. If the penalty is on the sum of the squares of the coefficients, i.e. the L2 norm, the algorithm is called Ridge Regression.
This small difference in the penalty term results in completely different behavior of the two regularization algorithms.
Lasso Regression is used as a method for feature selection due to its ability to shrink feature coefficients exactly to zero, whereas Ridge Regression only shrinks them towards zero. There are other variations of regularization as well, but all have their own merits and demerits.
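A minimal sketch comparing the two penalties (the synthetic data and the alpha values are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data with 10 features, only 3 of which are informative.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

# Lasso (L1 penalty) drives the coefficients of uninformative features
# exactly to zero; Ridge (L2 penalty) only shrinks them towards zero.
lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("Lasso coefficients:", np.round(lasso.coef_, 2))
print("Ridge coefficients:", np.round(ridge.coef_, 2))

# Features with non-zero Lasso coefficients are the ones it selects.
print("Selected feature indices:", np.flatnonzero(lasso.coef_))
```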