Often a dataset contains many features that do not contribute to the output variable. Through feature selection we can identify the feature(s) that contribute most to the target variable.
Importance of Feature-Selection Techniques.
1) Feature selection is the process of automatically or manually selecting the features that contribute most to the prediction variable or output you are interested in.
2) Irrelevant features in your data can decrease the accuracy of your models and make them learn from noise.
3) Feature selection often increases the accuracy of machine learning models.
4) Beyond accuracy, reducing the number of features also reduces computation, complexity, and training time. It is therefore a good step before training on any dataset.
A Few Important Feature Selection Techniques:
Through feature selection we can easily understand which feature(s) play an important role in predicting the target.
For example, in a medical problem we can see which parameter(s) should be checked to diagnose a patient's condition, while the unimportant parameters can safely be ignored.
Here, I have taken the popular Breast Cancer dataset from the sklearn library.
There are many statistical methods for finding the features that play the most important role in a given context; I will show a few of them.
Which method to choose depends on the characteristics of the dataset.
Let's see the dataset we are going to use.
This popular Breast Cancer dataset has 30 features and one target variable; it is a binary classification problem.
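The article's loading code is not shown; a minimal sketch of pulling the dataset into a pandas DataFrame might look like this:

```python
from sklearn.datasets import load_breast_cancer
import pandas as pd

# Load the Breast Cancer dataset bundled with sklearn
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["target"] = data.target

print(df.shape)  # 569 samples, 30 features + 1 target column
```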
Here, I will implement the following methods:
1) Applying Pearson correlation, finding the coefficient values and hence the importance of the features.
2) Removing all constant or quasi-constant features.
3) RFE (Recursive Feature Elimination) with Gradient Boosting.
4) Boruta feature selection (a wrapper method).
In the next part (part 2) I will implement the following methods:
5) LASSO (Least Absolute Shrinkage and Selection Operator).
6) Random Forest Classifier selection method (SelectFromModel).
7) PCA (Principal Component Analysis).
8) Chi-square feature selection.
Now let's go through each method with the help of the dataset, which you can download from here.
This is a classification problem where 0 denotes Malignant and 1 denotes Benign. Let's count the classes.
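The counting step isn't reproduced in the text; a small sketch of counting the two classes could be:

```python
from sklearn.datasets import load_breast_cancer
import pandas as pd

data = load_breast_cancer()
y = pd.Series(data.target)

# 0 -> malignant, 1 -> benign
counts = y.value_counts()
print(counts)  # 357 benign, 212 malignant
```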
Let's take our four classifiers as:
1) rfc = Random Forest, 2) gbc = Gradient Boosting, 3) dtr = Decision Tree &
4) lgr = Logistic Regression.
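The article's for loop isn't shown; a sketch of defining the four classifiers and scanning train/test-split random states (the range of 20 states and the 80/20 split are assumptions, not stated in the article) might look like this:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)

rfc = RandomForestClassifier(random_state=0)
gbc = GradientBoostingClassifier(random_state=0)
dtr = DecisionTreeClassifier(random_state=0)
lgr = LogisticRegression(max_iter=5000)

# Scan train/test-split random states and record Random Forest accuracy;
# the same loop can be repeated for gbc, dtr, and lgr
results = {}
for state in range(20):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=state)
    rfc.fit(X_train, y_train)
    results[state] = accuracy_score(y_test, rfc.predict(X_test))
    print(f"random_state={state}: accuracy={results[state]:.4f}")
```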
Output of the for loop:
Applying Pearson correlation, finding values of the coefficients and hence importance of the features.
As we can see, at random state 14 the accuracy is highest (99.12%), so we will train our model with random state 14.
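The correlation code itself isn't included in the text; a minimal sketch of ranking features by Pearson correlation with the target might be the following (the 0.5 cutoff is an assumed threshold, not one the article states):

```python
from sklearn.datasets import load_breast_cancer
import pandas as pd

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["target"] = data.target

# Pearson correlation of every feature with the target
cor = df.corr()["target"].drop("target")

# Keep features whose absolute correlation exceeds a chosen threshold
selected = cor[cor.abs() > 0.5]
print(selected.sort_values())
```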
Now let's see the same with the Gradient Boosting classifier.
Now we will remove the constant and quasi-constant features among the 30 features, if any exist.
Quasi-constants are features that are almost constant: they take the same value for a large subset of the samples and therefore have very little impact on the output.
We will also check whether any duplicate features exist and, if found, remove them.
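A sketch of both checks using sklearn's `VarianceThreshold` (the 0.01 variance cutoff for "quasi-constant" is a choice, not a universal value):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import VarianceThreshold
import pandas as pd

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)

# Drop constant (zero-variance) and quasi-constant features
selector = VarianceThreshold(threshold=0.01)
selector.fit(X)
kept = X.columns[selector.get_support()]
print(f"{len(kept)} of {X.shape[1]} features kept")

# Drop duplicated feature columns, if any exist
X_dedup = X.loc[:, ~X.T.duplicated()]
print(f"{X_dedup.shape[1]} features after removing duplicates")
```

Note that the raw features here are on very different scales, so a fixed variance threshold removes the smallest-scale features; scaling first (or choosing the threshold per dataset) changes the outcome.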
RFE method with Gradient Boosting.
So, only 8 features were found to be the best and most important, with a maximum accuracy of 98.25% using the Random Forest classifier.
Now, using rfc, verify the accuracy with random state 17.
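The RFE code isn't reproduced in the text; a minimal sketch of selecting 8 features with a Gradient Boosting estimator could be (the `step=4` and smaller `n_estimators` are choices to keep the run time modest, not the article's settings):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import RFE
import pandas as pd

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Recursively fit the booster and drop the weakest features until 8 remain
rfe = RFE(estimator=GradientBoostingClassifier(n_estimators=50, random_state=0),
          n_features_to_select=8, step=4)
rfe.fit(X, y)
selected = list(X.columns[rfe.support_])
print(selected)
```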
Boruta feature selection (wrapper method).
What is the wrapper method?
The wrapper method generally follows one of three techniques to select the best features from the dataset. These are:
1) Forward Selection: Forward selection is an iterative method in which we start with no features in the model. In each iteration we add the feature that best improves the model, until adding a new variable no longer improves performance.
For example, suppose a dataset has five independent features ('A', 'B', 'C', 'D' & 'E') and a 'target' column as the dependent or output feature. First, we train our model with feature 'A' only and check the accuracy (say 'Accuracy 1'). In the next iteration we add the next feature ('B'), train again, and get a new accuracy (say 'Accuracy 2'). If 'Accuracy 2' is better than 'Accuracy 1', we keep that feature. In this way we add the features one by one, but if at any point adding a feature gives no improvement or decreases the accuracy, we skip that feature.
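The greedy procedure above can be sketched with sklearn's `SequentialFeatureSelector`; the stopping point of 5 features, the logistic-regression scorer, and the scaling pipeline are all assumptions for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
import pandas as pd

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Start from the empty set and greedily add the feature that most
# improves cross-validated accuracy, stopping at 5 features
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
sfs = SequentialFeatureSelector(model, n_features_to_select=5,
                                direction="forward", cv=3)
sfs.fit(X, y)
forward_selected = list(X.columns[sfs.get_support()])
print(forward_selected)
```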
2) Backward Elimination: In backward elimination, we start with all the features and remove the least significant feature at each iteration. We repeat this until no improvement is observed from removing a feature.
Suppose, as before, we have the same dataset (five independent features 'A', 'B', 'C', 'D' & 'E' and a dependent 'target' column). Using backward elimination, we first take all the features, train the model, use a statistical test to find the features with the least impact, and drop those features.
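Backward elimination can be sketched with the same selector run in the opposite direction; stopping after dropping 5 of the 30 features is an arbitrary choice to keep the example quick:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
import pandas as pd

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Start from all 30 features and repeatedly drop the one whose removal
# hurts cross-validated accuracy the least, stopping at 25 features
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
sbs = SequentialFeatureSelector(model, n_features_to_select=25,
                                direction="backward", cv=3)
sbs.fit(X, y)
kept_cols = set(X.columns[sbs.get_support()])
dropped = [c for c in X.columns if c not in kept_cols]
print("dropped:", dropped)
```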
3) Recursive Feature Elimination: This is a greedy optimization algorithm that aims to find the best-performing feature subset. It repeatedly builds models, setting aside the best- or worst-performing feature at each iteration. Once all the features have been exhausted, it ranks the features by their order of elimination and selects the best subset.
*Disadvantage: These wrapper-method techniques are usually feasible only when the dataset is small, because they require a lot of computation power, cost, and time. For large practical problems they can become impractical.
So, we will apply the wrapper method using the Boruta package for best-feature selection.
What is Boruta? And how does it work?
Boruta is an all-relevant feature selection wrapper algorithm, capable of working with any classification method that outputs a variable importance measure (VIM); by default, Boruta uses Random Forest. The method performs a top-down search for relevant features by comparing the original attributes' importance with the importance achievable at random, estimated using their permuted copies, and progressively eliminates irrelevant features to stabilise that test.
First it adds randomness to the dataset by creating shuffled copies of all the features (called shadow features), trains on this extended dataset, and at every iteration checks whether each real feature has higher importance than the best shadow feature; on that basis it selects the best features.
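The shadow-feature mechanism can be illustrated with plain sklearn and NumPy. This is a single-iteration sketch of the idea only; the actual BorutaPy package repeats this over many iterations with a statistical test before confirming or rejecting features:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
data = load_breast_cancer()
X, y = data.data, data.target

# Build a "shadow" copy of every feature: same values, independently
# shuffled within each column, so the shadows carry no real signal
X_shadow = np.column_stack([rng.permutation(col) for col in X.T])
X_aug = np.hstack([X, X_shadow])

rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X_aug, y)
importances = rf.feature_importances_
real_imp = importances[:X.shape[1]]
shadow_imp = importances[X.shape[1]:]

# A real feature scores a "hit" when it beats the best shadow feature
hits = real_imp > shadow_imp.max()
print([data.feature_names[i] for i in np.where(hits)[0]])
```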
Continue reading part 2 of this article here.