Machine Learning is rapidly changing the world and affecting every part of our daily lives. Voice assistants use NLP and machine learning to make appointments, check our calendars, and play music, and recommendation systems suggest products so accurately that they can seem to predict what we will need before we even think of it.
What is Machine Learning?
Machine Learning behaves similarly to the growth of a child. As the child grows, their experience (E) in performing a task (T) increases, which results in a higher performance measure (P).
Machine Learning is the study of computer algorithms that improve automatically through experience. It focuses on the development of computer programs that can access data and use it to learn for themselves.
Machine Learning is the subfield of computer science that gives "computers the ability to learn without being explicitly programmed".
Application of Machine Learning in day-to-day life
Machine Learning Algorithms
Broadly there are three types of machine learning algorithms.
1. Supervised Learning
These algorithms work with a target / outcome variable (dependent variable) that is to be predicted from a given set of independent variables. Using these sets of variables, we generate a function that maps inputs to the desired outputs. The training process continues until the model achieves the desired level of accuracy on the training data.
Examples of supervised learning:- Regression, Decision Tree, Random Forest, KNN, Logistic Regression etc.
2. Unsupervised Learning
In these algorithms, we do not have any target or outcome variable to predict / estimate. The most common unsupervised learning method is cluster analysis, which is used for exploratory data analysis to find hidden patterns or groupings in data.
Examples of unsupervised learning:- Apriori algorithm, K-means.
3. Reinforcement Learning
In this type of algorithm, the machine is trained to make specific decisions. It works this way: the machine is exposed to an environment where it trains itself continuously using trial and error. The machine learns from past experience and tries to capture the best possible knowledge to make accurate business decisions.
Example of Reinforcement Learning:- Markov Decision Process
List of common machine learning algorithms
1. Linear Regression
It is used to estimate real values (cost of houses, number of calls, total sales, etc.) based on one or more continuous variables. Here, we establish a relationship between independent and dependent variables by fitting the best line. This best fit line is known as the regression line and is represented by the linear equation Y = a*X + b.
In this equation, Y is the dependent variable, X is the independent variable, a is the slope, and b is the intercept.
The best way to understand linear regression is to relive a childhood experience. Let us say you ask a child in fifth grade to arrange the people in their class in increasing order of weight, without asking them their weights! What do you think the child will do? He/she would likely look at (visually analyze) the height and build of people and arrange them using a combination of these visible parameters. This is linear regression in real life! The child has actually figured out that height and build are correlated to weight by a relationship, which looks like the equation above.
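There are four assumptions associated with a linear regression model: linearity (the relationship between X and the mean of Y is linear), homoscedasticity (the variance of the residuals is the same for any value of X), independence (observations are independent of each other), and normality (for any fixed value of X, Y is normally distributed).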
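As a rough illustration of fitting the best line Y = a*X + b, here is a minimal least-squares sketch in plain Python. The height and weight numbers are made up for the example, and the function name fit_line is an assumption of this sketch, not part of any library.

```python
def fit_line(xs, ys):
    """Return slope a and intercept b of the least-squares line Y = a*X + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of X and Y divided by the variance of X.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y).
    b = mean_y - a * mean_x
    return a, b


# Hypothetical data: heights in cm (X) and weights in kg (Y).
heights = [150, 160, 170, 180, 190]
weights = [50, 58, 66, 74, 82]

a, b = fit_line(heights, weights)
predicted = a * 175 + b  # estimated weight for a height of 175 cm
print(f"Y = {a:.2f}*X + {b:.2f}, prediction for 175 cm: {predicted:.1f} kg")
```

With real data the points will not sit exactly on a line, so the fitted a and b only describe the overall trend.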
2. Logistic Regression
Don't get confused by its name! It is a classification algorithm, not a regression algorithm. It is used to estimate discrete values (binary values like 0/1, yes/no, true/false) based on a given set of independent variable(s). In simple words, it predicts the probability of the occurrence of an event by fitting data to a logistic function. Since it predicts a probability, its output values lie between 0 and 1 (as expected).
Let's say your friend gives you a puzzle to solve. There are only 2 outcome scenarios: either you solve it or you don't. Now imagine that you are being given a wide range of puzzles/quizzes in an attempt to understand which subjects you are good at. The outcome of this study would be something like this: if you are given a trigonometry-based tenth-grade problem, you are 70% likely to solve it. On the other hand, if it is a fifth-grade history question, the probability of getting an answer is only 30%. This is what Logistic Regression provides you.
The output of a Logistic Regression model is a probability. We can select a threshold value. If the probability is greater than this threshold value, the event is predicted to happen; otherwise, it is predicted not to happen.
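The probability-plus-threshold idea can be sketched as follows. This is only an illustration for a single input feature: the weight and bias values are invented here, whereas a real model would learn them from training data.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, weight, bias):
    """Probability that the event happens for input x."""
    return sigmoid(weight * x + bias)


def predict(x, weight, bias, threshold=0.5):
    """Predict 1 (event happens) if the probability exceeds the threshold."""
    return 1 if predict_proba(x, weight, bias) > threshold else 0


# Hypothetical coefficients, chosen by hand for illustration only.
weight, bias = 1.5, -3.0

p_low = predict_proba(1.0, weight, bias)
p_high = predict_proba(4.0, weight, bias)
print(f"p(x=1) = {p_low:.3f}, p(x=4) = {p_high:.3f}")
print(predict(1.0, weight, bias), predict(4.0, weight, bias))
```

Raising the threshold makes the model more conservative about predicting that the event happens, and lowering it has the opposite effect.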
3. Decision Tree
This algorithm is used quite frequently. It is a type of supervised learning algorithm that is mostly used for classification problems, although it also handles regression problems. It works for both categorical and continuous dependent variables.
In the image above, you can see that the population is classified into four different groups based on multiple attributes to identify 'if they will play or not'. To split the population into groups that are as different from each other as possible, it uses various techniques like Gini, Information Gain, Chi-square, and entropy.
Decision trees are straightforward and easy to interpret. Their visual representation can help us to understand the strength and relationship among attributes. Decision trees work with both categorical and numerical features and can handle missing values.
Decision trees are often biased towards attributes with fewer levels or unique values. They can even overfit (giant tree) if training data is noisy. Additionally, a large number of features may lead to a detailed decision tree.
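To make the splitting criterion concrete, here is a small sketch that uses Gini impurity to pick the best threshold on one numeric feature. The temperature data and the function names are assumptions made up for this example. A real decision tree repeats this search over every feature and then recurses into each group.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0.0 for a pure group, higher for mixed groups."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


def weighted_gini(left, right):
    """Impurity of a split, weighted by the size of each side."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_threshold(values, labels):
    """Try each observed value as a split point and keep the purest split."""
    best_t, best_score = None, float("inf")
    for t in sorted(set(values)):
        left = [label for v, label in zip(values, labels) if v <= t]
        right = [label for v, label in zip(values, labels) if v > t]
        if not left or not right:
            continue  # a split must put something on both sides
        score = weighted_gini(left, right)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score


# Hypothetical data: temperature and whether people played.
temps = [15, 18, 21, 24, 30, 33]
plays = ["yes", "yes", "yes", "yes", "no", "no"]

threshold, impurity = best_threshold(temps, plays)
print(f"split at temperature <= {threshold}, weighted Gini = {impurity:.3f}")
```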
Advantages:
1. Can work with numerical and categorical features.
2. Requires little data preprocessing: no need for one-hot encoding, dummy variables, and so on.
3. Decision Tree can automatically handle missing values.
Disadvantages:
1. Inflexible, in the sense that you can't incorporate new data into them easily. If you obtain new labeled data, you have to retrain the tree from scratch on the whole dataset. This makes decision trees a poor choice for any application that requires dynamic model adjustment.
4. SVM - Support Vector Machine
In machine learning, support vector machines are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis.
In this algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features you have) with the value of each feature being the value of a particular coordinate.
Now, we find a line that splits the data between the two differently classified groups. This is the line for which the distance to the closest point in each of the two groups is as large as possible.
In the example shown above, the line which splits the data into two differently classified groups is the middle line, since it is the one farthest from the closest points of both groups. This line is our classifier. New data is then classified according to which side of the line it lands on.
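The "farthest from the closest points" idea can be checked numerically. The sketch below does not train an SVM: it only compares two hand-picked candidate lines on made-up 2-D points and keeps the one with the larger margin. A real SVM finds the best line by solving an optimization problem.

```python
import math


def margin(w, b, points):
    """Smallest distance from any point to the line w[0]*x + w[1]*y + b = 0."""
    norm = math.hypot(w[0], w[1])
    return min(abs(w[0] * x + w[1] * y + b) / norm for x, y in points)


def separates(w, b, group_a, group_b):
    """True if group_a lies on the positive side and group_b on the negative side."""
    def side(p):
        return w[0] * p[0] + w[1] * p[1] + b
    return all(side(p) > 0 for p in group_a) and all(side(p) < 0 for p in group_b)


# Hypothetical, clearly separated groups of points.
group_a = [(3.0, 3.0), (4.0, 3.0), (4.0, 4.0)]
group_b = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
all_points = group_a + group_b

# Two candidate lines (w, b), both of which separate the groups.
candidates = {
    "middle": ((1.0, 1.0), -3.5),  # the line x + y = 3.5
    "near_b": ((1.0, 1.0), -1.5),  # the line x + y = 1.5, hugging group_b
}

best_name = max(candidates, key=lambda name: margin(*candidates[name], all_points))
for name, (w, b) in candidates.items():
    print(name, separates(w, b, group_a, group_b), round(margin(w, b, all_points), 3))
print("largest margin:", best_name)
```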
5. Naive Bayes
Naive Bayes is a classification algorithm for binary (two-class) and multi-class classification problems. The technique is easiest to understand when described using binary or categorical input values. It is based on Bayes' theorem, P(H|E) = P(E|H) * P(H) / P(E), where:
P(H): The probability of hypothesis H being true. This is known as the prior probability.
P(E): The probability of the evidence.
P(E|H): The probability of the evidence given that hypothesis is true.
P(H|E): The probability of the hypothesis given that the evidence is true.
1. It is a kind of algorithm that works on Bayes' theorem.
2. The class with the maximum probability is taken as the most suitable class.
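As a small worked example of these quantities, the sketch below applies Bayes' theorem to two classes and picks the one with the highest posterior. The spam/ham probabilities are invented for illustration.

```python
def posterior(prior, likelihood, evidence):
    """Bayes' theorem: P(H|E) = P(E|H) * P(H) / P(E)."""
    return likelihood * prior / evidence


# Hypothetical numbers: H is the class of an email, E is "contains the word 'offer'".
priors = {"spam": 0.4, "ham": 0.6}        # P(H)
likelihoods = {"spam": 0.5, "ham": 0.1}   # P(E|H)

# P(E) is the total probability of the evidence over all hypotheses.
evidence = sum(likelihoods[h] * priors[h] for h in priors)

posteriors = {h: posterior(priors[h], likelihoods[h], evidence) for h in priors}
best_class = max(posteriors, key=posteriors.get)

for h, p in posteriors.items():
    print(f"P({h} | 'offer') = {p:.3f}")
print("predicted class:", best_class)
```

With several features, Naive Bayes multiplies the per-feature likelihoods together, which is where the "naive" independence assumption comes in.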
Types of Naive Bayes Algorithm.
1. Gaussian Naive Bayes - A variant of Naive Bayes that assumes the features follow a Gaussian (normal) distribution, so it supports continuous data. Like the other variants, it is a supervised classification algorithm based on Bayes' theorem: a simple technique that is nevertheless very capable.
2. Multinomial Naïve Bayes - Multinomial Naive Bayes is preferred for data that follows a multinomial distribution. It is widely used for text classification in NLP, where each event represents the occurrence of a word in a document.
3. Bernoulli Naïve Bayes - Bernoulli Naive Bayes is used when the data is distributed according to multivariate Bernoulli distributions. That is, there may be multiple features, but each one is assumed to hold a binary value, so it requires features to be binary-valued.
6. K-Nearest Neighbours
The KNN algorithm assumes that similar things exist in close proximity. In other words, similar things are near to each other.
KNN makes predictions using the training dataset directly.
Predictions are made for a new instance (x) by searching through the entire training set for the K most similar instances (the neighbors) and summarizing the output variable for those K instances. For regression this might be the mean output variable, in classification, this might be the mode (or most common) class value.
To determine which of the K instances in the training dataset are most similar to a new input, a distance measure is used. For real-valued input variables, the most popular distance measures are Euclidean distance and Manhattan distance.
Euclidean distance is calculated as the square root of the sum of the squared differences between a new point (x) and an existing point (xi) across all input attributes j.
EuclideanDistance(x, xi) = sqrt( sum( (xj – xij)^2 ) )
Manhattan distance is calculated as the sum of the absolute differences between the two real-valued vectors. It is also called City Block distance.
ManhattanDistance(x, xi) = sum( |xj - xij| )
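Both distance measures take only a few lines of plain Python. This is a minimal sketch with made-up points, and the function names are chosen for this example.

```python
import math


def euclidean_distance(x, xi):
    """Square root of the sum of squared differences across all attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, xi)))


def manhattan_distance(x, xi):
    """Sum of absolute differences across all attributes."""
    return sum(abs(a - b) for a, b in zip(x, xi))


new_point = (0.0, 0.0)
existing_point = (3.0, 4.0)

print(euclidean_distance(new_point, existing_point))  # straight-line distance
print(manhattan_distance(new_point, existing_point))  # distance along the axes
```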
There are many other distances that can be used, such as Hamming distance, Minkowski distance, etc. You can choose the best distance metric based on the properties of the data. If you are unsure, you can experiment with different metrics and different values of K together and see which mix results in the most accurate models.
The value of K can be found by algorithm tuning. It is a good idea to try different values of K (for example, from 1 to 21) and see what works best for your problem.
KNN can be used for both regression and classification problems.
KNN as regression -
When KNN is used for regression problems the prediction is based on the mean or the median of the K-most similar instances.
KNN as classification -
When KNN is used for classification, the output can be calculated as the class with the highest frequency from the K-most similar instances. Each instance in essence votes for their class and the class with the most votes is taken as the prediction.
Class probabilities can be calculated as the normalized frequency of samples that belong to each class in the set of K most similar instances for a new data instance. For example, in a binary classification problem (class is 0 or 1):
p(class=0) = count(class=0) / (count(class=0)+count(class=1))
If you are using K and you have an even number of classes (e.g. 2) it is a good idea to choose a K value with an odd number to avoid a tie. And the inverse, use an even number for K when you have an odd number of classes.
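The voting and class-probability steps can be sketched like this. The training points are made up, and knn_predict is a name chosen for this example. It uses Euclidean distance via math.dist, which requires Python 3.8 or later.

```python
import math
from collections import Counter


def knn_predict(train, new_point, k=3):
    """Classify new_point by majority vote among its k nearest training rows.

    train is a list of (features, label) pairs.
    Returns the predicted label and the class probabilities.
    """
    # Sort the whole training set by distance to the new point.
    by_distance = sorted(train, key=lambda row: math.dist(row[0], new_point))
    votes = Counter(label for _, label in by_distance[:k])
    # Normalized vote counts act as class probabilities.
    probabilities = {label: count / k for label, count in votes.items()}
    prediction = votes.most_common(1)[0][0]
    return prediction, probabilities


# Hypothetical binary classification data (class is 0 or 1).
train = [
    ((1.0, 1.0), 0), ((1.0, 2.0), 0), ((2.0, 1.0), 0),
    ((6.0, 6.0), 1), ((7.0, 6.0), 1), ((6.0, 7.0), 1),
]

# An odd K avoids ties when there are two classes.
prediction, probabilities = knn_predict(train, (2.0, 2.0), k=5)
print(prediction, probabilities)
```

For regression, the same neighbor search applies, but the prediction would be the mean or median of the neighbors' output values instead of a vote.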
Key Points and Data Preparation for KNN -
1. KNN stores the entire training dataset, which it uses as its representation.
2. KNN does not learn any model.
3. KNN makes predictions just in time by calculating the similarity between an input sample and each training instance.
4. There are many distance measures to choose from to match the structure of your input data.
5. It is a good idea to rescale your data, for example by normalization, when using KNN.
7. K-Means
k-means is one of the simplest unsupervised learning algorithms for solving the well-known clustering problem. The procedure groups a given data set into a certain number of clusters (assume k clusters) fixed a priori. The main idea is to define k centers, one for each cluster. These centers should be placed carefully, because different starting locations lead to different results. A good choice is to place them as far away from each other as possible.
The next step is to take each point in the data set and associate it with the nearest center. When no point is pending, the first step is complete and an early grouping is done. At this point, we re-calculate k new centroids as the barycenters of the clusters resulting from the previous step. With these k new centroids, a new binding is done between the same data set points and the nearest new center. This creates a loop, during which the k centers change their location step by step until they do not move anymore.
Finally, this algorithm aims at minimizing an objective function known as the squared error function, given by:
J(V) = sum( sum( ||xj - vi||^2 ) ), with the outer sum over clusters i = 1..c and the inner sum over the points j = 1..ci of cluster i
‘||xj - vi||’ is the Euclidean distance between data point xj in the ith cluster and that cluster's center vi.
‘ci’ is the number of data points in ith cluster.
‘c’ is the number of cluster centers.
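The assign-then-recompute loop and the squared error function can be sketched in plain Python. In this example the starting centers are passed in by hand and the points are made up. Real implementations usually pick the starting centers with a randomized strategy and run the algorithm several times.

```python
import math


def assign(points, centers):
    """For every point, return the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        for p in points
    ]


def recompute(points, labels, centers):
    """Move each center to the mean (barycenter) of the points assigned to it."""
    new_centers = []
    for i, old_center in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == i]
        if not members:
            new_centers.append(old_center)  # keep the center of an empty cluster
            continue
        dims = len(old_center)
        new_centers.append(
            tuple(sum(p[d] for p in members) / len(members) for d in range(dims))
        )
    return new_centers


def squared_error(points, labels, centers):
    """Objective J: sum of squared distances from each point to its center."""
    return sum(math.dist(p, centers[label]) ** 2 for p, label in zip(points, labels))


def k_means(points, centers, max_iter=100):
    centers = [tuple(c) for c in centers]
    labels = assign(points, centers)
    for _ in range(max_iter):
        new_centers = recompute(points, labels, centers)
        if new_centers == centers:
            break  # centers no longer move, so the loop has converged
        centers = new_centers
        labels = assign(points, centers)
    return centers, labels


# Hypothetical, well-separated data with k = 2.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centers, labels = k_means(points, centers=[(1.0, 1.0), (9.0, 8.0)])
error = squared_error(points, labels, centers)
print(centers, labels, round(error, 3))
```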
Advantages:
1. Fast, robust, and easy to understand.
2. The flexibility of k-means allows for easy adjustment if there are problems.
3. Gives the best result when the clusters in the dataset are distinct or well separated from each other.
4. Easy to interpret the clustering result.
Disadvantages:
1. Unable to handle noisy data and outliers.
2. It is not guaranteed to find the optimal set of clusters, and the number of clusters must be decided before the analysis.
8. Random Forest
Random Forest is a flexible, easy-to-use machine learning algorithm that produces a great result most of the time, even without hyperparameter tuning. It is also one of the most used algorithms because of its simplicity and versatility (it can be used for both classification and regression problems).
Random Forest is a supervised learning algorithm. The "forest" it builds is an ensemble of decision trees, usually trained with the "bagging" method. The general idea of the bagging method is that a combination of learning models improves the overall result.
Random Forest builds multiple decision trees and merges them together to get a more accurate and stable prediction.
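To give a feel for bagging, here is a deliberately simplified sketch. It trains many one-split "stumps" on bootstrap samples of a made-up 1-D dataset and merges them by majority vote. It is not a full random forest, which grows deeper trees and also samples a random subset of features at each split. All names here are assumptions of this example.

```python
import random
from collections import Counter


def fit_stump(xs, ys):
    """Find the threshold t with the fewest errors for: predict 1 if x > t else 0."""
    best_t, best_errors = None, None
    for t in sorted(set(xs)):
        errors = sum((1 if x > t else 0) != y for x, y in zip(xs, ys))
        if best_errors is None or errors < best_errors:
            best_t, best_errors = t, errors
    return best_t


def fit_bagged_stumps(xs, ys, n_models=25, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    thresholds = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        thresholds.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return thresholds


def predict(thresholds, x):
    """Merge the individual stumps by majority vote."""
    votes = Counter(1 if x > t else 0 for t in thresholds)
    return votes.most_common(1)[0][0]


# Hypothetical noisy data: the label is mostly 1 for larger x.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

forest = fit_bagged_stumps(xs, ys)
print(sorted(forest))
print(predict(forest, 1), predict(forest, 10))
```

Each stump sees a slightly different sample, so their thresholds differ. The vote smooths out those differences.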
One big advantage of the random forest is that it can be used for both classification and regression problems, which form the majority of current machine learning systems.
Feature Importance is one of the great qualities of the random forest and it is very easy to measure the relative importance of each feature on the prediction. Sklearn provides a great tool for this that measures the feature importance. By looking at the feature importance you can decide which features to possibly drop because they don’t contribute enough (or sometimes nothing at all) to the prediction process. This is important because a general rule in machine learning is that the more features you have the more likely your model will suffer from overfitting and vice versa.
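As a sketch of the Sklearn tool mentioned above, a fitted RandomForestClassifier exposes a feature_importances_ attribute. The dataset below is synthetic (generated with make_classification), and the parameter values are arbitrary choices for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 6 features, of which only 2 actually carry information.
X, y = make_classification(
    n_samples=300,
    n_features=6,
    n_informative=2,
    n_redundant=0,
    random_state=0,
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# One score per feature; the scores are normalized to sum to 1.
importances = model.feature_importances_
ranked = sorted(enumerate(importances), key=lambda pair: pair[1], reverse=True)
for index, score in ranked:
    print(f"feature {index}: {score:.3f}")
```

Features with scores near zero are candidates for dropping, though it is worth checking the effect on a validation set before removing them.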
Advantages:
1. It takes less training time as compared to other algorithms.
2. It predicts output with high accuracy, and it runs efficiently even on large datasets.
3. It can also maintain accuracy when a large proportion of the data is missing.
4. Used for classification and regression.
5. Prevents overfitting of the data.
Disadvantages:
1. High variance (the model can change quickly with a change in training data).
9. Gradient Boosting Algorithm -
Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees.
Types of Gradient Boosting Algorithms-
1. GBM
2. Light GBM
3. XG Boost
1. GBM
GBM is a boosting algorithm used when we deal with plenty of data and want a prediction with high predictive power. Boosting is an ensemble technique that combines the predictions of several base estimators in order to improve robustness over a single estimator. It combines multiple weak or average predictors to build a strong predictor. Boosting algorithms often do well in data science competitions like Kaggle, AV Hackathon, and CrowdAnalytix.
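The idea of combining weak predictors can be sketched for regression with squared error, where each new weak learner is fitted to the residuals of the current ensemble. This is a toy illustration on made-up 1-D data, using one-split stumps as the weak learners. The function names and the learning rate are assumptions of this example, not the API of any boosting library.

```python
def fit_stump(xs, residuals):
    """Best one-split regressor: returns (threshold, left_mean, right_mean)."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the largest value so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, t, left_mean, right_mean)
    return best[1:]


def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_boosted(xs, ys, n_rounds=20, learning_rate=0.5):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [
            p + learning_rate * stump_predict(stump, x)
            for p, x in zip(predictions, xs)
        ]
    return base, learning_rate, stumps


def boosted_predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)


def mean_squared_error(model, xs, ys):
    return sum((boosted_predict(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical, roughly linear data.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.2, 1.9, 3.1, 3.9, 5.2, 5.8, 7.1, 8.0]

for rounds in (0, 1, 20):
    model = fit_boosted(xs, ys, n_rounds=rounds)
    print(rounds, round(mean_squared_error(model, xs, ys), 4))
```

The training error shrinks as rounds are added. In practice the number of rounds and the learning rate are tuned on held-out data to avoid overfitting.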
2. Light GBM
LightGBM is a gradient boosting framework that uses tree-based learning algorithms. It is designed to be distributed and efficient with the following advantages:
It is a fast, high-performance gradient boosting framework based on decision tree algorithms, used for ranking, classification, and many other machine learning tasks.
3. XG Boost
XGBoost has very high predictive power, which makes it a popular choice when accuracy matters. It includes both a linear model solver and tree learning algorithms, and it is reported to be almost 10x faster than earlier gradient boosting implementations.
The two reasons to use XGBoost are also the two goals of the project: