1. How would you define Machine Learning?
ML is a field of computer science/artificial intelligence in which we solve problems using the data we have at hand, applying algorithms together with knowledge of probability and statistics. The term ML was coined in 1959 by IBM scientist Arthur Samuel, and it conveys the idea of a “self-teaching computer”.
Tom Mitchell defined ML as: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E." This is the standard definition in use today.
The 3 categories of ML covered here are supervised, unsupervised and semi-supervised Machine Learning; reinforcement learning is commonly listed as a fourth.
Supervised ML algorithms are trained on labelled data, i.e. data where we know what to predict because every input value comes with its output value. Examples: e-commerce discount data, weather data, past stock price data. The task is to figure out the relationship between the inputs and the outputs.
Unsupervised ML algorithms have only input data, so we need to find structure or relationships within the data itself, for example customer segmentation.
Semi-supervised learning is a mix of both: we train a supervised ML model on a small amount of labelled data, apply it to a large amount of unlabelled data, and combine both to build a better model. This is often a practical approach for business cases, because labelling data is expensive.
2. What do you understand by a labelled training dataset?
A labelled training dataset is a dataset in which the outcome (the result, or target) is given for each input.
3. What are the 2 most common supervised ML tasks you have performed so far?
Regression and classification. In regression we predict a continuous output based on training data, such as the price of an item or the age of a person. In classification we predict one of a fixed set of outcomes: a yes/no answer, or one of several classes such as the name of a fruit (banana, apple, mango).
4. What kind of Machine Learning algorithm would you use to make a robot walk in various unknown areas?
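Reinforcement learning: the robot (the agent) learns a walking policy by trial and error, guided by rewards and penalties from its environment. A computer vision algorithm like a CNN (convolutional neural network) can help the robot perceive its surroundings, but on its own it does not learn how to walk.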
5. What kind of ML algorithm can you use to segment your users into multiple groups?
Since segmenting users into groups without predefined labels is an unsupervised ML task, we can use a clustering algorithm such as K-Means, DBSCAN or hierarchical clustering.
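As a rough illustration of clustering for segmentation, here is a minimal from-scratch sketch of K-Means (Lloyd's algorithm) in plain Python. The customer features, their values and the fixed starting centroids are hypothetical, chosen only to keep the example small and deterministic; in practice a library implementation would normally be used.

```python
# Minimal K-Means (Lloyd's algorithm) on toy 2-D customer data.
# All feature names and values below are hypothetical.

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centroids):
    """Return the index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
        for p in points
    ]

def update(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:
            new_centroids.append(old)  # keep the centroid of an empty cluster
            continue
        new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
    return new_centroids

def k_means(points, centroids, max_iter=100):
    """Alternate assignment and update steps until the centroids stop moving."""
    for _ in range(max_iter):
        labels = assign(points, centroids)
        new_centroids = update(points, labels, centroids)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids

# (orders per month, average basket value) for six hypothetical customers
customers = [(1, 10), (2, 12), (1, 11), (9, 80), (10, 85), (8, 82)]
initial_centroids = [(1, 10), (9, 80)]  # fixed start so the run is reproducible
labels, centroids = k_means(customers, initial_centroids)
print(labels)     # cluster index per customer
print(centroids)  # one centre per segment
```

No labels are supplied anywhere: the two segments emerge only from the distances between customers, which is what makes this an unsupervised task.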
6. What type of learning algorithm relies on a similarity measure to make predictions?
Instance-based learning. Such an algorithm makes a prediction by comparing a new data point with stored data points, using a similarity measure such as a distance. KNN and SOM (Self-Organizing Map) are examples of ML algorithms that work on a similarity measure to make a prediction.
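A minimal sketch of KNN in plain Python may make the idea concrete. It uses Euclidean distance as the similarity measure; the training points, labels and the choice of k=3 are hypothetical.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the majority label among the k training points closest to query."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical 2-D training data with two classes
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
train_labels = ["small", "small", "small", "large", "large", "large"]

print(knn_predict(train_points, train_labels, (1.8, 1.7)))
print(knn_predict(train_points, train_labels, (6.2, 6.4)))
```

There is no separate training step: the stored examples are the model, and all the work happens at prediction time.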
7. What is an online learning system?
There are several learning strategies; two important ones are batch and online learning. In batch learning we have all the data in hand as a batch, train on it, and select the model based on some score such as accuracy. In online learning the data arrives as a stream, or is fed sequentially to the learning algorithm, which updates the model incrementally and makes predictions on the fly. Validating this kind of learning is complex, and such systems are harder to maintain on a production platform.
8. What is out-of-core learning?
Out-of-core learning is used when the data is too big to fit in a single computer's RAM. The data is loaded in chunks and the model is updated incrementally on each chunk; scikit-learn supports this through the partial_fit API.
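The chunk-by-chunk idea can be sketched with scikit-learn's SGDClassifier, one of the estimators that implements partial_fit. The chunk_stream generator is a hypothetical stand-in for reading chunks from disk or from a stream, and the synthetic data is an assumption made only so the example is self-contained.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)

def chunk_stream(n_chunks, chunk_size):
    """Yield (X, y) chunks; a hypothetical stand-in for reads from disk or a stream."""
    for _ in range(n_chunks):
        X = rng.normal(size=(chunk_size, 2))
        y = (X[:, 0] + X[:, 1] > 0).astype(int)  # synthetic, linearly separable target
        yield X, y

model = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # all classes must be known on the first partial_fit call

# Only one chunk is held in memory at a time
for X_chunk, y_chunk in chunk_stream(n_chunks=20, chunk_size=100):
    model.partial_fit(X_chunk, y_chunk, classes=classes)

# Evaluate on a fresh chunk the model has never seen
X_test, y_test = next(chunk_stream(n_chunks=1, chunk_size=500))
accuracy = model.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

The same loop also describes online learning: replace the generator with a live feed and the model keeps updating as new data arrives.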
9. Can you name a couple of ML challenges that you have faced?
A. Limited data or examples for learning
B. Poor data quality: biased data, data with outliers, or highly correlated features
C. Data cleaning is time consuming
D. Trained model is underfitted, or overfitted to the training data so that it performs poorly on test data
E. Many learning algorithms and learning strategies to choose from; it is hard to know which is the best fit for a particular business case
10. Can you please give 1 example of hyperparameter tuning w.r.t. some classification algorithm?
Ridge classifier, tuning the hyperparameter alpha (the regularization strength).
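One possible way to tune alpha is a cross-validated grid search, sketched below with scikit-learn's RidgeClassifier and GridSearchCV. The synthetic dataset and the candidate alpha values are assumptions chosen for illustration, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data (an assumption, for illustration only)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Candidate regularization strengths; larger alpha means stronger regularization
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}

# 5-fold cross-validation scores every candidate value of alpha
search = GridSearchCV(RidgeClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

best_alpha = search.best_params_["alpha"]
print("best alpha:", best_alpha)
print("mean CV accuracy:", round(search.best_score_, 3))
```

After fitting, search.best_estimator_ holds a RidgeClassifier refitted on all of X and y with the selected alpha.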
11. What is out-of-bag evaluation?
In the supervised learning algorithm Decision Trees, we make decisions and predict values using if-else conditions arranged like the branches of a tree. The problem with this method is that very deep decision trees tend to overfit the training data, which leads to high variance. So we use an ensemble technique like Random Forest, which comes with an out-of-bag score/evaluation method. In a Random Forest we create a multitude of independent decision trees at training time, each on a bootstrap sample (training data points selected randomly with replacement), and output the majority prediction of all the trees as the final output. While selecting random data points from the main training set, some samples are left out of each tree's bootstrap sample; these are known as that tree's out-of-bag samples, and they are not used while building that tree.
The OOB score is the proportion of out-of-bag rows predicted correctly, where each row is predicted only by the trees that did not see it during training; the OOB error is the proportion of rows predicted wrongly.
Because the out-of-bag rows were not used to train the trees that predict them, OOB evaluation gives a validation estimate without data leakage and without setting aside a separate validation set. The reduction in variance comes from averaging many trees, not from the OOB evaluation itself. For a high volume of data, however, OOB evaluation takes a lot of time and computation.
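A minimal sketch of reading the OOB score from scikit-learn's RandomForestClassifier is shown below. The synthetic dataset and the number of trees are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification data (an assumption, for illustration only)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# oob_score=True asks the forest to score each row using only the trees
# whose bootstrap sample did not contain that row
forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
forest.fit(X, y)

oob_accuracy = forest.oob_score_  # fraction of rows predicted correctly out-of-bag
oob_error = 1.0 - oob_accuracy    # fraction predicted wrongly

print(f"OOB score: {oob_accuracy:.3f}")
print(f"OOB error: {oob_error:.3f}")
```

No train/test split was made here: the whole dataset was used for training, yet the OOB score is still an estimate on rows each tree had not seen.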
12. What do you understand by hard & soft voting classifiers?
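In a hard voting classifier, each model in the ensemble votes for a class and the class with the most votes is predicted. In a soft voting classifier, the class probabilities predicted by the models are averaged and the class with the highest average probability is predicted; this requires every model to be able to estimate probabilities. Not crossing or violating the margin defined by an SVM describes hard margin versus soft margin classification, which is a different concept.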
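The difference between the two voting schemes can be shown with a small plain-Python sketch. The three classifiers and their predicted probabilities are hypothetical, picked so that the two schemes disagree.

```python
from collections import Counter

# Predicted class probabilities for one sample from three hypothetical classifiers
probabilities = [
    {"A": 0.45, "B": 0.55},
    {"A": 0.45, "B": 0.55},
    {"A": 0.90, "B": 0.10},
]

def hard_vote(probabilities):
    """Each classifier votes for its most likely class; the majority wins."""
    votes = [max(p, key=p.get) for p in probabilities]
    return Counter(votes).most_common(1)[0][0]

def soft_vote(probabilities):
    """Average the probabilities per class; the highest average wins."""
    classes = probabilities[0].keys()
    averages = {
        c: sum(p[c] for p in probabilities) / len(probabilities) for c in classes
    }
    return max(averages, key=averages.get)

print("hard voting:", hard_vote(probabilities))
print("soft voting:", soft_vote(probabilities))
```

Two classifiers lean slightly towards B, so hard voting picks B, while the third is very confident in A, which pulls the average probability of A to 0.6 and makes soft voting pick A.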