Statistics Interview Questions

###statistics ###interview_questions

ankit marwaha Oct 25 2021 · 8 min read

9. What do you understand by inferential statistics?

Ans9. Inferential statistics is the branch of statistics in which we draw a random sample from a large population, run statistical tests on that sample, and use the results to draw conclusions or make decisions about the whole population. Typical tools are hypothesis tests and their p-values, for example the T test, Z test, F test and Chi Square test.

3. What do you understand by P value? And what is the use of it in ML?

Ans3. The P value is the probability, assuming the null hypothesis is true, of observing a result at least as extreme as the one we actually got. It is not the probability that the null hypothesis is true. We compare it with the significance level (alpha), which is fixed before the experiment and is often chosen with a domain expert. Suppose the significance level is 0.05. For a two-sided test on normally distributed data, if the test statistic falls inside the central 95% of the distribution we fail to reject the null hypothesis; if it falls in the tail regions outside that 95% interval, the P value is below 0.05 and we reject the null hypothesis.

The P value can be used for feature selection. For example, if we fit a statistical model on our dataset and a feature has a P value greater than 0.05, there is no significant evidence that it is related to the target, so we can consider removing that feature from our dataset.
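To make the comparison between a P value and the significance level concrete, here is a minimal sketch for a two-sided Z test. The function name `two_sided_p_value`, the test statistic 2.5 and the 0.05 threshold are illustrative assumptions, not values from a real project.

```python
from statistics import NormalDist

def two_sided_p_value(z):
    """Two-sided p-value for a z statistic under a standard normal null."""
    # Probability of a value at least as extreme as |z| in either tail.
    return 2 * (1 - NormalDist().cdf(abs(z)))

alpha = 0.05                 # assumed significance level
p = two_sided_p_value(2.5)   # hypothetical test statistic
reject_null = p < alpha      # reject only when p is below alpha
print(f"p-value = {p:.4f}, reject null: {reject_null}")
```

A statistic near 1.96 sits right at the edge of the central 95% region, which is why its two-sided P value is close to 0.05.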

4. Which type of error is more severe, Type 1 or Type 2? Explain why, with an example.

Ans4. Let us understand these errors with an example. Suppose there is a court case against a person. The null hypothesis says the person is innocent and the alternate hypothesis says the person is guilty. If the person is actually innocent but is convicted, the null hypothesis has been rejected even though it is true. This is a Type 1 error (False Positive). On the other hand, if the person is actually guilty but there is not enough evidence to convict, the null hypothesis is not rejected even though it is false. This is a Type 2 error (False Negative). Which error is more severe depends on the cost of each mistake in the problem at hand; in this example, convicting an innocent person (Type 1) is usually considered the more severe error.

10. When you are calculating Std Deviation or Variance, why do you use N-1 in the denominator? (Hint: Bessel's correction)

Ans10. Here comes the concept of biased and unbiased estimators. The sample mean is calculated from the same data points, so the data points are, on average, closer to the sample mean than to the true population mean. Hence if we divide the sum of squared deviations by N, we tend to underestimate the population variance, and N gives a biased estimate. Dividing by N-1 instead makes the variance slightly larger and corrects this bias; this is known as Bessel's correction. We use exactly N-1, and not N-2 or N-3, because one degree of freedom is already used up in estimating the mean: once the sample mean is fixed, only N-1 of the deviations are free to vary. With N-1 the expected value of the sample variance equals the population variance.
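The difference between the two denominators can be seen on a small made-up dataset. This is only a sketch: the numbers in `data` are hypothetical, and the two functions come from Python's standard `statistics` module.

```python
import statistics

# Hypothetical sample of 8 observations (mean is 5).
data = [2, 4, 4, 4, 5, 5, 7, 9]

biased = statistics.pvariance(data)    # divides the squared deviations by N
unbiased = statistics.variance(data)   # divides by N-1 (Bessel's correction)

print(f"divide by N:   {biased:.4f}")
print(f"divide by N-1: {unbiased:.4f}")
```

The N-1 version is always the larger of the two, and the gap shrinks as the sample size grows.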

11. What do you understand by right skewness? Give an example.

Ans11. Right skewness is also known as positive skewness. In terms of the distribution of data on a graph (histogram, KDE), if the graph shows a long tail on the right-hand side, the data is positively or right skewed. Example: wealth distribution. Most people in the world or in a country fall in the same range of wealth, whereas a few people like Mukesh Ambani, Bill Gates and Elon Musk hold far more than the typical person, so their values lie far away from the region where most people sit. These few extreme points on the right typically pull the mean above the median and stretch the right tail, and such a distribution is known as right skewed.

8. Give me a scenario where you can use the Z test and the T test.

Ans8. * If the sample size is greater than or equal to 30, we can use the Z test; otherwise we can use the T test.

* If the population standard deviation is not given and only the sample S.D. is given, we will use the T test; otherwise we will use the Z test.
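As a rough illustration of the two cases above, the sketch below computes a one-sample Z statistic (population S.D. known, n = 36) and a one-sample T statistic (only a small sample available). All numbers and the helper names `z_statistic` and `t_statistic` are hypothetical.

```python
import math
import statistics

def z_statistic(sample_mean, pop_mean, pop_sd, n):
    """One-sample z statistic: uses the known population S.D."""
    return (sample_mean - pop_mean) / (pop_sd / math.sqrt(n))

def t_statistic(sample, pop_mean):
    """One-sample t statistic: uses the S.D. estimated from the sample."""
    n = len(sample)
    sample_sd = statistics.stdev(sample)  # N-1 in the denominator
    return (statistics.mean(sample) - pop_mean) / (sample_sd / math.sqrt(n))

# Scenario 1: large sample, population S.D. known -> Z test.
z = z_statistic(sample_mean=52, pop_mean=50, pop_sd=6, n=36)

# Scenario 2: small sample, population S.D. unknown -> T test.
t = t_statistic([48, 52, 55, 51, 49], pop_mean=50)

print(f"z = {z:.3f}, t = {t:.3f}")
```

The Z statistic is compared against the standard normal distribution, while the T statistic is compared against a t distribution with n-1 degrees of freedom.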

6. Can we use Chi square with a numerical dataset? If yes, give an example. If no, give a reason.

Ans6. The Chi square test of independence should only be performed on two categorical variables. The categories may be stored as numbers or as strings: if they are numeric codes, well and good; if they are strings, we can encode them as discrete values. In both cases the test itself works on the counts of observations in each combination of categories (a contingency table), so a continuous numerical feature would first have to be binned into categories. Each categorical feature should also have two or more categories. Eg - we have 2 categorical variables, sex and smoker, and we want to know whether there is any relationship between them. We first encode the object values as discrete values (or cross-tabulate them directly) and then perform the Chi square test.

5. Where can we use the Chi square test, and have you used this test anywhere in your application?

Ans5. The Chi square test is used to determine whether there is a relationship between categorical variables. To perform it we should have 2 or more categorical variables, each variable should have 2 or more categories, and the observations should be independent, i.e. not paired in any way. Yes, I have used it in my project "Insurance Premium Prediction" to find out the relationship between the sex and smoker features. If the P value is < 0.05, there is a statistically significant relationship between sex and smoker; otherwise we cannot conclude that they are related.
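A minimal sketch of the Chi square test of independence on a 2x2 table follows. The counts in `table` are invented for illustration and are not taken from the project mentioned above; the function name `chi_square_2x2` is also hypothetical. No continuity correction is applied, and the P value shortcut used here is only valid for 1 degree of freedom, which is what a 2x2 table has.

```python
import math

def chi_square_2x2(table):
    """Chi-square statistic and p-value for a 2x2 contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
    # For 1 degree of freedom, P(chi2_1 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical counts: rows = sex, columns = smoker (yes, no).
table = [[30, 20],
         [20, 30]]
chi2, p_value = chi_square_2x2(table)
print(f"chi2 = {chi2:.3f}, p-value = {p_value:.4f}")
```

For larger tables the statistic is computed the same way, but the P value has to come from a chi-square distribution with (rows-1) x (columns-1) degrees of freedom, for which a statistics library is usually the practical choice.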

7. What do you understand by ANOVA testing?

Ans7. If we want to compare two or more groups at the same time, we use ANOVA testing. It tells us whether the groups have the same mean or not, based on the F-score (the ratio of the variation between groups to the variation within groups). For example, if we want to test whether the petal width of a flower differs by a categorical variable such as species, we have to compare the mean of each level or group. We could run a separate T test for each pair of groups, but conducting so many tests increases the chance of false positives. In that case we can use the ANOVA test, which compares all the groups in a single test.
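The F-score can be computed by hand for a one-way ANOVA. The sketch below assumes three small made-up groups (they stand in for something like petal width per species) and a hypothetical helper `one_way_anova_f`.

```python
from statistics import mean

def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA over a list of groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k = len(groups)        # number of groups
    n = len(all_values)    # total number of observations
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical measurements for three groups.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_score = one_way_anova_f(groups)
print(f"F = {f_score:.3f}")
```

A large F-score means the group means differ by more than the spread inside each group would explain; the P value is then read from an F distribution with (k-1, n-k) degrees of freedom.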

12. What is the difference between Normal Distribution, Std Normal Distribution and Uniform Distribution?

Ans12. * Normal Distribution - 1. The density of the data is highest at the mean (the center) and lowest at the ends along the x-axis. It is a symmetrical distribution: the right-hand side of the curve is a mirror image of the left-hand side.

2. The mean, median and mode are equal, and the mean and SD can take any value, unlike the standard normal distribution where they are fixed.

* Uniform Distribution - The probability density is constant along the x-axis over the range of the data, so every value in the range is equally likely.

* Standard Normal Distribution - 1. It is a normal distribution, symmetrical about the mean, whose mean is always 0 and whose standard deviation is always 1.
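The link between a normal distribution with arbitrary mean and SD and the standard normal distribution is the z-score, z = (x - mean) / SD. The sketch below applies it to a small hypothetical dataset; `standardize` is an illustrative helper name.

```python
import statistics

def standardize(values):
    """Convert values to z-scores: subtract the mean, divide by the S.D."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical data with mean 5 and population S.D. 2.
z_scores = standardize([2, 4, 4, 4, 5, 5, 7, 9])
print(z_scores)
```

After standardizing, the values have mean 0 and SD 1. Note that this only rescales the data; it does not turn a non-normal dataset into a normal one.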

2. What kind of statistical tests have you performed in your ML application?

Ans2. I have performed a T-test as part of statistical modeling on my dataset, as a feature selection technique. This is done to get more insight into the relationship between 2 variables by performing a hypothesis test and looking at the final P value: if the P value is < 0.05 we reject the null hypothesis of no relationship and treat the relationship as statistically significant, otherwise we find no evidence of a relationship between the 2 variables.

1. Where have you used Hypothesis Testing in your Machine Learning solution?

Ans1. We have used hypothesis testing as one of our feature selection techniques. Depending on the experiment, the P value we get tells us whether the null hypothesis should be rejected or not, i.e. whether the 2 features have any sort of relationship between them or not.

13. What different kinds of probability distributions have you heard of?

Ans13. 1. Binomial Distribution

2. Poisson Distribution

3. Pareto Distribution

4. Log Normal Distribution

14. What do you understand by a symmetric dataset?

Ans14. In terms of the distribution of the dataset, we can say that a dataset is symmetrical if the distribution on both sides of the center is the same, or equivalently that the mean, median and mode are equal or very close to each other. Example - age, weight distribution.

15. In your last project, were you using symmetric data or asymmetric data? If it was asymmetric, what kind of EDA did you perform?

Ans15. In my last project, when I explored the data, some of the features were symmetric but some were not. So I applied robust scaling to reduce the influence of outliers in the dataset and then applied feature scaling to bring the variables into the same range.

16. Can you please tell me the formula for skewness?

Ans16. One common formula for skewness (Pearson's second coefficient of skewness) is:

       -> (3*(mean-median))/standard deviation

When this skewness is zero, the mean is the same as the median, as is the case for a symmetric distribution such as the normal distribution.
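A short sketch of the formula in code, using two made-up datasets: one symmetric and one with a large value on the right. The helper name `pearson_median_skewness` is hypothetical, and the sample standard deviation (N-1 denominator) is an assumption, since the formula above does not say which one to use.

```python
import statistics

def pearson_median_skewness(values):
    """Skewness as 3 * (mean - median) / standard deviation."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    sd = statistics.stdev(values)  # assumption: sample S.D.
    return 3 * (mean - median) / sd

symmetric = [1, 2, 3, 4, 5]       # mean equals median
right_skewed = [1, 2, 3, 4, 10]   # one large value pulls the mean up

print(pearson_median_skewness(symmetric))
print(pearson_median_skewness(right_skewed))
```

The symmetric data gives zero, while the data with the long right tail gives a positive value.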

18. What do you understand by statistical analysis of data? Give me a scenario where you have used statistical analysis in your last projects.

Ans 18. As we know, statistics is of two types: Descriptive Statistics and Inferential Statistics. In descriptive analysis we understand the data by analyzing, exploring and visualizing it with the help of different graphs like histograms, bar charts, scatter plots etc. In inferential statistics, on the other hand, we take sample data from the population, perform some experiments on the sample data such as hypothesis testing, ANOVA test, Z-test, T-test etc., and based on the results make a decision or conclusion about the whole population. In my last project I used descriptive statistics: I drew different plots to understand the nature of the features and the data, the relationships among the input features, and the relationships between the input features and the target variable. Based on this analysis we performed feature engineering and feature selection.

19. Can you please tell me the criteria to apply the binomial distribution, with an example?

Ans19. Criteria to apply a binomial experiment:

eg - In a recent survey it was found that 85% of households in the United States have a high speed internet connection. If we take a sample of 18 households, what is the probability that exactly 15 will have a high speed connection?

* The experiment is repeated a fixed number of times (yes - 18 times).

* The trials are independent (yes - whether one household has a high speed internet connection does not depend on whether another household has one).

* Each trial has 2 mutually exclusive outcomes, i.e. a household either has high speed internet or it does not.

* The probability of success is the same for all the trials (yes - 0.85 for every household).
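Since all four criteria hold, the household question can be answered with the binomial formula P(k) = C(n, k) * p^k * (1-p)^(n-k). The sketch below plugs in n = 18, k = 15 and p = 0.85 from the example; the function name `binomial_pmf` is an illustrative choice.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Exactly 15 of 18 households with high speed internet, p = 0.85.
prob_15 = binomial_pmf(15, 18, 0.85)
print(f"P(exactly 15 of 18) = {prob_15:.4f}")
```

The result is roughly 0.24, so about one sample in four would contain exactly 15 such households.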

20. There are 100 people who are taking this particular 30-day Data Science interview preparation course. What is the probability that 10 people will be able to make the transition in 1 week, if 50 people were able to make the transition in 3 weeks? (Hint: Poisson Distribution)

Ans20. The Poisson distribution is a discrete probability distribution that is applied to the number of occurrences of an event over a specified interval. The interval can be time, distance, area, volume or some other unit.

In the given question we are asked for the probability of a number of occurrences of an event within an interval of 1 week.


P(x) = (m^x * e^(-m)) / x!

x = number of occurrences of the event over the interval

m = mean number of occurrences of the event over the interval (m = np, where n = total number and p = probability)


soln - x = number of people that will make the transition = 10; mean = 50 transitions in 3 weeks, so m = 50/3 = 16.67 transitions per week

p(10) = (16.67^10 * 2.718^(-16.67)) / 10!, which is approximately 0.026
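The same calculation can be checked in code. This sketch assumes, as the solution does, that 50 transitions in 3 weeks means an average of 50/3 per week; `poisson_pmf` is a hypothetical helper name.

```python
import math

def poisson_pmf(x, m):
    """Probability of exactly x occurrences when the mean count is m."""
    return m**x * math.exp(-m) / math.factorial(x)

weekly_mean = 50 / 3                 # about 16.67 transitions per week
p_10 = poisson_pmf(10, weekly_mean)  # exactly 10 transitions in 1 week
print(f"P(10 transitions in a week) = {p_10:.4f}")
```

The probability is small because 10 is well below the weekly average of about 16.67.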

21. Let's suppose I have appeared in 3 interviews. What is the probability that I am able to crack at least 1 interview?

Ans21. Assuming each interview is cracked independently with probability 1/3: p = 1-(2/3)^3 = 19/27

using binomial distribution

if we take r = 1,2,3 ----> 3c1.(1/3).(2/3)^2 + 3c2.(1/3)^2.(2/3) + 3c3.(1/3)^3.(2/3)^0 = 19/27 (Answer)
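Both routes to 19/27 can be verified with exact fractions. The success probability of 1/3 per interview is the assumption implied by the (2/3)^3 term in the answer; it is not given in the question.

```python
from fractions import Fraction
from math import comb

p = Fraction(1, 3)   # assumed chance of cracking a single interview
n = 3                # number of interviews

# Route 1: complement of failing all three interviews.
at_least_one = 1 - (1 - p) ** n

# Route 2: add the binomial probabilities for r = 1, 2, 3 successes.
by_summing = sum(comb(n, r) * p**r * (1 - p) ** (n - r) for r in range(1, n + 1))

print(at_least_one, by_summing)
```

The complement route is shorter, which is why it is usually preferred for "at least one" questions.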

22. Explain Gaussian Distribution in your own way.

Ans22. Let's say we have a continuous random variable x that follows a Gaussian distribution with some mean and standard deviation. This type of distribution follows a bell-shaped curve. In the bell curve the center is the mean, and the curve is symmetrical on the left- and right-hand sides of the mean. It is also known as the normal distribution. Here the mean, median and mode are the same.

23. What do you understand by the 1st, 2nd and 3rd Standard Deviation from the Mean?

Ans23. Standard deviation and variance help us specify how far an element lies from the mean. If it is 1 standard deviation away from the mean, its position will be mean + 1 sigma; if it is 2 standard deviations away, it will be at mean + 2 sigma; and similarly for 3 standard deviations it will be at mean + 3 sigma. The same applies on the left side of the mean, where we subtract sigma from the mean instead of adding it. This gives us the ranges of the 1st, 2nd and 3rd standard deviation. For normally distributed data these ranges are described by the empirical rule, which states:

1. The probability that a variable falls within the first standard deviation range is approx. 68%

2. The probability that a variable falls within the second standard deviation range is approx. 95%

3. The probability that a variable falls within the third standard deviation range is approx. 99.7%
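The three percentages can be reproduced from the normal distribution's cumulative distribution function. The sketch below uses `NormalDist` from Python's standard `statistics` module; the helper name `prob_within` is hypothetical.

```python
from statistics import NormalDist

def prob_within(k, mean=0.0, sd=1.0):
    """Probability that a normal variable lies within k S.D. of its mean."""
    dist = NormalDist(mean, sd)
    return dist.cdf(mean + k * sd) - dist.cdf(mean - k * sd)

coverage = {k: prob_within(k) for k in (1, 2, 3)}
for k, prob in coverage.items():
    print(f"within {k} sigma: {prob:.4f}")
```

The result does not depend on the particular mean and SD passed in, which is why the rule applies to any normal distribution.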

24. What do you understand by variance in data in simple words?

Ans24. In simple words, variance describes how far the data points are spread out from the mean or center of the distribution, or how far the observed values are from the expected value. In mathematical terms, variance is the average of the squared differences between the data points and the mean of the distribution.
