Statistics: Overview

#miniview #ai #machinelearning #deeplearning #datascience

Sasi Reddyvari Dec 06 2020 · 1 min read

Stats: To interpret data correctly

The basic concepts are:

  • Random Variables: In programming, a variable stores a value in memory so it can be used whenever needed. In stats, a random variable is a variable whose value is the outcome of an experiment. The two types are:
      1) Numerical: Variables which contain numeric data. These are further classified as:
          a) Discrete Random Variable: Takes countable values, usually whole numbers. Ex: Age in years (1, 2, 3)
          b) Continuous Random Variable: Takes any value within a range. Ex: Height (10.56)
      2) Categorical: Variables which contain categories as data. Ex: Yes/No, Low/Medium/High
  • Nominal vs Ordinal Data: Nominal data is simply named or labelled, with no order among the categories, whereas ordinal data has categories that follow a specific order. Nominal data can only be classified, whereas ordinal data can be classified and ordered. Ex: Nominal: Gender, State, Blood group. Ordinal: Rank, Salary grade based on education
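  • Population Mean vs Sample Mean: A sample is a subset of the population. The population mean (μ) is the average over every member of the population, whereas the sample mean (x̄) is the average over the sample only and is used to estimate μ. For instance, if all the people in Karnataka are the population, then the IT people in Karnataka are a sample.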
  • Mean, Median, Mode: Measures of central tendency. The mean describes the centre of the data distribution, but it is sensitive to outliers. To handle NaN values and outliers, we fill in the median (numeric data) or the mode (categorical data) instead of dropping the records, i.e. losing the info.
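  • Standard Deviation vs Variance: Both measure how far a set of data is dispersed from its mean or average value. Variance ('σ²') is the average of the squared deviations from the mean. Standard Deviation ('σ') is the square root of the variance, so it is in the same units as the data.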
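  • Covariance vs Correlation: Covariance of X and Y quantifies the relationship between two random variables in the data. Its sign depicts the direction of the relationship, but its size depends on the units of the variables, so it does not tell how strong the relationship is. Correlation is the covariance divided by the two standard deviations, so it depicts both the strength and the direction of the relationship. Cov(X,X) = Var(X)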
    [Figure: Variance vs Covariance vs Correlation]
  • Pearson Correlation Coefficient (r/rho): Ranges from -1 to +1. Best suited to variables that are linearly related only.
    [Figure: Pearson Correlation Coefficient]
  • Spearman's Rank Correlation Coefficient: Similar to Pearson, but computed on the ranks of the variables. A rank is assigned to each data point by sorting the values in ascending order. Even when the relationship is non-linear, we still get a strong correlation value as long as it is monotonic (consistently increasing or consistently decreasing). It captures monotonic dependence between variables.
  • Gaussian Distribution/Normal Distribution: Describes a continuous random variable, X ~ G.D(mu, sigma).
      1) Bell Curve: symmetric about the mean.
      2) Empirical Rule: about 68% of the data lies within 1 sigma of the mean, 95% within 2 sigma and 99.7% within 3 sigma.
      Ex: Petal length in the Iris data set, distribution of height.
    [Figure: Normal Distribution / Gaussian Distribution]
  • Standard Normal Distribution: Mean = 0, σ = 1.
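  • Log Normal Distribution: A right-skewed curve rather than a symmetric bell. X is log-normally distributed when log(X) follows a Gaussian distribution, so applying log to the data points and plotting them gives a bell curve. Ex: Income of the people, length of reviews given for products (short, long, very long). If a random variable is in Log Normal Distribution, its log follows Gaussian and can be converted to SND using z = (log x - mu)/sigma, where mu and sigma are the mean and standard deviation of log x, which gives mean = 0 and sigma = 1. This process is called log normalisation.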
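  • Central Limit Theorem: Suppose X !~ G.D(mu, sigma), i.e. X is not necessarily Gaussian but has mean mu and standard deviation sigma. If we draw many samples x1, x2, x3, ... from X, each of size n, and plot the means of x1, x2, x3, ..., then those sample means approximately follow a Gaussian Distribution with mean ~= mu, as a rule of thumb when n >= 30: x̄ ~ G.D(mu, sigma/√n), i.e. the variance of x̄ is sigma²/n.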
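  • Chebyshev's Inequality: If X ~ G.D(mu, sigma), the empirical rule tells us what percentage lies within 1 sigma, 2 sigma and so on. If Y !~ G.D(mu, sigma), i.e. Y has mean mu and standard deviation sigma but is not Gaussian, we use Chebyshev's Inequality to find at least what percentage lies within k sigma: P(|Y - mu| < k*sigma) >= 1 - 1/k², where k is the number of standard deviations (the bound is only informative for k > 1). Ex: at least 75% lies within 2 sigma and at least 88.9% within 3 sigma.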
  • Normalization vs Standardization (z-score Normalization): Normalization (min-max scaling) is the process of rescaling a feature to the range 0 to 1, whereas standardization is the process of rescaling a feature to mean 0 and standard deviation 1, as in the Standard Normal Distribution. Standardization changes the scale only, not the shape of the distribution. Both are used in Feature Scaling. scikit-learn's MinMaxScaler and StandardScaler are usually used.
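For the Mean, Median, Mode and Standard Deviation vs Variance items, here is a minimal sketch using only Python's built-in `statistics` module. The data (`ages`, `scores`) and the helper `impute_with_median` are made up for illustration, and the population (divide by n) convention is an assumption.

```python
import statistics

# Hypothetical ages; 250 is a deliberate outlier.
ages = [23, 25, 25, 27, 29, 31, 250]

mean_age = statistics.mean(ages)      # pulled upwards by the outlier
median_age = statistics.median(ages)  # middle value of the sorted data -> 27
mode_age = statistics.mode(ages)      # most frequent value -> 25


def impute_with_median(values):
    """Replace missing entries (None) with the median of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in values]


# Population variance and standard deviation (divide by n, not n - 1).
scores = [2, 4, 4, 4, 5, 5, 7, 9]
variance = statistics.pvariance(scores)  # sigma squared
std_dev = statistics.pstdev(scores)      # sigma, the square root of the variance

print(mean_age, median_age, mode_age)
print(impute_with_median([10, None, 30, 20]))
print(variance, std_dev)
```

The outlier drags the mean far above the median, which is why the median is the safer fill value here.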
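For the Variance, Covariance and Correlation items, the usual definitions can be written as below. This assumes the population convention (divide by N) for N paired observations; a sample version would divide by N - 1 instead.

```latex
\sigma_X^2 = \mathrm{Var}(X) = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu_X)^2
\qquad
\mathrm{Cov}(X, Y) = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu_X)(y_i - \mu_Y)
\qquad
\rho_{X,Y} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \, \sigma_Y}
```

Setting Y = X in the covariance gives Cov(X,X) = Var(X), and dividing by the two standard deviations is what keeps the correlation between -1 and +1.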
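For the Covariance, Pearson and Spearman items, a small hand-written sketch may help show the difference. The function names are my own, the population (divide by n) convention is assumed, and the ranking helper assumes there are no tied values.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def covariance(xs, ys):
    """Population covariance: average product of deviations from the means."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def pearson(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


def ranks(xs):
    """1-based ranks after sorting in ascending order (assumes no ties)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    result = [0] * len(xs)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Pearson correlation computed on the ranks instead of the raw values."""
    return pearson(ranks(xs), ranks(ys))


x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]  # monotonic but non-linear relationship

print(covariance(x, x))  # Cov(X, X) equals Var(X)
print(pearson(x, y))     # below 1 because the relationship is not linear
print(spearman(x, y))    # 1.0 because the ordering is preserved exactly
```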
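For the Gaussian empirical rule and Chebyshev's Inequality items, this sketch compares the exact Gaussian coverage with Chebyshev's guaranteed minimum on a skewed sample. The exponential data, the seed and the function names are assumptions chosen for illustration.

```python
import random
import statistics


def gaussian_coverage(k):
    """Share of a Gaussian that lies within k standard deviations of the mean."""
    z = statistics.NormalDist(mu=0.0, sigma=1.0)
    return z.cdf(k) - z.cdf(-k)


def chebyshev_lower_bound(k):
    """Minimum share within k standard deviations for any distribution."""
    return 1.0 - 1.0 / k ** 2


def fraction_within(values, k):
    """Observed share of values within k standard deviations of their mean."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    inside = sum(1 for v in values if abs(v - mu) < k * sigma)
    return inside / len(values)


rng = random.Random(0)  # fixed seed so the run is repeatable
skewed = [rng.expovariate(1.0) for _ in range(10_000)]  # clearly not Gaussian

for k in (1, 2, 3):
    print(k, gaussian_coverage(k), chebyshev_lower_bound(k), fraction_within(skewed, k))
```

The Gaussian column reproduces the 68%, 95%, 99.7% rule, while the Chebyshev column is a weaker guarantee that still holds for the skewed data.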
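For the Central Limit Theorem item, a quick simulation can make the claim concrete. This is only a sketch: the exponential distribution (mean 1, standard deviation 1), the seed and the number of samples are my own choices, not from the article.

```python
import math
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

SAMPLE_SIZE = 30     # n, the rule-of-thumb threshold
NUM_SAMPLES = 2000   # how many samples x1, x2, x3, ... we draw


def draw_sample_mean(n):
    """Mean of n draws from an exponential distribution (mu = 1, sigma = 1)."""
    return statistics.fmean([rng.expovariate(1.0) for _ in range(n)])


sample_means = [draw_sample_mean(SAMPLE_SIZE) for _ in range(NUM_SAMPLES)]

mean_of_means = statistics.fmean(sample_means)  # should be close to mu = 1
sd_of_means = statistics.pstdev(sample_means)   # should be close to sigma / sqrt(n)
expected_sd = 1.0 / math.sqrt(SAMPLE_SIZE)

print(mean_of_means, sd_of_means, expected_sd)
```

Plotting a histogram of `sample_means` would show the bell shape even though the underlying data is strongly right-skewed.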
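For the Log Normal Distribution item, the log normalisation steps can be sketched as follows. The simulated `incomes` and the parameters passed to `lognormvariate` are invented for illustration.

```python
import math
import random
import statistics

rng = random.Random(7)  # fixed seed so the run is repeatable

# Hypothetical incomes: log-normal, so log(income) is Gaussian(mu=10, sigma=0.5).
incomes = [rng.lognormvariate(10.0, 0.5) for _ in range(5000)]

# Step 1: apply log to every data point to get a roughly Gaussian variable.
log_incomes = [math.log(v) for v in incomes]

# Step 2: convert to the standard normal scale with z = (log x - mu) / sigma.
mu = statistics.fmean(log_incomes)
sigma = statistics.pstdev(log_incomes)
z_scores = [(v - mu) / sigma for v in log_incomes]

# Right skew shows up as the mean sitting above the median in the raw data.
print(statistics.fmean(incomes), statistics.median(incomes))
print(statistics.fmean(z_scores), statistics.pstdev(z_scores))
```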
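For the Normalization vs Standardization item, the two formulas can be written by hand as below. The function names are hypothetical, and both assume the feature is not constant. scikit-learn's MinMaxScaler (with its default 0 to 1 range) and StandardScaler apply the same formulas to each column, so in practice you would normally use those.

```python
import statistics


def min_max_scale(values):
    """Normalization: rescale to the range 0 to 1 using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Standardization: z = (x - mean) / sigma, giving mean 0 and sigma 1."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population standard deviation
    return [(v - mu) / sigma for v in values]


feature = [2, 4, 4, 4, 5, 5, 7, 9]

print(min_max_scale([10, 20, 30]))  # [0.0, 0.5, 1.0]
print(standardize(feature))
```

Min-max scaling is driven entirely by the smallest and largest values, so a single outlier squeezes everything else together, whereas standardization is less affected by that.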
Hope you find this article helpful. Happy learning :)

Do support and share my article with people who are looking out for stats. In the next article, I will discuss Feature Engineering.

Check out my GitHub link, and share, support and give a star.
