Introduction to Statistics


Gaurav Kumar Feb 01 2021 · 1 min read
Statistics is a field of mathematics that is widely regarded as a prerequisite for a deeper understanding of machine learning. Although statistics is a large field with many esoteric theories and findings, machine learning practitioners mainly need its nuts-and-bolts tools and notation. With a solid foundation in what statistics is, it is possible to focus on just the relevant parts.

What is Statistics?

Statistics is the science of collecting, organizing, presenting, analyzing, and interpreting data in order to gain insight from it. It is a collection of tools that you can use to answer important questions about data. For example, you can use descriptive statistical methods to transform raw observations into information that you can understand and share.
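
As a rough illustration of how descriptive statistics turn raw observations into shareable information, here is a minimal sketch using Python's standard statistics module. The exam scores and the variable names (scores, summary) are made up for this example and are not from any real dataset.

```python
import statistics

# Hypothetical raw observations: exam scores for ten students.
scores = [72, 85, 90, 66, 78, 85, 94, 70, 88, 82]

# Summarize the raw numbers with a few descriptive statistics.
summary = {
    "count": len(scores),
    "mean": statistics.mean(scores),      # arithmetic average
    "median": statistics.median(scores),  # middle value of the sorted data
    "mode": statistics.mode(scores),      # most frequent value
    "stdev": statistics.stdev(scores),    # sample standard deviation
    "min": min(scores),
    "max": max(scores),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

A handful of numbers such as the mean, median, and standard deviation is usually easier to communicate than the full list of observations.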

Why is Statistics Important to Machine Learning?  

  • Problem Framing: Requires the use of exploratory data analysis and data mining.
  • Data Understanding: Requires the use of summary statistics and data visualization.
  • Data Cleaning: Requires the use of outlier detection, imputation, and more.
  • Data Selection: Requires the use of data sampling and feature selection methods.
  • Data Preparation: Requires the use of data transforms, scaling, encoding, and much more.
  • Model Evaluation: Requires experimental design and resampling methods.
  • Model Configuration: Requires the use of statistical hypothesis tests and estimation statistics.
  • Model Selection: Requires the use of statistical hypothesis tests and estimation statistics.
  • Model Presentation: Requires the use of estimation statistics such as confidence intervals.
  • Model Predictions: Requires the use of estimation statistics such as prediction intervals.
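
To make one of these steps concrete, the following is a minimal sketch of the Data Cleaning item, combining median imputation with outlier detection. The sensor readings are hypothetical, and the 1.5 × IQR rule is just one common convention for flagging outliers, chosen here as an assumption rather than the only valid method.

```python
import statistics

# Hypothetical sensor readings; None marks a missing value.
readings = [10.1, 9.8, 10.3, None, 10.0, 55.0, 9.9, 10.2]

# Imputation: replace each missing value with the median of the observed values.
observed = [x for x in readings if x is not None]
median = statistics.median(observed)
imputed = [median if x is None else x for x in readings]

# Outlier detection: flag values more than 1.5 * IQR beyond the quartiles.
q1, _, q3 = statistics.quantiles(imputed, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in imputed if x < low or x > high]

print("imputed:", imputed)
print("outliers:", outliers)
```

The median is used for imputation because, unlike the mean, it is not pulled toward extreme values such as the 55.0 reading.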

Types of data:

  • Quantitative data
  • Categorical data

Quantitative data:

When you collect quantitative data, the numbers you record represent real amounts that can be added, subtracted, divided, etc. There are two types of quantitative variables: discrete (counts of individual items or values) and continuous (measurements that can take any value within a range).

Categorical data:

Categorical variables represent groupings of some kind. They are sometimes recorded as numbers, but the numbers represent categories rather than actual amounts of things.

There are three types of categorical variables: binary (yes/no outcomes), nominal (groups with no rank or order between them), and ordinal (groups that are ranked in a specific order).
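
The three categorical types are often turned into numbers in different ways before modeling. The sketch below shows one possible encoding for each type. The survey records, field names, and the size ranking are hypothetical, introduced only for illustration.

```python
# Hypothetical survey records, one dict per respondent.
records = [
    {"subscribed": "yes", "color": "red", "size": "small"},
    {"subscribed": "no", "color": "blue", "size": "large"},
    {"subscribed": "yes", "color": "green", "size": "medium"},
]

# Binary: two possible outcomes map naturally to 1 and 0.
binary = [1 if r["subscribed"] == "yes" else 0 for r in records]

# Nominal: categories have no order, so use one indicator column per category
# (one-hot encoding) instead of a single number that would imply a ranking.
colors = sorted({r["color"] for r in records})
one_hot = [[1 if r["color"] == c else 0 for c in colors] for r in records]

# Ordinal: categories are ranked, so integers that preserve the order are reasonable.
size_rank = {"small": 0, "medium": 1, "large": 2}  # assumed ranking
ordinal = [size_rank[r["size"]] for r in records]

print("binary:", binary)
print("colors:", colors)
print("one_hot:", one_hot)
print("ordinal:", ordinal)
```

In each case the resulting numbers are labels for categories, not amounts, so arithmetic such as averaging the encoded colors would not be meaningful.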
