EDA : THE HIDDEN SECRET OF DATA

#datascience #visualization #python #pandas #seaborn

ROSHAN KUMAR G Oct 14 2020 · 4 min read

Why exploratory data analysis (EDA) ?

Exploratory data analysis (EDA) is an approach to analyzing data that gives a data enthusiast a bird's-eye view of the overall structure of a dataset. Data science often involves advanced statistical and machine learning techniques, yet the power of EDA is frequently underestimated. In statistics, EDA is an approach to analyzing datasets to summarize their main characteristics, often with visual methods. EDA can tell us which statistical techniques or models are appropriate for the data.

EDA also plays an important role in feature engineering. With a good understanding of the features in the dataset, we can create more meaningful features.

Main purpose of EDA

  • Check for missing values or irrelevant characters in the data
  • Detect anomalies/outliers in the data
  • Spot incorrect headers of features
  • Understand each and every data point through various analysis techniques
  • Analyze the relationships between the variables

General steps followed

  • Missing value treatment
  • Univariate Analysis
  • Bivariate Analysis
  • Multivariate Analysis
  • Outlier treatment
  • Correlation Analysis
  • Dimensionality Reduction

Library Imports

    import pandas as pd 
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns 
    import warnings
    warnings.filterwarnings('ignore')

    Importing Dataset:

    data = pd.read_csv("train_yaOffsB(1).csv")

    data.shape 

    Here we observe that the dataset has shape (88858, 10): 88858 data points and 10 features.

    pd.concat([data.head(3), data.tail(3)])  # first and last three rows; DataFrame.append is deprecated

    data['ID'].nunique() 
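Checking `nunique()` against the row count tells us whether ID is a unique key. A minimal sketch, using a small toy frame since the article's CSV isn't bundled here:

```python
import pandas as pd

# Toy frame standing in for the article's data: if ID is a unique key,
# nunique() equals the number of rows
toy = pd.DataFrame({'ID': [1, 2, 3, 3], 'value': [10, 20, 30, 40]})

is_unique_key = toy['ID'].nunique() == len(toy)          # False here: ID 3 repeats
duplicated_ids = toy[toy['ID'].duplicated(keep=False)]   # both rows sharing ID 3
```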

    Missing value analysis

    import missingno as msno
    print(data.isnull().sum())

    p = msno.bar(data, figsize = (9,6))

    From the plot we observe that there are roughly 9000 missing values in the Number_Weeks_Used column.

    data.info()

    The info() function gives the data types of the variables along with the count of non-null values in each column.

    data['Number_Weeks_Used'].fillna(method = 'ffill', inplace = True)

    data['Number_Weeks_Used'] = data['Number_Weeks_Used'].astype('int64')

    Here I have used forward fill to impute the missing values just for simplicity; you could use any method such as mean, median, or mode imputation, or simply drop the missing rows.
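The alternatives mentioned above can be sketched on a small synthetic series (a stand-in for Number_Weeks_Used, since the article's CSV isn't included):

```python
import pandas as pd
import numpy as np

# Synthetic stand-in for Number_Weeks_Used
s = pd.Series([20.0, 25.0, np.nan, 30.0, np.nan, 20.0])

mean_filled = s.fillna(s.mean())      # mean imputation
median_filled = s.fillna(s.median())  # median imputation
mode_filled = s.fillna(s.mode()[0])   # mode imputation
ffilled = s.ffill()                   # forward fill, as used above
dropped = s.dropna()                  # or simply drop the missing rows
```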

    Summary of Data

    col = data.columns.tolist()
    col.remove('ID')
    data[col].describe(percentiles = [.25,.5,.75,.95,.97,.99])  
    Pandas' describe() function provides a statistical summary of the data, such as count, mean, standard deviation, min, and max. By also passing higher percentiles, we can get an idea of the outliers in the data.
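One common way to turn those percentiles into a concrete outlier rule is the IQR fence. A minimal sketch on hypothetical counts (not the article's data):

```python
import pandas as pd

# Hypothetical counts; the IQR rule flags points outside
# [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
counts = pd.Series([10, 12, 11, 13, 12, 14, 11, 95])

q1, q3 = counts.quantile(0.25), counts.quantile(0.75)
iqr = q3 - q1
outliers = counts[(counts < q1 - 1.5 * iqr) | (counts > q3 + 1.5 * iqr)]
```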

    Filtering data based on condition

    data[(data['Season'] == 1) & (data['Crop_Damage'] == 1) & (data['Soil_Type'] == 0)].head()
     
    pd.DataFrame(data.groupby(['Crop_Damage','Crop_Type'])['Pesticide_Use_Category'].count())

    pd.DataFrame(data.groupby(['Crop_Damage','Season','Crop_Type'])['Estimated_Insects_Count'].count())
    sub = data[data['Crop_Damage'] == 1].select_dtypes('number')
    df = sub.agg(['mean', 'var', 'std', 'median']).T
    df.columns = ['Values', 'Variance', 'Standard deviation', 'Median']
    df

    Graphical analysis

    plt.subplot(1,2,1)
    sns.countplot(x = 'Crop_Damage', palette = 'cool', data = data)
    plt.title("Count plot of Crop damage (target variable)")

    plt.subplot(1,2,2)
    count = data['Crop_Damage'].value_counts()
    count.plot.pie(autopct = '%1.1f%%', colors = ['green','orange','blue'], figsize = (10,7), explode = [0,0.1,0.1], title = "Pie chart of Percentage of Crop_Damage")

    From the count plot and pie chart we can infer that the crop-alive category has far more data points than the other two categories. Since this is a multi-class classification problem, this is a clear case of multi-class imbalance.
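The imbalance can be quantified directly with `value_counts(normalize=True)`. A small sketch with illustrative labels (the proportions below are made up, not the article's actual counts):

```python
import pandas as pd

# Illustrative Crop_Damage labels (0 = alive, 1/2 = damaged);
# normalized counts expose the class imbalance as proportions
labels = pd.Series([0] * 70 + [1] * 20 + [2] * 10)
proportions = labels.value_counts(normalize=True)
```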

    plt.figure(figsize = (10,6))
    plt.subplot(1,2,1)
    sns.countplot(x = 'Crop_Type' , palette= 'cool', data= data) 
    plt.title("Count plot of Crop_Type")

    plt.subplot(1,2,2)
    sns.countplot(data['Crop_Type'], hue = data['Crop_Damage'],palette="rocket_r")
    plt.title("Plot of crop damage Vs Crop type")

    Inference

    * Crop type 0 has more data points than crop type 1

    * More than 50000 crops of crop type 0 and about 20000 crops of crop type 1 are alive

    * There is more damage to crop type 0 due to pesticides

    plt.figure(figsize = (15,5))
    sns.countplot(data['Number_Weeks_Used'], palette = 'hsv')
    plt.title('Count of Number_Weeks_Used')
    plt.show() 
    sns.countplot(data['Number_Doses_Week'], palette = 'hsv')
    plt.title('Count of Number_Doses_Week')
    plt.show() 

    Inference

    * From the above plot we can conclude that week 20 and week 30 have the largest proportions

    * In the number of doses per week, we observe that dose 20 has the greatest proportion

    sns.distplot(data['Estimated_Insects_Count'], kde = True, hist = True, rug= False, bins= 30)
    plt.title("Density plot of Estimated_Insects_Count")

    plt.figure(figsize = (10,5))
    plt.subplot(1,2,1)
    sns.countplot(data['Season'], palette = 'hsv')
    plt.title('Count plot of Season')
    plt.subplot(1,2,2)
    sns.countplot(data['Season'], hue = data['Crop_Damage'], palette = 'hsv')
    plt.title('Count plot of Crop_Damage in Seasons')
    plt.show() 

    Inference

    * From the density plot we observe that Estimated_Insects_Count is right skewed

    * The count plot of crop damage across seasons shows that crop damage is highest in season 1
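Right skew can also be measured numerically with `Series.skew()`, and a log transform often pulls in the long right tail. A sketch on synthetic skewed data (a stand-in for Estimated_Insects_Count):

```python
import pandas as pd
import numpy as np

# Synthetic right-skewed counts (exponential draws, fixed seed)
counts = pd.Series(np.random.default_rng(0).exponential(scale=100, size=2000))

raw_skew = counts.skew()             # positive for a right-skewed distribution
log_skew = np.log1p(counts).skew()   # log1p compresses the long right tail
```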

      

    sns.countplot(data['Season'], hue = data['Crop_Type'])
    plt.title('Count plot of Crop_type in Seasons')

    sns.countplot(data['Pesticide_Use_Category'], palette = 'dark')
    plt.title("Count plot of Pesticide_Use_Category")
    plt.show()
    sns.catplot(x = 'Pesticide_Use_Category', y = 'Estimated_Insects_Count', kind = 'box', data = data, hue = 'Crop_Damage', palette= 'rocket_r')
    plt.title("Box plot of Pesticide_Use_Category")

    Information included in Box plot

    * Minimum

    * First Quartile

    * Median (Second Quartile)

    * Third Quartile

    * Maximum

    * Idea about outliers in data

    data[col].hist(figsize=(10,15),color = 'green')

    These are some of the basic analyses performed on data in the first phase. In addition, we can carry out correlation analysis. In our case, most of the variables are multilevel categorical variables, so we cannot use Pearson's correlation; association can instead be assessed with statistical tests such as ANOVA.
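A one-way ANOVA, for example, tests whether a numeric variable differs across the levels of a categorical one. A minimal sketch with made-up groups (hypothetical insect counts per Pesticide_Use_Category level, not the article's data):

```python
import pandas as pd
from scipy import stats

# Hypothetical insect counts for three pesticide-use categories
df = pd.DataFrame({
    'Pesticide_Use_Category': [1] * 5 + [2] * 5 + [3] * 5,
    'Estimated_Insects_Count': [10, 12, 11, 13, 12,
                                20, 22, 21, 23, 22,
                                30, 32, 31, 33, 32],
})

# One array of counts per category level, then one-way ANOVA across them
groups = [g['Estimated_Insects_Count'].values
          for _, g in df.groupby('Pesticide_Use_Category')]
f_stat, p_value = stats.f_oneway(*groups)
```

A small p-value suggests the mean insect count differs across pesticide-use categories.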

    Hope you find this article helpful.

    For full code visit:

    https://github.com/roshankumarg529/Hackathon/blob/master/Analytics%20vidya/Machine_Learning_in_Agriculture_EDA(1).ipynb
