XGBOOST

#datascience #machinelearning #ml #xgboost #boosting

Kriti Sinha Feb 22 2021 · 3 min read
  • Preparing data and training an XGBoost model on a standard machine learning dataset.
  • Mathematical logic, with a worked example.
  • XGBOOST installation.
  • Data visualization with matplotlib.pyplot.
  • Making predictions and evaluating the performance of a trained XGBoost model using scikit-learn.
  • XGBOOST - an open-source software library that provides a gradient boosting framework for C++, Java, Python, R, Julia, Perl, and Scala. It implements extreme gradient boosted decision trees, designed for speed and performance in machine learning. It was first introduced in 2014 by Tianqi Chen in C++ and now has interfaces for Python, R, and Julia. Gradient boosting is typically used with trees that have between 8 and 32 terminal nodes (leaves).

    Gradient Boosted Decision Tree: a machine learning technique that uses an ensemble of decision trees to predict a target label.

    The two main reasons to use XGBoost are also the two goals of the project:

    Execution speed: it is fast compared to other implementations of gradient boosting.

    Model performance.

    In simple words, it predicts the result in small parts, building up the complete prediction step by step and tracking the MSE after each step. Here is a worked example.

    Consider a small sample data table with one input column X and one target column Y (ten rows; the Y values are 82, 80, 103, 118, 172, 127, 204, 189, 99, 166).

    As the first step, take the mean of the x and y data.

    The mean of the y data becomes the f0 value, the initial prediction for every row:

    Mean of y = (82+80+103+118+172+127+204+189+99+166)/10

    = 1340/10 = 134

    So, f0 = 134 for each row.

    (Table with the f0 column added to each row.)

    Now calculate the error (residual) for each row as y - f0.

    Build the first weak learner h1(x) as a one-split tree (a stump) at the mean of x: one output for rows with x <= mean(x) and another for rows with x > mean(x), each output being the mean of the residuals on that side.

    The mean of x is 23, so for x <= 23, h1(x) = (-52 - 54 - 31 - 16)/4 = -38.25, and

    for x > 23, h1(x) = (38 - 7 + 70 + 55 - 35 + 32)/6 = 25.5.

    Then calculate f1 = f0 + h1(x), and get the error after this iteration as y - f1.

    Repeat this iteration a few more times to reduce the error, calculating h2(x), f2, y - f2,

    then h3(x), f3, and y - f3. In general, each step computes fk = fk-1 + hk(x).

    Now calculate the MSE of each error column: y - f1, y - f2, and y - f3.

    To get MSE(y - f1), square each row of y - f1, add up the squared values, and divide by 10 (the number of rows), i.e. MSE(y - f1) = (1/10) * sum((yi - f1(xi))^2). Do the same for MSE(y - f2) and MSE(y - f3).

    Comparing these MSE values makes it clearly visible how each iteration's weak classifier reduces the error; a short code sketch of the same iterations follows.
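    To make the procedure concrete, here is a minimal sketch of these boosting iterations in plain Python/NumPy. The y values come from the worked example above, but the x values are hypothetical stand-ins chosen only so that mean(x) = 23 and the first four rows fall on the x <= 23 side (the original x values appeared in an image). Also, instead of always splitting at mean(x), the stump below picks the best split on each round, which is what a real tree learner does:

    import numpy as np

    # y values from the worked example; these x values are hypothetical
    # stand-ins chosen so that mean(x) = 23 and the first four rows have x <= 23
    x = np.array([10, 14, 18, 22, 24, 26, 28, 30, 32, 26], dtype=float)
    y = np.array([82, 80, 103, 118, 172, 127, 204, 189, 99, 166], dtype=float)

    def fit_stump(x, residual):
        # depth-1 regression tree: try every threshold, keep the lowest-SSE one
        best = None
        for t in np.unique(x):
            left = x <= t
            if left.all():
                continue  # degenerate split: every row on one side
            pred = np.where(left, residual[left].mean(), residual[~left].mean())
            sse = ((residual - pred) ** 2).sum()
            if best is None or sse < best[0]:
                best = (sse, t, residual[left].mean(), residual[~left].mean())
        _, t, left_value, right_value = best
        return lambda x: np.where(x <= t, left_value, right_value)

    f = np.full_like(y, y.mean())   # f0 = mean(y) = 134 for every row
    for k in range(1, 4):           # three boosting rounds: h1, h2, h3
        h = fit_stump(x, y - f)     # fit the weak learner to the residuals y - f
        f = f + h(x)                # f_k = f_{k-1} + h_k(x)
        print("iteration %d: MSE = %.2f" % (k, np.mean((y - f) ** 2)))

    Printing the MSE after each round shows the same pattern as the table: the error shrinks as each weak learner corrects the residuals left by the previous ones.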

    This complete process of boosting is faster than bagging.

    In the bagging ensemble technique, the prediction is calculated from multiple classifiers trained independently on different sub-samples of the same dataset, whereas boosting trains its learners sequentially, each one correcting the errors of the previous ones. A minimal comparison sketch is shown below.
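    As an illustration only (the synthetic dataset, the estimator counts, and the default 5-fold cross-validation here are arbitrary choices, not from the original), the two ensemble styles can be compared side by side with scikit-learn and XGBoost:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import BaggingClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier
    from xgboost import XGBClassifier

    # a synthetic binary classification problem, just for the comparison
    X, y = make_classification(n_samples=1000, n_features=8, random_state=7)

    # bagging: 100 trees trained independently on bootstrap sub-samples
    bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100)

    # boosting: 100 trees trained sequentially, each correcting its predecessors
    boosting = XGBClassifier(n_estimators=100)

    print("bagging accuracy :", cross_val_score(bagging, X, y).mean())
    print("boosting accuracy:", cross_val_score(boosting, X, y).mean())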

    How to install XGBOOST?

    If it is not installed already, you can install XGBoost with

    pip3 install xgboost

    To use its classifier, import it with

    from xgboost import XGBClassifier

    or, to use the scikit-learn compatible wrapper:

    import sklearn

    from xgboost.sklearn import XGBClassifier

    Let's see a simple use of XGBOOST by solving the problem statement below.

    Python Implementation

    Problem Statement: The Pima Indians Diabetes Dataset involves predicting the onset of diabetes within 5 years in Pima Indians given medical details. It is a binary (2-class) classification problem. The number of observations for each class is not balanced. There are 768 observations with 8 input variables and 1 output variable. Missing values are believed to be encoded with zero values. The variable names are as follows:

  • Number of times pregnant.
  • Plasma glucose concentration 2 hours in an oral glucose tolerance test.
  • Diastolic blood pressure (mm Hg).
  • Triceps skinfold thickness (mm).
  • 2-Hour serum insulin (mu U/ml).
  • Body mass index (weight in kg/(height in m)^2).
  • Diabetes pedigree function.
  • Age (years).
  • Is Diabetic (0 or 1)
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import xgboost as xgb
    from xgboost import XGBClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    # reading data from csv
    data = pd.read_csv("pima-indians-diabetes.csv")
    data.head()
    data.columns

    # data visualization with matplotlib.pyplot
    data.hist(figsize=(20, 10))
    plt.show()

    x = data.drop(labels='Is Diabetic', axis=1)
    y = data['Is Diabetic']

    # checking for missing values
    data.isna().sum()

    # train/test split: 33% of the rows held out for testing
    X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=7)

    # fit model on training data
    model = XGBClassifier()
    model.fit(X_train, y_train)

    # get predictions for test data
    y_pred = model.predict(X_test)
    predictions = [round(value) for value in y_pred]

    # evaluate predictions
    accuracy = accuracy_score(y_test, predictions)
    print("Accuracy: %.2f%%" % (accuracy * 100.0))

    Accuracy: 74.02%
