# XGBOOST

Kriti Sinha Feb 22 2021 · 3 min read
• Preparing data and training an XGBoost model on a standard machine learning dataset.
• Mathematical logic, with a worked example.
• XGBOOST installation.
• Data visualization with matplotlib.pyplot.
• Making predictions and evaluating the performance of a trained XGBoost model using scikit-learn.
XGBOOST is an open-source software library that provides a gradient boosting framework for C++, Java, Python, R, Julia, Perl, and Scala. It implements extreme gradient boosted decision trees, designed for speed and better performance in machine learning. It was first released in 2014 by Tianqi Chen as a C++ library, and it now has interfaces for Python, R, Julia, and other languages. The boosted trees are typically shallow, usually built with between 8 and 32 terminal nodes (leaves).

Gradient boosted decision trees: a machine learning technique that uses an ensemble of decision trees, built sequentially, to predict a target label.
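As a sketch, the boosting process this article walks through can be written compactly (using f for the running model and h_m for the m-th weak learner, matching the notation of the worked example below):

```latex
f_0(x) = \bar{y}, \qquad f_m(x) = f_{m-1}(x) + h_m(x)
```

where each h_m(x) is a small tree fit to the residuals y - f_{m-1}(x) left over from the previous round.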

There are two main reasons to use XGBoost, and they are also the two goals of the project:

Execution speed. It is genuinely fast compared to other implementations of gradient boosting.

Model performance.

In simple words, boosting predicts the result in small parts, then combines them into a complete prediction while tracking the MSE at each step. Example:

X and Y data table


As a first step, take the mean of the x and y data.

The mean of the y data will be the f0 value.

Mean of y = (82+80+103+118+172+127+204+189+99+166)/10

1340/10=134

So, f0=134 for each row

Table with the f0 column added.

Now calculate the error as y - f0.

Calculate h1(x) by splitting the rows at the mean of x and averaging the residuals on each side of the split.

The mean of x is 23, so for x <= 23, h1(x) = (-52 - 54 - 31 - 16)/4 = -38.25, and

for x > 23, h1(x) = (38 - 7 + 70 + 55 - 35 + 32)/6 = 25.5

Then calculate f1 as f0 + h1(x), and get the error after this iteration as y - f1.

Similarly, repeat this iteration a few more times to reduce the error, calculating h2(x), f2 and y - f2, then h3(x), f3 and y - f3.

Now calculate the MSE of each error column: y - f1, y - f2 and y - f3.

To get MSE(y - f1), square each row of y - f1, add up the squared values, and divide by 10 (the number of rows); repeat the same process for MSE(y - f2) and MSE(y - f3).
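The first boosting step above can be sketched in plain Python. The y values come from the example; the split into the first four rows for x <= 23 follows the grouping used above:

```python
# y values from the worked example
y = [82, 80, 103, 118, 172, 127, 204, 189, 99, 166]

# f0: the mean of y, used as the initial prediction for every row
f0 = sum(y) / len(y)                      # 134.0

# residuals after f0
resid = [yi - f0 for yi in y]             # [-52, -54, -31, -16, 38, -7, 70, 55, -35, 32]

# h1(x): the average residual on each side of the split at mean(x) = 23;
# in the example, the first four rows have x <= 23
left, right = resid[:4], resid[4:]
h1_left = sum(left) / len(left)           # -38.25
h1_right = sum(right) / len(right)        # 25.5

# f1 = f0 + h1(x), and the error after this iteration
f1 = [f0 + h1_left] * 4 + [f0 + h1_right] * 6
err1 = [yi - fi for yi, fi in zip(y, f1)]

# MSE before and after one boosting step
mse0 = sum(e * e for e in resid) / len(resid)
mse1 = sum(e * e for e in err1) / len(err1)
print(f0, h1_left, h1_right, mse0, mse1)
```

Running this reproduces the article's numbers and shows the point of the iteration: the MSE drops from 1850.4 after f0 to 875.025 after one boosting step.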

In the table above it is clearly visible how, with each iteration, the weak classifier reduces the error.

This complete process of boosting is faster than bagging.

In the bagging ensemble technique, the prediction is computed by combining the results of multiple classifiers, each trained on a different subsample of the same dataset.
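For contrast, bagging can be sketched in plain Python using the example's y values. Here each "model" is just a mean predictor trained on a bootstrap resample; the resampling is the point, not the model:

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

y = [82, 80, 103, 118, 172, 127, 204, 189, 99, 166]

# bagging: train several models, each on a bootstrap resample
# (sampling with replacement) of the same data
n_models = 100
model_preds = []
for _ in range(n_models):
    sample = [random.choice(y) for _ in y]         # bootstrap subsample
    model_preds.append(sum(sample) / len(sample))  # "train" a mean predictor

# the bagged prediction averages the individual models' predictions
bagged = sum(model_preds) / len(model_preds)
print(round(bagged, 1))
```

Unlike boosting, the individual models here are independent of one another, so they could all be trained in parallel; boosting's models must be built one after another because each one fits the previous round's residuals.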

How to install XGBOOST?

You can install XGBoost using pip:

```
pip3 install xgboost
```

To use its classifier, import it with `from xgboost import XGBClassifier`,

or, to use the scikit-learn-compatible wrapper:

```
import sklearn
from xgboost.sklearn import XGBClassifier
```

Let's see a simple use of XGBOOST by solving the problem statement below.

Python Implementation

Problem Statement: The Pima Indians Diabetes Dataset involves predicting the onset of diabetes within 5 years in Pima Indians given medical details. It is a binary (2-class) classification problem. The number of observations for each class is not balanced. There are 768 observations with 8 input variables and 1 output variable. Missing values are believed to be encoded with zero values. The variable names are as follows:

• Number of times pregnant.
• Plasma glucose concentration 2 hours in an oral glucose tolerance test.
• Diastolic blood pressure (mm Hg).
• Triceps skinfold thickness (mm).
• 2-Hour serum insulin (mu U/ml).
• Body mass index (weight in kg/(height in m)^2).
• Diabetes pedigree function.
• Age (years).
• Is Diabetic (0 or 1)
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import xgboost as xgb
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# load the dataset (file name assumed; point this at your copy of the Pima data)
data = pd.read_csv('pima-indians-diabetes.csv')
data.columns

# data visualization with matplotlib.pyplot
data.hist(figsize=(20, 10))
plt.show()

x = data.drop(labels='Is Diabetic', axis=1)
y = data['Is Diabetic']

# checking for missing values
data.isna().sum()

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=7)

# fit the model on the training data
model = XGBClassifier()
model.fit(X_train, y_train)

# get predictions for the test data
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]

# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
```

Accuracy: 74.02%