How to build a simple web scraper using BeautifulSoup

#webscraping #beautifulsoup

Retin P Kumar Nov 16 2021 · 5 min read

Introduction

Data collection is one of the most important steps after defining and analyzing a problem in a data science project. There are multiple ways of collecting data, but in this article we will focus on web scraping techniques to collect data for our data science projects.

Building a universal web scraper is neither feasible nor desirable. Every web scraper should be built for a particular situation and updated whenever the page it targets changes.

But here we will focus on creating a simple scraper class, using the BeautifulSoup and Requests libraries, that can by default scrape the links and images from any static webpage. And of course, we will put it to work on a real static webpage as well.

Defining the steps

We will make use of OOP in Python to create this scraper from scratch.

Let's list out the methods we require to build this scraper:

1. A method for parsing the URL

This method should create a custom object that could be used to access any web element on the given page.

2. A method for fetching all links from the webpage

3. A method for fetching all image links from the webpage

4. A method for downloading images from the image links to your current directory.

Now that we have listed the core functionalities of this scraper, let's build the scraper step by step.

Building the scraper

Let's create the scraper class and define our methods. We will call this scraper by the name "StaticSiteScraper".

class StaticSiteScraper:
    def __init__(self):
        pass

    # A method for parsing the url
    def url_parse(self):
        pass

    # A method for fetching all links from the webpage
    def get_all_links(self):
        pass

    # A method for fetching all image links from the webpage
    def get_all_image_links(self):
        pass

    # A method for downloading images from the image links
    def download_images(self):
        pass  

1. Fetching the HTML page content

Let's define the function for fetching the HTML page.

We will use the requests library to fetch the HTML content from the URL and store it in a variable called data. (The snippets below assume we have already run import requests and from bs4 import BeautifulSoup as bs, along with import os and import datetime for the later steps.)

data = requests.get(url).text

Then, we will create a soup object by passing this HTML content and a parser to BeautifulSoup (imported as bs); the soup object gives us a navigable parse tree of the page.

soup = bs(data, parser)

And finally our function will return this soup object.

Putting it all together, we have:

def url_parse(url, parser='html.parser'):
    data = requests.get(url).text
    soup = bs(data, parser)
    return soup

The soup object created can be used to access any element within the parsed HTML content.

For example, we can access the title text of the webpage using the following:

soup.title.text

Similarly, you can use the soup object to access any element you need.
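
For instance, a few other common lookups look like this (a minimal sketch; the tag names and the CSS selector are only illustrative):

# a few illustrative BeautifulSoup lookups (tag names/selectors are examples)
first_heading = soup.find('h1')        # first <h1> tag, or None if the page has none
paragraphs = soup.find_all('p')        # list of every <p> tag on the page
nav_links = soup.select('nav a')       # CSS selector: <a> tags inside a <nav> tag

if first_heading is not None:
    print(first_heading.text)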

Now that we have our soup object, we can use this object to create methods for fetching all the links and images from the parsed HTML content.

2. Fetching all the links

Let's start by defining a list for storing all the links we collect using the method. This list will be the output for our method.

We know that links are stored in <a> tags within an HTML page. So, to access the links, we first need to collect the <a> tags, which can be done using:

soup.find_all('a')

This will fetch us a list of all the <a> tags in the parsed HTML content.

Now, we will have to iterate through each of the elements within this list and extract the link from the "href" attribute. Then, we will store the collected links in the list that we had defined earlier.

for tag in a_tags:
    items = str(tag).split(" ")

    for item in items:
        if 'href' in item:
            link_item = item.split("=")[1].split("\"")[1]
            if 'http' in link_item or 'www' in link_item:
                links.append(link_item)

Putting it all together, we have:

def get_all_links(soup):
    links=[]

    a_tags = soup.find_all('a')

    for tag in a_tags:
        items = str(tag).split(" ")
        for item in items:
            if 'href' in item:
                link_item = item.split("=")[1].split("\"")[1]
                if 'http' in link_item or 'www' in link_item:
                    links.append(link_item)
    return links
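
As a side note, BeautifulSoup can also read a tag's attributes directly, which avoids the string splitting above. This is only an alternative sketch, not the approach used in the rest of this article:

def get_all_links_alt(soup):
    links = []
    for tag in soup.find_all('a'):
        href = tag.get('href')  # read the attribute instead of splitting the tag's text
        if href and ('http' in href or 'www' in href):
            links.append(href)
    return links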

3. Fetching all image links

This step is very similar to the previous one, where we collected all the links from the parsed content.

The only difference is that instead of the <a> tag we use the <img> tag, and instead of the "href" attribute we use the "src" attribute to access the image links.

def get_all_image_links(soup):
    img_links=[]

    img_tags = soup.find_all('img')
    for tag in img_tags:
        items = str(tag).split(" ")
        for item in items:
            if 'src' in item:
                link_item = item.split("=")[1].split("\"")[1]
                if 'http' in link_item or 'www' in link_item:
                    img_links.append(link_item)
    return img_links

4. Downloading the images

Now that we have all the image links, we will proceed to download the images and store them in our local directory. For that, we will pass the image list as a parameter to this method.

We will start by creating a date_ variable that stores the current date as a "d-m-yyyy_" string (note the trailing underscore), which we will use as a directory name.

day = datetime.date.today().day
month = datetime.date.today().month
year = datetime.date.today().year

date_ = str(day)+"-"+str(month)+"-"+str(year)+"_"
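
If you prefer, the same thing can be done in one line with strftime; note that it zero-pads the day and month (so "06-03-2022_" rather than "6-3-2022_"), which is a slight difference from the code above:

date_ = datetime.date.today().strftime("%d-%m-%Y") + "_"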

Before saving the images, we will check whether an "images" directory already exists in our current working directory.

If not found, we will create it. We then move into the "images" directory and create a new subdirectory, named with the current date, to store the images.

if 'images' not in os.listdir():
    os.mkdir("images")
os.chdir("images")

os.mkdir(date_)
os.chdir(date_)
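
If you would rather not change the working directory, an alternative sketch is to build the target path up front with os.makedirs and join it to each filename when saving (this is not the approach used in the final class below):

target_dir = os.path.join("images", date_)
os.makedirs(target_dir, exist_ok=True)  # creates 'images' and the dated subfolder if needed
# later, save with: open(os.path.join(target_dir, filename), "wb")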

Now, we will initialize an image number variable to name the images. We could take each image's name from the "alt" attribute of its <img> tag, but some images won't have one. So, to keep the file names uniform, we will simply number the images.

img_no = 1

Next, we will iterate through the image links, fetch the images and store them in our local directory.

for link in image_list:
    img_response = requests.get(link)

    img_format = link.split(".")[-1]

    filename = "img" + str(img_no) + "." + img_format

    with open(filename, "wb+") as f:
        f.write(img_response.content)
    img_no += 1
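
As a small defensive tweak, which is not part of the scraper built in this article, you could skip any link that does not return a successful response before writing the file:

for link in image_list:
    img_response = requests.get(link)
    # skip links that did not return a 200 OK response
    if img_response.status_code != 200:
        continue
    img_format = link.split(".")[-1]
    filename = "img" + str(img_no) + "." + img_format
    with open(filename, "wb+") as f:
        f.write(img_response.content)
    img_no += 1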

Putting it all together, we have:

def download_images(image_list):
    day = datetime.date.today().day
    month = datetime.date.today().month
    year = datetime.date.today().year

    date_ = str(day)+"-"+str(month)+"-"+str(year)+"_"

    if 'images' not in os.listdir():
        os.mkdir("images")
    os.chdir("images")

    os.mkdir(date_)
    os.chdir(date_)

    img_no = 1

    for link in image_list:
        img_response = requests.get(link)

        img_format = link.split(".")[-1]

        filename = "img" + str(img_no) + "." + img_format

        with open(filename, "wb+") as f:
            f.write(img_response.content)
        img_no += 1

Now that we have defined all our methods, let's put everything together into the StaticSiteScraper class. I won't go into detail here since it is mostly self-explanatory: the class adds docstrings, basic error handling, and a get_title helper on top of the methods above.

Here's what our final scraper class looks like.

class StaticSiteScraper:
    '''
        Author: Retin P Kumar
        Contact: [email protected]
        
        Methods
        =======
        
        url_parse
        ---------
        Returns the parsed html content of given url
        
        get_title
        ---------
        Returns the title of the given webpage
        
        get_all_links
        -------------
        Returns a list of all the links in the given url
        
        get_all_image_links
        -------------------
        Returns a list of all the links of images in the given url
        
        download_images
        ---------------
        Checks for an "images" directory within the current working directory.
        If not found, creates an "images" directory in the current working 
        directory. Saves all the images from the links returned by get_all_image_links
        method.       
    '''
    # importing libraries
    import os
    import datetime
    import requests
    from bs4 import BeautifulSoup as bs
    
    def __init__(self, url: str, parser: str='html.parser'):
        '''
            Parameters
            ----------
            url: url of webpage
            parser: object for parsing text files
            
            Returns
            -------
            self: object
        '''
        self.url = url
        self.parser = parser
    
    def url_parse(self): 
        '''
        A method for parsing the url
        
            Parameters
            ----------
            self: object
            
            Returns
            -------
            self.soup: parsed web html content
        '''
        try:
            # getting response from webpage
            data = self.requests.get(self.url).text
            # creating a html parsed object
            self.soup = self.bs(data, self.parser)
            # printing a prettified version of parsed html content
            print(self.soup.prettify())
            return self.soup
        except Exception as e:
            print("Url not parsable. ", e)
    
    def get_title(self):
        '''
        A method for fetching the page title
        
            Parameters
            ----------
            self: object
            
            Returns
            -------
            title: webpage title text
        '''
        try:
            # fetching the webpage title
            title = self.soup.title.text
            print(title)
        except:
            print("Cannot find title.")
            
    def get_all_links(self):
        '''
        A method for fetching all links from the webpage
        
            Parameters
            ----------
            self: object
            
            Returns
            -------
            links: list of all links in the given url
        '''
        links=[] # list to store all the links
        try:
            # fetching all <a> tags
            self.a_tags = self.soup.find_all('a')
        except:
            print("Cannot find all <a> tags")
            return links
        
        # iterating through each element in the list of <a> tags
        for tag in self.a_tags:
            items = str(tag).split(" ")
            for item in items:
                if 'href' in item:
                    link_item = item.split("=")[1].split("\"")[1]
                    if 'http' in link_item or 'www' in link_item:
                        links.append(link_item)
        return links
            
    def get_all_image_links(self):
        '''
        A method for fetching all image links from the webpage
        
            Parameters
            ----------
            self: object
            
            Returns
            -------
            self.img_links: list of all image links in the given url
        '''
        self.img_links=[] # list to store all images
        
        # iterating through each element in the list of images
        img_tags = self.soup.find_all('img')
        for tag in img_tags:
            items = str(tag).split(" ")
            for item in items:
                if 'src' in item:
                    link_item = item.split("=")[1].split("\"")[1]
                    if 'http' in link_item or 'www' in link_item:
                        self.img_links.append(link_item)
        return self.img_links
        
    def download_images(self, image_list):
        '''
        A method for downloading images from the image links
        
            Parameters
            ----------
            self: object
            image_list: list of image links
            
            Returns
            -------
            Images in the image links inside an 'images' directory within current
            working directory.
        '''
        self.image_list = image_list
        try:
            # fetching today's date for directory name
            day = self.datetime.date.today().day
            month = self.datetime.date.today().month
            year = self.datetime.date.today().year

            date_ = str(day)+"-"+str(month)+"-"+str(year)+"_"
 
            # creating a new directory to store images
            if 'images' not in self.os.listdir():
                self.os.mkdir("images")
            self.os.chdir("images")
            
            self.os.mkdir(date_)
            self.os.chdir(date_)
            
            #image number
            img_no = 1
                
            # iterating through list of image links
            for link in self.image_list:
                # fetching image from webpage
                img_response = self.requests.get(link)
                
                #file format
                img_format = link.split(".")[-1]
                
                # creating a filename to store the image
                filename = "img" + str(img_no) + "." + img_format
                
                # saving the image using the filename
                with open(filename, "wb+") as f:
                    f.write(img_response.content)
                img_no += 1
                
            print(f"{len(self.img_links)} images downloaded succesfully into 'images/{date_}' directory.")    
        except Exception as e:
            print("Error while downloading images: ", e)

Implementing our static scraper

Finally, we have our scraper class.

It's time to test our StaticSiteScraper. For that, I'm moving on to craigslist.org's New York real estate page.

Getting the URL of the page

url = "https://newyork.craigslist.org/d/real-estate/search/rea"

Creating a craig object using our StaticSiteScraper

# Instantiating class object
craig = StaticSiteScraper(url)

Creating a page object that can be used for accessing any other HTML elements within this page if required.

# Parsing html contents
page = craig.url_parse()

Now, we will access all the links within this webpage.

# Fetching all links within the webpage
craig.get_all_links()

Similarly, we will now fetch the links of all the images on the webpage.

# Fetching the links of all images within the webpage
imagedata = craig.get_all_image_links()
imagedata

And we have our output as:

['//www.craigslist.org/images/animated-spinny.gif',
 '//www.craigslist.org/images/animated-spinny.gif']

Before proceeding to download all the images, make sure to check the following:

1. Whether we were actually able to fetch the image links

The reason is that most websites nest their images inside <div> tags, often several levels deep, so a flat pass over the page will rarely pick up every image in one go.

In that case, you will have to use the page object we created earlier to dig deeper into the tags that hold the image links, collect each link, and store it in a list.
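
As a rough sketch, that digging might look something like the following; the 'result-image' class name is purely hypothetical, so replace it with whatever the page you are scraping actually uses:

image_links = []
# 'result-image' is a made-up class name; inspect the page to find the real one
for div in page.find_all('div', class_='result-image'):
    for img in div.find_all('img'):
        src = img.get('src')
        if src:
            image_links.append(src)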

2. Whether the image links are in a usable format

Quite often the image links are not complete URLs (here, for instance, they are protocol-relative and start with //), so we need to put them into a proper format before passing them to the download method.

# formatting image list
image_list = []
for item in imagedata:
    image_list.append("http:"+item)

print(image_list)

Now our output looks like:

['http://www.craigslist.org/images/animated-spinny.gif',
 'http://www.craigslist.org/images/animated-spinny.gif']
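
Prefixing "http:" works here, but a more general way to turn relative or protocol-relative links into full URLs is urllib.parse.urljoin, which resolves each link against the page's own URL (and so yields https in this case). A minimal sketch, not used in the scraper above:

from urllib.parse import urljoin

image_list = [urljoin(url, item) for item in imagedata]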

Finally, we download all the images on the webpage into an 'images' directory:

# Downloading all images within the webpage into an 'images' directory
craig.download_images(image_list=image_list)

which prints:

2 images downloaded successfully into 'images/16-11-2021_' directory.

With that, we have now created a web scraper that can fetch all the static content of a given URL.

You can find the GitHub version of the above article at https://github.com/Retinpkumar/Webscraping/blob/main/webscraper_v1.0.ipynb

Finally, I hope you liked it. Feel free to post your valuable feedback and suggestions.

You can contact me at: [email protected] or

visit my GitHub profile at https://github.com/Retinpkumar
