Thursday, July 16, 2020

Web Scraping - scraping website data using BeautifulSoup


In recent days I worked on scraping data from multiple web sources. BeautifulSoup is a useful library, and the good thing is that it is very easy to work with. If you have even a little idea about HTML, it becomes very easy. Before exploring this package, I also thought that web scraping was very difficult. But once you have a clear idea of what you want and in what format you want it, it's easy to get.

Step 1: Import the required packages
import requests
from bs4 import BeautifulSoup

Step 2: Define the URL you want to gather data from
base_site = "https://en.wikipedia.org/wiki/Music"

In my case, I gathered data from the Wikipedia Music page.

Step 3: Make a GET request and check whether it succeeded using status_code
response = requests.get(base_site)
response.status_code
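
One simple way to act on the status code (200 means success; raise_for_status() raises an error for 4xx/5xx responses):

if response.status_code == 200:
    print("Request succeeded")
else:
    # raise_for_status() raises an HTTPError for failed responses
    response.raise_for_status()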

Step 4: Extract the HTML content from the response object
html = response.content

Step 5: From here our BeautifulSoup journey starts
Convert the HTML to a BeautifulSoup object. This will allow us to parse content out of the HTML more easily. We can use different parsers, such as “html.parser” and “lxml”. I have written another article about parsers; please read it here.

soup = BeautifulSoup(html, "html.parser")
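
If you want the faster “lxml” parser instead, it needs the third-party lxml package installed first:

# lxml is a faster third-party parser (requires: pip install lxml)
soup = BeautifulSoup(html, "lxml")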

Step 6: Save the HTML object into a local file and explore the structure. This step is optional; you can also explore the structure on the web page itself. Decide based on your requirement.

with open('Wiki_response.html', 'wb') as file:
    file.write(soup.prettify('utf-8'))

# .prettify() returns the HTML with additional indentation for better readability

Step 7: We mostly work with two methods, “find” and “find_all”
find returns the first occurrence of an element in the markup; it returns None if the tag is not found in the markup.

Here I am finding the video tag in the soup object

soup.find('video')
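
Since find returns None when the tag is missing, it is safer to check the result before using it; a small sketch:

video = soup.find('video')
if video is None:
    print("No video tag found on this page")
else:
    print(video)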

find_all returns all occurrences in the entire markup, as a list of all matching items.
Here I am finding all anchor tags in the soup object

links = soup.find_all('a')
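
Each item in the returned list is a Tag object, so we can pull attributes out of it. For example, collecting the href of every link (links without an href are skipped):

# .get() returns None when the attribute is missing, so filter those out
hrefs = [link.get('href') for link in links if link.get('href')]
print(len(hrefs))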
  
Different ways to get the required element from the soup object:
1.     Fetching anchor tags by using a specific class attribute. This returns all anchor tags that have the ‘mw-jump-link’ class. class is a reserved keyword in Python, so we have to use class_ with a trailing underscore.

soup.find_all('a', class_ = 'mw-jump-link')

2.     Using both the class and href attributes

soup.find('a', class_ = 'mw-jump-link', href = '#p-search')

3.     We can pass a dictionary instead of separate attributes

soup.find('a', attrs={ 'class':'mw-jump-link', 'href':'#p-search' })
soup.find('div', {'id' : 'footer'})
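
Once you have an element, get_text() extracts its visible text. A quick sketch using the footer div from above (assuming the page actually has a div with id ‘footer’):

footer = soup.find('div', {'id': 'footer'})
if footer is not None:
    # .get_text(strip=True) drops the tags and trims surrounding whitespace
    print(footer.get_text(strip=True))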


In my specific case, I extracted content from multiple URLs, so I wrote a method to automate my requirement. Please check my GitHub link below.

github: https://github.com/kvsivasankar/web-scraping/blob/master/webscraping2.ipynb
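
The method in the notebook is specific to my requirement, but a minimal sketch of such a helper could look like this (the function name and URL list here are just examples):

def scrape_pages(urls):
    # Fetch each URL and return its BeautifulSoup object, keyed by URL
    results = {}
    for url in urls:
        response = requests.get(url)
        if response.status_code != 200:
            continue  # skip pages that failed to load
        results[url] = BeautifulSoup(response.content, "html.parser")
    return results

pages = scrape_pages(["https://en.wikipedia.org/wiki/Music",
                      "https://en.wikipedia.org/wiki/Jazz"])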




