Recently I have been working on scraping data from multiple web sources. BeautifulSoup is a useful library, and it is very easy to work with. If you have a little knowledge of HTML, it becomes even easier. Before exploring this package I also thought that web scraping was very difficult. But once you have a clear idea of what data you want and in what format, it is easy to get.
Step1: Import required packages
import requests
from bs4 import BeautifulSoup
Step2: Define the URL you want to gather data from
base_site = "https://en.wikipedia.org/wiki/Music"
In my case, I gathered data from the Wikipedia Music page.
Step3: Make a GET request and check whether it succeeded or failed using status_code
response = requests.get(base_site)
response.status_code
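Putting steps 1-4 together, a minimal sketch of the fetch-and-check pattern looks like the following. To keep it runnable without touching the network, it serves a tiny page from a local test server instead of Wikipedia; in practice you would point base_site at the real URL.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests

class Handler(BaseHTTPRequestHandler):
    """Tiny stand-in server so the example runs offline."""
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(b"<html><body><h1>Music</h1></body></html>")
    def log_message(self, *args):
        pass  # silence per-request logging

server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# In the article this would be "https://en.wikipedia.org/wiki/Music"
base_site = f"http://127.0.0.1:{server.server_port}/"
response = requests.get(base_site)

# 200 means success; on anything else, raise instead of parsing garbage
if response.status_code == 200:
    html = response.content
else:
    response.raise_for_status()

server.shutdown()
```

Checking status_code (or calling raise_for_status()) before parsing saves you from silently scraping an error page.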
Step4: Extract the HTML content from the response object
html = response.content
Step5: From here our BeautifulSoup journey starts
Convert the HTML to a BeautifulSoup object. This allows us to parse content out of the HTML more easily. We can use different parsers such as “html.parser” and “lxml”. I have written another article about parsers. Please read here
soup = BeautifulSoup(html, "html.parser")
Step6: Save the HTML into a local file and explore its structure. This step is optional: you can also inspect the page in the browser itself. Decide based on your requirement.
with open('Wiki_response.html', 'wb') as file:
    file.write(soup.prettify('utf-8'))
# .prettify() modifies the HTML code with additional indentations for better readability
Step7: We mostly work with two methods, “find” and “find_all”
find returns the first occurrence of an element in the markup; it returns None if the tag is not found in the markup.
Here I am finding the video tag in the soup object
soup.find('video')
find_all returns all occurrences in the entire markup, as a list of all matching items.
Here I am finding all anchor tags in soup object
links = soup.find_all('a')
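The difference between the two methods can be seen on a small inline HTML snippet (used here instead of the full Wikipedia page so the example is self-contained):

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the Wikipedia page
html = """
<html><body>
  <a class="mw-jump-link" href="#p-search">Jump to search</a>
  <a href="/wiki/Music">Music</a>
  <a href="/wiki/Sound">Sound</a>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.find('a')        # only the first anchor tag
links = soup.find_all('a')    # a list of every anchor tag
missing = soup.find('video')  # no <video> in the snippet, so None

print(first['href'])               # -> #p-search
print([a['href'] for a in links])  # -> ['#p-search', '/wiki/Music', '/wiki/Sound']
print(missing)                     # -> None
```

Because find can return None, always guard against it before accessing attributes on the result.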
Different ways to get a required element from the soup object
1. Fetching anchor tags by a specific class attribute. This returns all anchor tags that have the ‘mw-jump-link’ class. class is a reserved keyword in Python, so we have to write it with an underscore as class_.
soup.find_all('a', class_ = 'mw-jump-link')
2. Using both the class and href attributes
soup.find('a', class_ = 'mw-jump-link', href = '#p-search')
3. We can pass a dictionary instead of separate attributes
soup.find('a', attrs={ 'class':'mw-jump-link', 'href':'#p-search' })
soup.find('div', {'id' : 'footer'})
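All three variants select the same elements; a small runnable sketch on an inline snippet (a stand-in for the Wikipedia markup) shows the keyword and dictionary styles side by side:

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <a class="mw-jump-link" href="#p-search">Jump to search</a>
  <a class="mw-jump-link" href="#p-nav">Jump to navigation</a>
  <div id="footer">Footer text</div>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ keyword arguments and an attrs dictionary select the same tag
by_keyword = soup.find('a', class_='mw-jump-link', href='#p-search')
by_dict = soup.find('a', attrs={'class': 'mw-jump-link', 'href': '#p-search'})
footer = soup.find('div', {'id': 'footer'})

print(by_keyword == by_dict)  # -> True
print(footer.get_text())      # -> Footer text
```

The attrs dictionary form is handy when the attribute names are built dynamically or clash with Python keywords.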
In my specific case, I extracted content from multiple URLs, so I wrote a method to automate the process. Please check my github link below.
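The automation for multiple URLs might look roughly like this hypothetical sketch (extract_links and scrape_urls are illustrative names, not the author's actual code). The parsing half is split out so it can run on any HTML string:

```python
import requests
from bs4 import BeautifulSoup

def extract_links(html):
    """Return every (text, href) pair found in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return [(a.get_text(strip=True), a.get('href')) for a in soup.find_all('a')]

def scrape_urls(urls):
    """Fetch each URL and map it to its extracted links; skip failed requests."""
    results = {}
    for url in urls:
        response = requests.get(url)
        if response.status_code == 200:
            results[url] = extract_links(response.content)
    return results

# The parsing half demonstrated on a small inline snippet
sample = '<p><a href="/wiki/Music">Music</a> and <a href="/wiki/Art">Art</a></p>'
print(extract_links(sample))  # -> [('Music', '/wiki/Music'), ('Art', '/wiki/Art')]
```

In real use you would call scrape_urls with a list of page URLs and get back one list of links per page.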