Saturday, July 18, 2020

A better way of writing logs using a logger


As a data science programmer, you often build projects based on machine learning, deep learning, and other related techniques. While developing, it is easy to run the code and check whether the output matches your requirements.

During the development phase we mostly print the output and verify it. But in production it is very important to keep track of logs in a file while the program is running; otherwise, you will not know which modules or lines of code caused errors, and fixing a crash becomes very time consuming. In this post you will learn how to use logging in Python to keep a record of the events that occur within a program.

We created a solution that works in both development and production environments. The basic idea is to maintain a configuration file and read all logging-related information from it, so a developer can modify it at any point of time. In the development environment we can keep different config values as per our requirements.

In our solution we have two main files:
1.     logging.conf
2.     loghelper.py

logging.conf
The most important part is to define the loggers, handlers, and formatters:

[loggers]  # unique name
keys=root,simpleExample,filelogExample,timedRotatingFileHandler

[handlers] # defines the different log output types (console or file)
keys=consoleHandler,fileHandler,timedRotatingFileHandler

[formatters] # message formatter
keys=simpleFormatter

In my case, I defined three loggers (a sketch of the matching config sections follows this list):
1.     simpleExample to log errors on the console
2.     filelogExample to log errors to a specific log file
3.     timedRotatingFileHandler to log errors to a specific file, but it creates a new file every midnight
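A minimal sketch of how the matching sections of logging.conf might look (the levels, the python.log file name for the rotating handler, and the format string here are assumptions; the actual file is in the GitHub repo). The root and simpleExample/consoleHandler sections follow the same pattern.

[logger_filelogExample]
level=DEBUG
handlers=fileHandler
qualname=filelogExample
propagate=0

[logger_timedRotatingFileHandler]
level=DEBUG
handlers=timedRotatingFileHandler
qualname=timedRotatingFileHandler
propagate=0

[handler_fileHandler]
class=FileHandler
level=DEBUG
formatter=simpleFormatter
args=('python.log',)

[handler_timedRotatingFileHandler]
# rotates the log file every midnight and keeps the old files with a date suffix
class=handlers.TimedRotatingFileHandler
level=DEBUG
formatter=simpleFormatter
args=('python.log', 'midnight', 1)

[formatter_simpleFormatter]
format=%(asctime)s - %(name)s - %(levelname)s - %(message)s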

loghelper.py
This is a simple file; we just read the conf file here.

import logging
import logging.config

logging.config.fileConfig('logging.conf')


employee.py
Import loghelper, create a logger object using getLogger, and specify the name of the logger you want to use.

import loghelper
import logging
import traceback
import time

logger = logging.getLogger('filelogExample')

Based on the config values, all logs go to the python.log file. To write to the midnight-rotating file instead, use the other logger name:

logger = logging.getLogger('timedRotatingFileHandler')
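
Once the logger object is created, a few hypothetical calls in employee.py might look like this (the messages and levels below are just placeholders):

logger.debug("loading employee records")
logger.info("processed %d employee records", 25)
logger.warning("salary field missing for employee %s", "E101")
logger.error("failed to save employee record")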

Exception logging
import loghelper
import logging
import traceback

logger = logging.getLogger('filelogExample')

logger.error("uncaught exception: %s", traceback.format_exc())
or
logger.exception("exception")
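
A minimal sketch of how these two lines are normally used inside a try/except block (the division below is only a placeholder to force an error):

try:
    result = 10 / 0  # placeholder operation that raises ZeroDivisionError
except Exception:
    # logger.exception logs at ERROR level and appends the full traceback automatically;
    # logger.error("uncaught exception: %s", traceback.format_exc()) is the manual equivalent
    logger.exception("exception")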

I uploaded the files to GitHub; check the url here.




Thursday, July 16, 2020

Web Scraping - website data scraping using BeautifulSoup


In recent days I worked on scraping data from multiple web sources. BeautifulSoup is a useful library, and the good thing is that it is also very easy to work with. If you have a little knowledge of HTML then it's very easy. Before exploring this package I also thought that web scraping was very difficult, but once you have a clear idea of what you want and in what format you want it, it's easy to get.

Step1: Import required packages
import requests
from bs4 import BeautifulSoup

Step2: Define the url you want to gather data from
base_site = "https://en.wikipedia.org/wiki/Music"

In my case, I gathered data from the Wikipedia Music page.

Step3: Make a GET request and check whether it succeeded using status_code
response = requests.get(base_site)
response.status_code

Step4: Extract html content from response object
html = response.content

Step5: From here our BeautifulSoup journey starts
Convert the HTML to a BeautifulSoup object. This will allow us to parse out content from the HTML more easily. We can use different parsers like "html.parser" and "lxml". I have written another article about parsers; please read it here.

soup = BeautifulSoup(html, "html.parser")

Step6: Save the HTML object to a local file and explore the structure. This is an optional step; you can also inspect the structure on the web page itself, so decide based on your requirement.

with open('Wiki_response.html', 'wb') as file:
    file.write(soup.prettify('utf-8'))

# .prettify() modifies the HTML code with additional indentations for better readability

Step7: We mostly work with two methods, "find" and "find_all"
find returns the first occurrence of the element in the markup; it returns None if the tag is not found in the markup.

Here I am finding the video tag in the soup object:

soup.find('video')

find_all returns all occurrences in the entire markup, as a list of all matching items.
Here I am finding all anchor tags in the soup object:

links = soup.find_all('a')
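
Each element in the returned list is a Tag object, so attributes and text can be read directly; a purely illustrative example:

for link in links[:5]:
    # .get reads an attribute, .get_text extracts the visible text
    print(link.get('href'), link.get_text(strip=True))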
  
Different ways to get the required element from the soup object:
1.     By using a specific class attribute to fetch anchor tags. This returns all anchor tags that have the 'mw-jump-link' class. class is a reserved keyword in Python, so we have to use class_ with a trailing underscore.

soup.find_all('a', class_ = 'mw-jump-link')

2.     Using both the class and href attributes

soup.find('a', class_ = 'mw-jump-link', href = '#p-search')

3.     We can pass a dictionary instead of separate attributes

soup.find('a', attrs={ 'class':'mw-jump-link', 'href':'#p-search' })
soup.find('div', {'id' : 'footer'})


In my specific case, I extracted content from multiple urls and wrote a method to automate my requirement. Please check my github link below.
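
The real method is in the notebook linked below; a hypothetical sketch of automating the same flow for several urls (the function name and url list here are only illustrative) could look like:

def scrape_pages(urls):
    # fetch each url and return a dict mapping url -> parsed BeautifulSoup object
    soups = {}
    for url in urls:
        response = requests.get(url)
        if response.ok:
            soups[url] = BeautifulSoup(response.content, "html.parser")
    return soups

pages = scrape_pages(["https://en.wikipedia.org/wiki/Music",
                      "https://en.wikipedia.org/wiki/Jazz"])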

github: https://github.com/kvsivasankar/web-scraping/blob/master/webscraping2.ipynb





Web Scraping - Pulling data from public APIs


Sometimes we gather data from APIs, and JSON is the most common format in which we receive the output.
Here I am showing how to call APIs and handle errors.

step1: Import required packages
  • import requests
  • import json

step2: Call the api url using requests; here "base_url" is the web service url.
We can make a GET request to this API endpoint with requests.get
  • base_url = "https://api.exchangeratesapi.io/latest"
  • response = requests.get(base_url)

This method returns the response from the server
We store this response in a variable for future processing

step3: Verify whether the request went through ok or not (see the sketch below)
  • response.ok (returns True on success)
  • response.status_code (in my case it returned 200); we can look up the meaning of each status code
  • In case of an error, a 400 code will be returned
  • response.json() will show the error details
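
Putting those checks together, a small sketch of the request-and-verify step (error handling kept deliberately simple):

response = requests.get(base_url)

if response.ok:                      # True for any 2xx status code
    data = response.json()           # parse the JSON body into a Python dict
else:
    print("Request failed with status code:", response.status_code)
    print(response.json())           # error details returned by the API, if the body is JSON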

step4: Verify the data (a small parsing sketch follows)
  • response.text (returns the output in string format)
  • response.content (returns the content of the response in byte format)
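
Once the request succeeds, the JSON body can be read as a dictionary; the keys below (base, date, rates) assume the response shape this API returned at the time of writing:

data = response.json()               # parse the JSON string into a Python dict
print(data["base"], data["date"])    # base currency and the date of the rates
print(data["rates"]["USD"])          # exchange rate for one currency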



I uploaded the work to GitHub. Please find the link below.
github: https://github.com/kvsivasankar/web-scraping/blob/master/webscraping1.ipynb


Wednesday, July 15, 2020

Difference between "html.parser" vs "lxml"


1.     html.parser - BeautifulSoup(htmlmarkup, "html.parser")
    • Advantages: 
      • Batteries included
      • Decent speed
      • Built-in - no extra dependencies needed
      • Lenient (as of Python 2.7.3 and 3.2.)
    • Disadvantages: Not very lenient (before Python 2.7.3 or 3.2.2)

2.     lxml - BeautifulSoup(htmlmarkup, "lxml")
    • Advantages: 
      • Very fast
      • Lenient
      • Works well when the webpage has broken HTML
    • Disadvantages: External C dependency (lxml must be installed separately; see the sketch below)
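
To see the difference in practice, a small sketch that parses the same broken markup with both parsers (lxml has to be installed separately, e.g. with pip install lxml):

from bs4 import BeautifulSoup

markup = "<p>Unclosed paragraph<li>item"               # deliberately broken HTML

soup_builtin = BeautifulSoup(markup, "html.parser")    # built-in, no extra dependency
soup_lxml = BeautifulSoup(markup, "lxml")              # external, very fast and lenient

print(soup_builtin.prettify())
print(soup_lxml.prettify())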




The documentation mentions this clearly; check the url here.
