My Tech Life

Memo by a Japanese Software Developer in his late 50s.

Web Scraping with Python

Web scraping is useful when you want to collect

a large amount of data on your own

and automate the collection.

 

The most important thing is

whether the website allows web scraping or not.

 

You can find out by looking at the contents of

"robots.txt" located in the root directory of the site.

 

https://www.yahoo.co.jp/robots.txt

User-agent: *

 

https://news.yahoo.co.jp/robots.txt

User-agent: Google-Extended
Allow: /articles/*/comments
Disallow: /articles
Disallow: /pickup

User-agent: GPTBot
Allow: /articles/*/comments
Disallow: /articles
Disallow: /pickup

User-agent: *
Disallow: /comment/plugin/
Disallow: /comment/violation
Disallow: /profile/violation
Disallow: /polls/widgets/
Disallow: /articles/*/comments
Disallow: /articles/*/order
Disallow: /senkyo
Sitemap: https://news.yahoo.co.jp/sitemaps.xml
Sitemap: https://news.yahoo.co.jp/sitemaps/article.xml
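
To see a robots.txt for yourself, you can also fetch it directly;

a minimal sketch using only the standard library

(robots.txt files are normally UTF-8):

from urllib.request import urlopen

# Fetch and print a robots.txt (the URL is one of the examples above)
with urlopen('https://news.yahoo.co.jp/robots.txt', timeout=15) as resp:
  print(resp.read().decode('utf-8'))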

 

Note that fetching pages with urllib does not

consult robots.txt automatically; the standard library

provides urllib.robotparser for reading it, as sketched below.
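
A minimal sketch that checks the rules listed above;

passing '*' matches the generic User-agent section

(note that urllib.robotparser matches literal path prefixes

and does not expand the * wildcards some sites use):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('https://news.yahoo.co.jp/robots.txt')
rp.read()

# Based on the listing above: /senkyo is disallowed for all agents
print(rp.can_fetch('*', 'https://news.yahoo.co.jp/senkyo'))

# /articles itself is not disallowed for the generic agent
print(rp.can_fetch('*', 'https://news.yahoo.co.jp/articles'))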

 

Even if web scraping is allowed,

it's considered good etiquette to pause between requests with time.sleep

to avoid overloading the website.

 

For reference, Wikipedia provides downloadable

dumps of all its content (https://dumps.wikimedia.org/),

so it's better to use those than to scrape the site.

 

The following code sets multiple target URLs

in the target_sites list,

 

then retrieves each page's content,

decodes the bytes using the encoding detected by chardet,

and extracts and prints the page text using BeautifulSoup.

 

The content retrieval is executed at one-second intervals

using time.sleep(1).

 

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import chardet
import time

target_sites = ['https://www.yahoo.co.jp/','https://news.yahoo.co.jp/']

def get_html_content(url):
  # Fetch a URL and return its decoded HTML, or '' on failure
  html_content = ''
  try:
    print('fetching url', url)
    req = Request(url)
    with urlopen(req, timeout=15) as resp:
      raw = resp.read()

    # Detect the character encoding from the raw bytes; fall back to utf-8
    detect_enc = chardet.detect(raw)['encoding']
    if detect_enc is None:
      detect_enc = 'utf-8'
    html_content = raw.decode(detect_enc, errors='ignore')

  except Exception as e:
    print('fetching url failed', url, repr(e))

  return html_content

for site in target_sites:
  html_content = get_html_content(site)
  if html_content:
    soup = BeautifulSoup(html_content, 'html.parser')
    print(soup.get_text())  # plain text of the whole page
  time.sleep(1)  # wait one second between requests out of courtesy
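
If you want specific elements rather than the whole page text,

BeautifulSoup can pick them out of the parsed soup;

a minimal self-contained sketch with a toy HTML string:

from bs4 import BeautifulSoup

html = '<html><head><title>Sample</title></head><body><a href="https://example.com/">link</a></body></html>'
soup = BeautifulSoup(html, 'html.parser')

if soup.title:
  print(soup.title.get_text())  # page title: Sample
for a in soup.find_all('a'):    # every link in the page
  print(a.get('href'))          # href attribute, e.g. https://example.com/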