Web Scraping

I want to use web scraping to collect a large amount of data on my own and to automate the collection.

The most important point is whether the site permits web scraping.

You can find out by looking at the contents of "robots.txt" in the site's root directory.

https://www.yahoo.co.jp/robots.txt

User-agent: *

https://news.yahoo.co.jp/robots.txt

User-agent: Google-Extended
Allow: /articles/*/comments
Disallow: /articles
Disallow: /pickup

User-agent: GPTBot
Allow: /articles/*/comments
Disallow: /articles
Disallow: /pickup

User-agent: *
Disallow: /comment/plugin/
Disallow: /comment/violation
Disallow: /profile/violation
Disallow: /polls/widgets/
Disallow: /articles/*/comments
Disallow: /articles/*/order
Disallow: /senkyo
Sitemap: https://news.yahoo.co.jp/sitemaps.xml
Sitemap: https://news.yahoo.co.jp/sitemaps/article.xml

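Note that general-purpose HTTP libraries such as urllib.request do not read robots.txt on their own; they simply fetch whatever URL they are given. Some crawling frameworks can be configured to honor it, but otherwise it is up to you to check the rules before fetching.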
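To check the rules explicitly, one option is Python's urllib.robotparser. The following is a minimal sketch, not a definitive implementation: it parses a few rules copied from the news.yahoo.co.jp robots.txt quoted above and asks whether a given user agent may fetch a URL. The helper name is_allowed and the user agent string "my-scraper" are my own assumptions for illustration. As far as I know, the standard-library parser matches rule paths as plain prefixes and does not expand "*" inside a path, so wildcard rules such as /articles/*/comments are left out here and would need a manual check or a different parser.

```python
from urllib.robotparser import RobotFileParser

# A few non-wildcard rules copied from the robots.txt quoted above.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /articles
Disallow: /pickup

User-agent: *
Disallow: /comment/plugin/
Disallow: /senkyo
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if user_agent may fetch url under the given robots.txt text.

    is_allowed is a hypothetical helper name, not part of any library.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "my-scraper" is a made-up user agent; it falls under the "*" group.
print(is_allowed(ROBOTS_TXT, "my-scraper", "https://news.yahoo.co.jp/"))
print(is_allowed(ROBOTS_TXT, "my-scraper", "https://news.yahoo.co.jp/senkyo"))
print(is_allowed(ROBOTS_TXT, "GPTBot", "https://news.yahoo.co.jp/pickup/1"))
```

Against a live site, you would instead call set_url() with the robots.txt URL and then read() on the parser before calling can_fetch(), so that the current rules are downloaded rather than hard-coded.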

Even when web scraping is permitted, it is good manners to avoid putting load on the site, for example by inserting time.sleep between requests.
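If a fixed time.sleep feels too blunt, the waiting logic can be pulled into a small helper that only sleeps for whatever is left of the interval. This is a rough sketch under my own assumptions: the class name Throttle and its parameters are hypothetical, and the clock and sleep functions are injectable only so the behavior can be checked without actually waiting.

```python
import time


class Throttle:
    """Enforce a minimum interval between successive calls to wait().

    Throttle is a hypothetical helper for illustration, not a library class.
    """

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Sleep just long enough that min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Typical use: call throttle.wait() right before each request.
# throttle = Throttle(1.0)
# for url in urls:
#     throttle.wait()
#     ...fetch url...
```

Compared with an unconditional time.sleep(1), this does not add extra delay when fetching and parsing a page already took longer than the interval.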

For reference, Wikipedia provides its full data as downloadable dumps, so it is better not to scrape it.

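The code below registers several target URLs in the target_sites list. For each one it fetches the content, detects the encoding with chardet (falling back to utf-8) and decodes the bytes into text, then uses BeautifulSoup to convert the HTML to plain text and prints it.

With time.sleep(1), the fetches run at one-second intervals.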

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError
import chardet
import time

target_sites = ['https://www.yahoo.co.jp/', 'https://news.yahoo.co.jp/']

def get_html_content(url):
    """Fetch url and return its body as text, or '' on failure."""
    html_content = ''
    try:
        print('fetching url', url)
        q = Request(url)
        # Close the response as soon as the body has been read.
        with urlopen(q, timeout=15) as html:
            temp = html.read()

        # Guess the encoding from the raw bytes; fall back to utf-8.
        detect_enc = chardet.detect(temp)['encoding']
        if detect_enc is None:
            detect_enc = 'utf-8'
        html_content = temp.decode(detect_enc, errors='ignore')

    except Exception as e:
        print('fetching url failed', url, repr(e))

    return html_content

for site in target_sites:
    html_content = get_html_content(site)
    if not html_content:
        # Stop at the first failed (or empty) fetch.
        break
    else:
        # Strip the HTML tags and print only the text.
        soup = BeautifulSoup(html_content, 'html.parser')
        print(soup.get_text())
    # Wait one second between requests to keep the load on the site low.
    time.sleep(1)

My Tech Life

Memo by a Japanese Software Developer in his late 50s.
