https://www.yahoo.co.jp/robots.txt
User-agent: *
https://news.yahoo.co.jp/robots.txt
User-agent: Google-Extended
Allow: /articles/*/comments
Disallow: /articles
Disallow: /pickup

User-agent: GPTBot
Allow: /articles/*/comments
Disallow: /articles
Disallow: /pickup

User-agent: *
Disallow: /comment/plugin/
Disallow: /comment/violation
Disallow: /profile/violation
Disallow: /polls/widgets/
Disallow: /articles/*/comments
Disallow: /articles/*/order
Disallow: /senkyo
Sitemap: https://news.yahoo.co.jp/sitemaps.xml
Sitemap: https://news.yahoo.co.jp/sitemaps/article.xml
When scraping with the standard libraries in the usual way, you should consult the site's robots.txt file and behave accordingly (a minimal check is sketched below). Even if web scraping is allowed, it is considered good etiquette to insert a time.sleep call so that you do not overload the website. For reference, Wikipedia provides downloadable data dumps of all its content, so it is better not to scrape that site.
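As a minimal sketch (not part of the original example), the standard library's urllib.robotparser can read a robots.txt file and answer whether a given path may be fetched for a given User-agent. Note that this parser follows the original robots.txt specification, so wildcard path patterns such as /articles/*/comments may not be interpreted exactly the way large crawlers interpret them.

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://news.yahoo.co.jp/robots.txt')
rp.read()

# '/senkyo' is disallowed for every user agent in the robots.txt shown above,
# while the top page has no Disallow rule for '*'.
print(rp.can_fetch('*', 'https://news.yahoo.co.jp/senkyo'))  # expected: False
print(rp.can_fetch('*', 'https://news.yahoo.co.jp/'))        # expected: True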
The following code sets multiple target URLs in the target_sites list, retrieves each page, detects its character encoding and decodes it (falling back to utf-8 when detection fails), converts the HTML to plain text using BeautifulSoup, and prints it. The requests are made at one-second intervals using time.sleep(1).
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError
import chardet
import time

target_sites = ['https://www.yahoo.co.jp/', 'https://news.yahoo.co.jp/']

def get_html_content(url):
    html_content = ''
    try:
        print('fetching url', url)
        q = Request(url)
        html = urlopen(q, timeout=15)
        temp = html.read()
        # Detect the character encoding; fall back to utf-8 if detection fails.
        detect_enc = chardet.detect(temp)['encoding']
        if detect_enc is None:
            detect_enc = 'utf-8'
        html_content = temp.decode(detect_enc, errors='ignore')
    except Exception as e:
        print('fetching url failed', url, repr(e))
    return html_content
for site in target_sites:
    html_content = get_html_content(site)
    if not html_content:
        continue  # skip sites whose content could not be fetched
    soup = BeautifulSoup(html_content, 'html.parser')
    print(soup.get_text())
    # Wait one second between requests so the sites are not overloaded.
    time.sleep(1)
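Building on the code above, the following hedged sketch combines the robots.txt check with the fetch loop: before each request, urllib.robotparser confirms that the URL is allowed for a generic crawler. The helper name allowed_by_robots is introduced here for illustration and is not part of the original example; target_sites, get_html_content, BeautifulSoup, and time are assumed to be defined as above.

from urllib.robotparser import RobotFileParser
from urllib.parse import urlparse

def allowed_by_robots(url, user_agent='*'):
    # Build the robots.txt URL for the site and check the target URL against it.
    parts = urlparse(url)
    rp = RobotFileParser()
    rp.set_url(f'{parts.scheme}://{parts.netloc}/robots.txt')
    try:
        rp.read()
    except Exception:
        return False  # if robots.txt cannot be read, err on the side of caution
    return rp.can_fetch(user_agent, url)

for site in target_sites:
    if not allowed_by_robots(site):
        print('skipping (disallowed by robots.txt)', site)
        continue
    html_content = get_html_content(site)
    if html_content:
        soup = BeautifulSoup(html_content, 'html.parser')
        print(soup.get_text())
    time.sleep(1)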