2024-03-25

"Windows-1254" decoding error with Japanese text during Python web scraping

python - English

When decoding UTF-8 text, chardet sometimes detects the encoding as Windows-1254, and decoding with that encoding produces garbled characters.

detect_enc = chardet.detect(temp)['encoding']
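To see what the misdetection does, here is a minimal sketch that needs no network access. The sample string is hypothetical, and `cp1254` is Python's codec name for Windows-1254. It decodes the same UTF-8 bytes twice: once with the wrong encoding and once with the right one.

```python
# Hypothetical sample; any Japanese string encoded as UTF-8 behaves similarly.
original = '日本語のテキスト'
raw = original.encode('utf-8')

# Decode with the misdetected encoding (Python's codec name is 'cp1254').
# errors='replace' is used because some bytes are undefined in cp1254.
garbled = raw.decode('cp1254', errors='replace')

# Decoding with the real encoding restores the text.
restored = raw.decode('utf-8')

print(garbled)               # mojibake, not the original text
print(restored == original)  # True
```

The bytes themselves are fine; only the encoding label is wrong, which is why overriding the label is enough to fix the output.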

In that case, you can add one more condition to handle text that is mistakenly detected as Windows-1254.

elif detect_enc == 'Windows-1254':
    detect_enc = 'utf-8'
html_content = temp.decode(detect_enc, errors='ignore')

Here is the improved version of the scraping part.

import chardet
from urllib.request import Request, urlopen

def get_html_content(url):
    """Fetch url and return its HTML as text ('' on failure)."""
    html_content = ''
    try:
        print('fetching url', url)
        q = Request(url)
        html = urlopen(q, timeout=15)

        temp = html.read()
        detect_enc = chardet.detect(temp)['encoding']
        if detect_enc is None:
            # chardet could not guess; assume UTF-8
            detect_enc = 'utf-8'
        elif detect_enc == 'Windows-1254':
            # UTF-8 Japanese text is sometimes misdetected as Windows-1254
            detect_enc = 'utf-8'
        html_content = temp.decode(detect_enc, errors='ignore')

    except Exception as e:
        print('fetching url failed', url, repr(e))

    return html_content
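As a possible variation (not part of the original fix), you could try a strict UTF-8 decode first and only consult the detector when that fails. Valid UTF-8 then never reaches the detector, so the Windows-1254 special case is not needed. The sketch below is an assumption-laden illustration: `decode_html` and its `fallback_detector` parameter are hypothetical names, and the detector is passed in as a callable so the function itself depends only on the standard library.

```python
def decode_html(raw, fallback_detector=None):
    """Decode bytes, preferring strict UTF-8 over a detector's guess.

    fallback_detector is a hypothetical hook: a callable that takes the
    raw bytes and returns an encoding name, or None if it cannot guess.
    """
    try:
        # Strict decode: succeeds only if the bytes are valid UTF-8.
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        pass

    # Not valid UTF-8, so ask the detector (if one was supplied).
    guessed = fallback_detector(raw) if fallback_detector else None

    # Last resort: decode leniently so the caller still gets text back.
    return raw.decode(guessed or 'utf-8', errors='replace')
```

With chardet, the call might look like `decode_html(temp, lambda b: chardet.detect(b)['encoding'])`. Note that this sketch uses `errors='replace'` rather than `errors='ignore'`, so undecodable bytes show up as replacement characters instead of silently disappearing.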
