My Tech Life

Memo by a Japanese Software Developer in his late 50s.

Japanese text misdetected as "Windows-1254" during Python web scraping.

When scraping a Japanese page that is actually UTF-8,

chardet can detect the bytes as Windows-1254,

and decoding with that encoding produces garbled characters.
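
To see the symptom in isolation, here is a minimal sketch using only the standard library. The sample string is an assumption for illustration, not taken from a real page.

```python
# Sample text is an assumption for illustration only.
text = '日本語'
raw = text.encode('utf-8')

# Decoding with the wrong (misdetected) encoding gives garbled characters.
garbled = raw.decode('windows-1254', errors='ignore')

# Decoding with the real encoding restores the text.
correct = raw.decode('utf-8')

print(repr(garbled))
print(repr(correct))
```

The bytes are the same in both cases; only the encoding name passed to decode() differs.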

 

detect_enc = chardet.detect(temp)['encoding']

 

In that case,

add a condition that falls back to UTF-8

whenever the detected encoding is Windows-1254.

 

    elif detect_enc == 'Windows-1254':
      detect_enc = 'utf-8'
    html_content = temp.decode(detect_enc, errors='ignore')

 

Here is the improved version of the scraping part.

 

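import chardet
# Request and urlopen are assumed to come from the standard library
from urllib.request import Request, urlopen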

def get_html_content(url):
  html_content = ''
  try:
    print('fetching url', url)
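    q = Request(url)
    # close the connection once the body has been read
    with urlopen(q, timeout=15) as html:
      temp = html.read()

    detect_enc = chardet.detect(temp)['encoding']
    if detect_enc is None:
      # chardet could not guess, so assume UTF-8
      detect_enc = 'utf-8'
    elif detect_enc == 'Windows-1254':
      # UTF-8 Japanese text misdetected as Windows-1254
      detect_enc = 'utf-8'
    # errors='ignore' drops any bytes that still fail to decode
    html_content = temp.decode(detect_enc, errors='ignore')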

  except Exception as e:
    print('fetching url failed', url, repr(e))

  return html_content
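
Overriding Windows-1254 would also affect pages that really are in that encoding (it is the Turkish code page). As a possible alternative, the sketch below tries a strict UTF-8 decode first and only falls back to the detected encoding when that fails. This is not part of the fix above: decode_bytes is a hypothetical helper, and the detected encoding is passed in as an argument so the example runs without chardet.

```python
def decode_bytes(raw, detected_enc):
  """Decode raw bytes, preferring UTF-8 over the detected encoding.

  decode_bytes is a hypothetical helper for illustration;
  detected_enc stands for chardet.detect(raw)['encoding'].
  """
  try:
    # Valid UTF-8 decodes cleanly no matter what the detector guessed.
    return raw.decode('utf-8')
  except UnicodeDecodeError:
    pass

  try:
    # Not UTF-8: trust the detected encoding, if there is one.
    return raw.decode(detected_enc or 'utf-8', errors='ignore')
  except LookupError:
    # Python does not know the detected encoding name.
    return raw.decode('utf-8', errors='ignore')


sample = '日本語のページ'  # made-up sample text
from_utf8 = decode_bytes(sample.encode('utf-8'), 'Windows-1254')
from_sjis = decode_bytes(sample.encode('shift_jis'), 'SHIFT_JIS')
print(from_utf8 == sample, from_sjis == sample)
```

With this approach a wrong guess from the detector does no harm as long as the bytes are valid UTF-8, while Shift_JIS or EUC-JP pages still go through the detected encoding.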