When the UTF-8 text is decoded,
chardet is detecting it as Windows-1254,
which results in garbled characters.
detect_enc = chardet.detect(temp)['encoding']
In that case,
you can add a condition that falls back to UTF-8
when the text is mistakenly detected as Windows-1254.
elif detect_enc == 'Windows-1254':
    detect_enc = 'utf-8'
html_content = temp.decode(detect_enc, errors='ignore')
Here is the improved version of the scraping part.
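import chardet
# Assuming Python 3, where Request and urlopen live in urllib.request
from urllib.request import Request, urlopen

def get_html_content(url):
    html_content = ''
    try:
        print('fetching url', url)
        q = Request(url)
        html = urlopen(q, timeout=15)
        temp = html.read()
        # chardet reports None as the encoding when it cannot guess
        detect_enc = chardet.detect(temp)['encoding']
        if detect_enc is None:
            detect_enc = 'utf-8'
        elif detect_enc == 'Windows-1254':
            # UTF-8 pages are sometimes misdetected as Windows-1254
            detect_enc = 'utf-8'
        html_content = temp.decode(detect_enc, errors='ignore')
    except Exception as e:
        print('fetching url failed', url, repr(e))
    return html_content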
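
One caveat: the override above also forces UTF-8 on pages that really are Windows-1254 (for example, Turkish sites), and errors='ignore' would then silently drop characters. If that matters for you, a possible variation is to override the guess only when the bytes are actually valid UTF-8. This is a minimal sketch, not part of the original code; pick_encoding is a hypothetical helper name, and it takes the detector's guess as an argument so it runs without chardet.

```python
def pick_encoding(raw, detected):
    """Choose the encoding to decode `raw` with, given a detector's guess.

    `detected` is the value you would get from chardet.detect(raw)['encoding'].
    pick_encoding is a hypothetical helper, not part of chardet.
    """
    if detected is None:
        # The detector could not guess; fall back to UTF-8.
        return 'utf-8'
    if detected.lower() == 'windows-1254':
        try:
            # Strict decode: raises if the bytes are not valid UTF-8.
            raw.decode('utf-8')
            return 'utf-8'
        except UnicodeDecodeError:
            # Not valid UTF-8, so keep the detector's guess.
            return detected
    return detected


sample = 'café'.encode('utf-8')
print(pick_encoding(sample, 'Windows-1254'))  # prints utf-8
```

In get_html_content, you would then replace the if/elif pair with detect_enc = pick_encoding(temp, detect_enc).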