My Tech Life

Memo by a Japanese Software Developer in his late 50s.

An example of web scraping with a browser, using Python.

Previously, I posted a web scraping example using urllib,

but that approach becomes difficult to manage

when a session is involved.
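
For reference, keeping a session alive with urllib means managing cookies by hand. A minimal sketch of what that looks like (the login URL and form fields here are hypothetical):

import urllib.request
import urllib.parse
import http.cookiejar

# hypothetical login endpoint and form fields
login_url = 'https://example.com/login'
form_data = urllib.parse.urlencode({'user': 'me', 'password': 'secret'}).encode()

# the CookieJar keeps the session cookie returned by the login response
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

# log in, then request a page that only works with the session cookie
opener.open(login_url, form_data)
with opener.open('https://example.com/mypage') as res:
    print(res.read().decode('utf-8'))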

 

Using Selenium to drive a browser

makes it easier to handle sessions.
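
Because the browser itself stores the cookies, a session carries over between page loads within the same driver. A minimal sketch of that idea (the URLs are hypothetical):

from selenium import webdriver

driver = webdriver.Firefox()

# any session cookie set by the first page is kept by the browser
driver.get('https://example.com/login')
# ... fill in and submit the login form here ...

# later requests from the same driver reuse those cookies automatically
driver.get('https://example.com/mypage')
print(driver.get_cookies())  # inspect the cookies the browser is holding

driver.quit()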

 

In the following sample,

you set the URL, XPath, and search string in xpath_list.

 

The script loops through xpath_list,

gets the HTML through the browser,

and converts the returned HTML to text using BeautifulSoup (bs4).

 

A common way to obtain an XPath to set in the list is

to use the developer tools built into the browser,

select the desired element,

and copy its XPath from the context menu.
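
It can help to check a copied XPath interactively before putting it in the list. A small sketch, reusing the Yahoo entry from the sample below:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Firefox()
driver.get('https://www.yahoo.co.jp')

# paste the XPath copied from the developer tools here
xpath = '/html/body/div/div/header/section[1]/div/form/fieldset/span/input'
try:
    elem = driver.find_element(by=By.XPATH, value=xpath)
    print('found:', elem.tag_name)
except NoSuchElementException:
    print('not found, the XPath may need adjusting')

driver.quit()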

 

#!/usr/bin/env python3
# -*- coding: utf8 -*-

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time
import re
from bs4 import BeautifulSoup

# URL, Xpath, QueryString
xpath_list = [
        ['https://www.yahoo.co.jp', '/html/body/div/div/header/section[1]/div/form/fieldset/span/input', 'Selenium'],
        ['https://dev.to', '/html/body/header/div/div[1]/form/div/div/input', 'Selenium']
]

for elem in xpath_list:
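    # start a fresh Firefox instance for each target site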
    driver = webdriver.Firefox()
    # URL
    driver.get(elem[0])

    # Xpath
    xpath = elem[1]
    m = driver.find_element(by=By.XPATH, value=xpath)

    # QueryString
    m.send_keys(elem[2])
    m.send_keys(Keys.ENTER)

    # wait briefly for the results page to load, then grab its HTML
    time.sleep(1)
    html = driver.page_source

    soup = BeautifulSoup(html, 'html.parser')
    text_source = soup.get_text()
    for line in text_source.split('\n'):
        # skip empty lines, print everything else
        if re.search(r'^\s*$', line):
            continue
        print(line)

    driver.quit()
    time.sleep(1)
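
As a possible refinement, the fixed time.sleep(1) could be replaced with an explicit wait, so the script proceeds as soon as the element it needs is actually present. A minimal sketch of that idea, using the dev.to entry from the list above:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get('https://dev.to')

# wait up to 10 seconds for the search box instead of sleeping a fixed time
xpath = '/html/body/header/div/div[1]/form/div/div/input'
m = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.XPATH, xpath)))
m.send_keys('Selenium')

driver.quit()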