My Tech Life

Memo by a Japanese Software Developer in his late 50s.

python - English

Python script to check the differences between the followers and followed blogs of a user on a Japanese blog site.

This program checks the differences between the favorite blogs and reader blogs of a user of Ameblo, one of the most famous Japanese blog sites. Set the user in the following line: ameblo_user = 'XXXXXXXXXXXXXX' This script retrieves links…
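
As a rough illustration of the comparison step only, here is a minimal sketch using Python sets. The function name compare_blog_lists and the sample URLs are hypothetical and not taken from the script; retrieving the links from Ameblo is left out.

```python
def compare_blog_lists(favorites, readers):
    """Compare two collections of blog URLs and return the three groups."""
    fav, rdr = set(favorites), set(readers)
    return {
        "favorite_only": sorted(fav - rdr),  # blogs the user follows that are not readers
        "reader_only": sorted(rdr - fav),    # readers the user does not follow
        "mutual": sorted(fav & rdr),         # present in both lists
    }


# Hypothetical sample data (placeholders, not real blogs)
favorites = ["https://example.com/a", "https://example.com/b"]
readers = ["https://example.com/b", "https://example.com/c"]

result = compare_blog_lists(favorites, readers)
for group, urls in result.items():
    print(group, urls)
```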

How to group emails by week into arrays in Python

When conducting data analysis, deciding on the unit for aggregating data is crucial. I am currently analyzing data obtained from bulk email collection, and aggregating it on a weekly basis seems beneficial. Here's a sampl…
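
One possible way to do the weekly grouping is sketched below, keyed by ISO year and ISO week number. The function name group_by_week and the sample data are assumptions for illustration; the original sample may group differently (for example, weeks starting on Sunday).

```python
from collections import defaultdict
from datetime import datetime


def group_by_week(dated_items):
    """Group (datetime, item) pairs into lists keyed by (ISO year, ISO week)."""
    weeks = defaultdict(list)
    for when, item in dated_items:
        iso = when.isocalendar()  # (ISO year, ISO week number, ISO weekday)
        weeks[(iso[0], iso[1])].append(item)
    return dict(sorted(weeks.items()))


# Hypothetical sample data: (received date, subject)
emails = [
    (datetime(2024, 1, 1, 9, 0), "mail A"),    # Monday, ISO week 1
    (datetime(2024, 1, 7, 18, 30), "mail B"),  # Sunday, ISO week 1
    (datetime(2024, 1, 8, 8, 15), "mail C"),   # Monday, ISO week 2
]

weekly = group_by_week(emails)
for week, subjects in weekly.items():
    print(week, subjects)
```

ISO weeks start on Monday, so a Sunday mail lands in the same group as the preceding Monday.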

Decoding Error of "Windows-1254" in Japanese during Python web scraping.

When decoding UTF-8 text, the encoding is sometimes detected as Windows-1254, resulting in garbled characters. detect_enc = chardet.detect(temp)['encoding'] In that case, you can add an additional condition to handle the case when the text is mistakenly…
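
A minimal sketch of such a condition follows. It takes the detected encoding name as a parameter (in the post it comes from chardet.detect), so the block itself needs no third-party package. The function name decode_bytes is hypothetical, and treating every Windows-1254 guess as UTF-8 is an assumption that suits Japanese pages but not Turkish ones.

```python
def decode_bytes(raw, detected_encoding):
    """Decode raw bytes, treating a 'Windows-1254' guess as UTF-8."""
    encoding = detected_encoding or "utf-8"  # the detector may return None
    if encoding.lower() == "windows-1254":
        # Assumption: on Japanese pages this guess is a misdetection of UTF-8
        encoding = "utf-8"
    return raw.decode(encoding, errors="replace")


# Simulate UTF-8 bytes that a detector wrongly labelled as Windows-1254
raw = "日本語のページ".encode("utf-8")
text = decode_bytes(raw, "Windows-1254")
print(text)
```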

Python Text Analysis Pre-Processing Chart

I think you can follow this flow to prepare for data analysis with Python. Once you reach the TextAnalyzer, various statistical analyses can be conducted.

An example of web scraping by a browser with Python.

Previously, I posted a web scraping example using urllib, but it becomes difficult to handle when there is a session involved. Using Selenium to utilize a browser makes it easier to handle sessions. In the following sample: You set URL, XP…

Fetching emails in bulk using Python (More improved version).

I've modularized the previous version of the email fetching script, focusing solely on message retrieval. I omitted the decoding and content processing, leaving those tasks to be handled in the main processing script if needed. I've also a…

How to read .mbox files in Python (Improved Version)

As mentioned in a previous article, when you use the conventional Python library method of fetching mailboxes, it seems that parsing the entire file is required, so it can take a considerable amount of time. import mailbox mbox = mailbox.m…
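
One way to avoid parsing the whole file up front is to stream it and split on the "From " separator lines, as in the sketch below. This is not necessarily the approach taken in the article. The function name iter_mbox_messages is hypothetical, and the sketch assumes body lines starting with "From " are escaped as ">From " in the file, which is the usual mbox convention.

```python
import email
import os
import tempfile


def iter_mbox_messages(path):
    """Yield email.message.Message objects one at a time from an mbox file."""
    buffer = []
    with open(path, "rb") as fh:
        for line in fh:
            if line.startswith(b"From "):  # separator line starts a new message
                if buffer:
                    yield email.message_from_bytes(b"".join(buffer))
                    buffer = []
                continue
            buffer.append(line)
    if buffer:  # last message in the file
        yield email.message_from_bytes(b"".join(buffer))


# Small self-contained demo with a temporary two-message mbox file
sample = (
    b"From alice@example.com Mon Jan  1 09:00:00 2024\n"
    b"Subject: first\nFrom: alice@example.com\n\nHello\n\n"
    b"From bob@example.com Tue Jan  2 10:00:00 2024\n"
    b"Subject: second\nFrom: bob@example.com\n\nWorld\n"
)
with tempfile.TemporaryDirectory() as tmp:
    mbox_path = os.path.join(tmp, "sample.mbox")
    with open(mbox_path, "wb") as out:
        out.write(sample)
    subjects = [msg["subject"] for msg in iter_mbox_messages(mbox_path)]

print(subjects)
```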

Web Scraping with Python

With the web scraping method, you may want to collect a large amount of data independently and automate the collection. The most important thing is whether the website allows web scraping or not. You can find out by looking at the contents…

Counting the frequency of specific terms in emails with Python.

Assuming there is a file named "category_social.mbox" in the same folder, the following Python script counts the frequency of term occurrences using CountVectorizer, only for emails with the word "follow" in their titles. This serves as a …

Analyzing Japanese text in Python: Counting tokenized words.

Here are the minimum preprocessing steps required when performing text analysis in Python: (1) read the text, (2) tokenize the text, (3) count the tokenized words. Below is a sample: First, prepare the data. Refer to a Wikipedia Japanese page and store i…
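
The three steps can be sketched with the standard library alone. Note the assumption: the regex tokenizer below only works for space-delimited text such as English. Japanese text needs a morphological analyzer for the tokenize step, so this shows the shape of the pipeline rather than the article's sample. The function name count_tokens is hypothetical.

```python
import re
from collections import Counter


def count_tokens(text):
    """Tokenize on word characters (lower-cased) and count each token."""
    tokens = re.findall(r"\w+", text.lower())  # stand-in tokenizer, not for Japanese
    return Counter(tokens)


# Step 1: read the text (an inline string here instead of a file)
sample = "The cat sat on the mat. The mat was warm."
# Steps 2 and 3: tokenize and count
counts = count_tokens(sample)
print(counts.most_common(3))
```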

Japanese Text Processing - Tokenization in Python

When performing Japanese natural language processing in Python, I borrowed wisdom from senior bloggers and managed to create a tokenization process using MeCab. I'll paste the source code at the end of this article. As an FYI, in this b…

english post: 'cp932' codec can't encode character '\x##' in position ####: illegal multibyte sequence

'cp932' codec can't encode character '©' in position 3337: illegal multibyte sequence This exception is commonly encountered when running Python on the Japanese version of Windows, but is almost never observed on Linux. Verify the point below before delv…
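
The error can be reproduced on any platform by encoding explicitly to cp932, as in the sketch below. The two workarounds shown (errors="replace" and encoding to UTF-8 instead) are common options, not necessarily the ones the article recommends; the sample string is made up.

```python
text = "Copyright © 2024"

# Strict encoding fails because '©' has no mapping in cp932
try:
    text.encode("cp932")
    strict_ok = True
except UnicodeEncodeError as exc:
    strict_ok = False
    print("encode failed:", exc)

# Option 1: replace unencodable characters with '?'
replaced = text.encode("cp932", errors="replace")

# Option 2: use UTF-8, which can represent the character
utf8_bytes = text.encode("utf-8")

print(replaced)
print(utf8_bytes.decode("utf-8"))
```

When the error comes from print() or from writing a file, the same idea applies: pass encoding="utf-8" to open() rather than relying on the Windows default.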

reading .mbox file in python

With a tendency to forget things often, I've decided to open a blog to keep future memos. Over the past decade or so, my Gmail account has accumulated a lot of emails, mostly advertisements and newsletters. With such a large collection, I …

decoding mail headers in python

Memo on decoding the email header. Using the sender/from address as an example. from email.header import decode_header decoded = decode_header(message['from']) outstr = decoded[0][0].decode(decoded[0][1]) # sample: decoded[0][0] # b'Linked…
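
The snippet above decodes only the first part and assumes it is bytes with a known charset. A slightly more defensive sketch is shown below: it joins all parts and handles plain strings and a missing charset. The function name decode_mime_header and the sample address are hypothetical.

```python
from email.header import decode_header


def decode_mime_header(value):
    """Decode a possibly MIME-encoded header value into a single str."""
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            try:
                parts.append(chunk.decode(charset or "ascii", errors="replace"))
            except LookupError:
                # Unknown charset name: fall back to UTF-8 (assumption)
                parts.append(chunk.decode("utf-8", errors="replace"))
        else:
            parts.append(chunk)  # already a str when nothing was encoded
    return "".join(parts)


# Hypothetical header: a UTF-8 encoded display name followed by an address
sender = decode_mime_header("=?utf-8?b?5pel5pys6Kqe?= <sender@example.com>")
print(sender)
```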