python - English
This program checks the differences between the favorite blogs and reader blogs of a user of Ameblo, one of the most famous Japanese blog sites. Set the user in the following line: ameblo_user = 'XXXXXXXXXXXXXX' This script retrieves links…
When conducting data analysis, deciding on the unit for aggregating data is crucial. I am currently analyzing data obtained from bulk email collection, and aggregating it on a weekly basis seems beneficial. Here's a sampl…
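As a rough illustration of weekly aggregation, here is a minimal sketch using only the standard library. It is not the sample from the post itself; the helper names `week_start` and `count_by_week` and the list of dates are hypothetical, and the week is assumed to start on Monday.

```python
from collections import Counter
from datetime import date, timedelta


def week_start(d):
    """Return the Monday of the week that contains d."""
    return d - timedelta(days=d.weekday())


def count_by_week(dates):
    """Count items per week, keyed by the Monday that starts each week."""
    return Counter(week_start(d) for d in dates)


# Hypothetical received dates for a handful of emails.
received = [
    date(2024, 1, 1),  # Monday
    date(2024, 1, 3),  # Wednesday, same week
    date(2024, 1, 7),  # Sunday, same week
    date(2024, 1, 8),  # Monday of the following week
]

weekly = count_by_week(received)
for start in sorted(weekly):
    print(start.isoformat(), weekly[start])
```

Keying each bucket by the week's first day keeps the result sortable and avoids ambiguity around ISO week numbers at year boundaries.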
UTF-8 text is being detected by chardet as Windows-1254, which results in garbled characters when it is decoded. detect_enc = chardet.detect(temp)['encoding'] In that case, you can add an additional condition to handle the case when the text is mistakenly…
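One possible shape for such a condition is sketched below. It assumes that a Windows-1254 guess is usually wrong for this data and tries UTF-8 first; the function name `decode_with_fallback` is hypothetical, and the detection result is simulated so the sketch runs without chardet installed.

```python
def decode_with_fallback(raw, detected):
    """Decode raw bytes, distrusting a Windows-1254 detection result."""
    if detected and detected.lower() == "windows-1254":
        try:
            # Assumption: the bytes are really UTF-8 and were misdetected.
            return raw.decode("utf-8")
        except UnicodeDecodeError:
            pass  # Not valid UTF-8 after all; use the detected encoding.
    return raw.decode(detected or "utf-8", errors="replace")


temp = "日本語のテキスト".encode("utf-8")
# In real use: detect_enc = chardet.detect(temp)['encoding']
detect_enc = "Windows-1254"  # simulated misdetection
text = decode_with_fallback(temp, detect_enc)
print(text)
```

Bytes that are not valid UTF-8 still fall through to the detected encoding, so genuinely Windows-1254 text that fails UTF-8 decoding is decoded as before.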
I think you can follow this flow to prepare for data analysis with Python. Once you reach the TextAnalyzer, various statistical analyses can be conducted.
Previously, I posted a web scraping example using urllib, but it becomes difficult to handle when there is a session involved. Using Selenium to utilize a browser makes it easier to handle sessions. In the following sample: You set URL, XP…
I've modularized the previous version of the email fetching script, focusing solely on message retrieval. I omitted the decoding and content processing, leaving those tasks to be handled in the main processing script if needed. I've also a…
As mentioned in a previous article, when you use the conventional Python library method of fetching mailboxes, it seems that parsing the entire file is required, so it can take a considerable amount of time. import mailbox mbox = mailbox.m…
With the web scraping method, you may want to collect a large amount of data independently and automate the collection. The most important thing is whether the website allows web scraping or not. You can find out by looking at the contents…
Assuming there is a file named "category_social.mbox" in the same folder, the following Python script counts the frequency of term occurrences using CountVectorizer, only for emails with the word "follow" in their titles. This serves as a …
Here are the minimum preprocessing steps required when performing text analysis in Python: Read the text Tokenize the text Count the tokenized words Below is a sample: First, prepare the data. Refer to a Wikipedia Japanese page and store i…
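The three steps can be sketched with the standard library alone. This is a simplified stand-in, not the post's sample: it assumes English text held in a string literal and a hypothetical `tokenize` helper that splits on word characters. Japanese text has no spaces between words, so it would need a morphological analyzer in place of the regular expression.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens (whitespace-delimited languages only)."""
    return re.findall(r"\w+", text.lower())


# Step 1: read the text (a literal here; a file or web page in practice).
text = "Python reads text. Python counts words in text."

# Step 2: tokenize the text.
tokens = tokenize(text)

# Step 3: count the tokenized words.
counts = Counter(tokens)
print(counts.most_common(3))
```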
When performing Japanese natural language processing in Python, I borrowed wisdom from senior bloggers and managed to create a process for tokenization using MeCab. I'll paste the source code at the end of this article. As an FYI in this b…
'cp932' codec can't encode character '©' in position 3337: illegal multibyte sequence This exception is commonly encountered when running Python on the Japanese version of Windows, but is almost never observed on Linux. Verify the point below before delv…
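The error can be reproduced on any platform by encoding to cp932 explicitly, as in the minimal sketch below. The sample string is hypothetical, and the two alternatives shown (replacing unencodable characters, or encoding as UTF-8) are general options rather than the checklist from the post.

```python
text = "Copyright © 2024"  # hypothetical sample containing '©'

error_message = None
try:
    # cp932 has no mapping for '©', so this raises UnicodeEncodeError.
    text.encode("cp932")
except UnicodeEncodeError as exc:
    error_message = str(exc)
print(error_message)

# Option 1: keep cp932 but replace characters it cannot represent with '?'.
safe_bytes = text.encode("cp932", errors="replace")

# Option 2: use UTF-8, which can represent every Unicode character.
utf8_bytes = text.encode("utf-8")
```

The same choices apply when writing files: passing `encoding="utf-8"` or `errors="replace"` to `open()` avoids relying on the platform's default encoding.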
Because I tend to forget things, I've decided to start a blog to keep memos for my future self. Over the past decade or so, my Gmail account has accumulated a lot of emails, mostly advertisements and newsletters. With such a large collection, I …
Memo on decoding the email header. Using the sender/from address as an example. from email.header import decode_header decoded = decode_header(message['from']) outstr = decoded[0][0].decode(decoded[0][1]) # sample: decoded[0][0] # b'Linked…
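A slightly more defensive variant of the snippet above is sketched here. It assumes the header may contain several parts, that some parts may have no declared charset, and that `decode_header` may return `str` instead of `bytes` for a plain header. The helper name `decode_mime_header` and the sample header are hypothetical.

```python
from email.header import Header, decode_header


def decode_mime_header(raw):
    """Decode every part of a MIME-encoded header and join them into one string."""
    parts = []
    for value, charset in decode_header(raw):
        if isinstance(value, bytes):
            # Parts without a declared charset are treated as ASCII (assumption).
            parts.append(value.decode(charset or "ascii", errors="replace"))
        else:
            parts.append(value)  # already a str (header had no encoded words)
    return "".join(parts)


# Build a hypothetical encoded header, then decode it again.
encoded = Header("テスト", "utf-8").encode()
print(encoded)
print(decode_mime_header(encoded))
```

Indexing only `decoded[0]` drops any later parts, such as the address that follows an encoded display name, and calling `.decode(None)` fails when a part has no charset, which is why the loop handles each part separately.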