My Tech Life

Memo by a Japanese Software Developer in his late 50s.

How to read .mbox files in Python (Improved Version)

As mentioned in a previous article,

the standard approach with Python's mailbox module

appears to parse the entire file before returning any messages,

so it can take a considerable amount of time on a large .mbox file.

 

import mailbox

mbox = mailbox.mbox('example.mbox')

for message in mbox:
    print("Subject:", message['subject'])

mbox.close()
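
One rough way to check this on your own data is to time how long the mailbox module takes to return just the first message. The sketch below is only an illustration: the helper name time_to_first_message is hypothetical, and the measured time will depend on the file size and disk speed.

```python
import mailbox
import time

def time_to_first_message(path):
    """Return (seconds, subject) for fetching only the first message via mailbox.mbox."""
    start = time.perf_counter()
    # create=False: raise an error instead of silently creating a missing file.
    mbox = mailbox.mbox(path, create=False)
    try:
        first = next(iter(mbox), None)  # None if the mailbox is empty
        subject = first['subject'] if first is not None else None
    finally:
        mbox.close()
    return time.perf_counter() - start, subject
```

For example, print(time_to_first_message('example.mbox')) shows how long you wait before the first subject is available.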

 

I asked ChatGPT to improve the script so that it reads and outputs messages sequentially,

but in the source code it produced,

the message object was not instantiated correctly.

 

                message = email.message_from_bytes(b'\n'.join(lines), policy=default)

 

My debugging showed the following:

Each element of the input lines already ends with "\r\n" (a line break),

so joining them with "\n" (another line break)

inserted an empty line after every line,

and as a result the mail header was treated as ending after the first line.

 

Therefore, I made the following correction.

 

            # ***** modification starts here *****
            #lines.append(line)
            lines.append(line.rstrip(b'\r\n'))
            # ***** modification ends here *****

 

This bug is hard to notice

unless you know that "the mail header extends until the first empty line."
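
The effect can be reproduced on a tiny hand-made message. This is a minimal sketch with made-up header values, not data from a real mailbox: it parses the same lines once joined as-is and once after stripping the trailing line breaks.

```python
import email
from email.policy import default

# Lines as readline() returns them: each one still ends with "\r\n".
raw_lines = [
    b"Subject: hello\r\n",
    b"Date: Mon, 01 Jan 2024 00:00:00 +0000\r\n",
    b"\r\n",
    b"body text\r\n",
]

# Joining as-is doubles every line break, so a blank line follows "Subject:".
broken = email.message_from_bytes(b"\n".join(raw_lines), policy=default)

# Stripping the trailing line break first leaves exactly one break per line.
stripped = [line.rstrip(b"\r\n") for line in raw_lines]
fixed = email.message_from_bytes(b"\n".join(stripped), policy=default)

print(broken.keys())  # only the first header is recognized
print(fixed.keys())   # both headers are recognized
```

In the broken case the Date line ends up in the body, because the parser stops reading headers at the first empty line.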

 

Below is the full code of the corrected version.

It runs efficiently, printing each mail as soon as it is read.

 

Next, I will look further into mail analysis

based on this source code.

 

#!/usr/bin/env python3
# -*- encoding: utf-8 -*-

mboxfilename = 'category_social.mbox'

import email
from email.policy import default
from email.header import decode_header

class MboxReader:
    def __init__(self, filename):
        self.handle = open(filename, 'rb')

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, exc_traceback):
        self.handle.close()

    def __iter__(self):
        return self

    def __next__(self):
        lines = [ ]
        while True:
            line = self.handle.readline()
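            # End of file or a "From " separator line: the collected lines form one message.
            if line == b'' or line.startswith(b'From '):
                message = email.message_from_bytes(b'\n'.join(lines), policy=default)
                # A message with no headers is falsy, so the empty chunk before the first "From " line is skipped.
                if message: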
                    subject = message['subject']
                    if subject:
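                        # With policy=default the subject is normally already decoded;
                        # this fallback only decodes the first RFC 2047 fragment.
                        decoded_subject = decode_header(subject)[0]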
                        if decoded_subject[1]:
                            subject = decoded_subject[0].decode(decoded_subject[1])
                    date = message['Date']
                    payload = message.get_payload()
                    if isinstance(payload, list):
                        payload = payload[0].get_payload()  # Handle multipart messages
                    return subject, payload, date
                if line == b'':
                    raise StopIteration
                lines = []
                continue
            # ***** modification starts here *****
            #lines.append(line)
            lines.append(line.rstrip(b'\r\n'))
            # ***** modification ends here *****

with MboxReader(mboxfilename) as mbox:
    for i, (subject, payload, date) in enumerate(mbox):
        #if i > 100: break

        #print("Subject:", subject)
        print(i, date, subject)
        #print("Body:", payload)
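
As a starting point for that further analysis, the same idea can also be written as a generator that yields whole message objects. This is only a sketch under stated assumptions: the names iter_mbox and plain_text are hypothetical, and it assumes that body lines beginning with "From " have been escaped (for example as ">From ") when the mbox was written, otherwise a message would be split at that line. Instead of stripping each line, it keeps the original line breaks and joins with b'', which is another way to avoid the doubled line breaks.

```python
import email
from email.policy import default

def iter_mbox(path):
    """Yield one EmailMessage per mbox entry, reading the file line by line."""
    lines = []
    with open(path, 'rb') as handle:
        for line in handle:
            if line.startswith(b'From '):
                # A "From " line starts a new message; emit the one collected so far.
                if lines:
                    yield email.message_from_bytes(b''.join(lines), policy=default)
                lines = []
                continue
            lines.append(line)  # keep the original line break
    if lines:
        # The last message has no following "From " line.
        yield email.message_from_bytes(b''.join(lines), policy=default)

def plain_text(message):
    """Return the text/plain body as str, or None if the message has none."""
    part = message.get_body(preferencelist=('plain',))
    return part.get_content() if part is not None else None
```

A loop such as "for message in iter_mbox(mboxfilename): print(message['Date'], message['Subject'])" then gives the same kind of listing, and plain_text(message) returns the text body when one exists.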