My Tech Life

Memo by a Japanese Software Developer in his late 50s.

How to group emails by week into arrays in Python

When analyzing data, deciding on the unit for aggregating it is crucial.

 

I am currently analyzing data obtained from a bulk email collection, and aggregating it on a weekly basis looks useful.

 

Here's a sample scenario:

 

The input data is prepared as a two-dimensional array: in each row, the first element is the date and the second element is the sample text.

 

As preparation:

  • Parse the date strings into datetime objects and collect them in a list.
  • Collect the texts in a separate list.
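
The preparation step can also be sketched with the standard library's email.utils.parsedate_to_datetime, which parses email-style date headers without a hand-written format string. This is only an alternative sketch, not the method used in this memo; the names rows, parsed_dates, and sample_texts are made up for illustration.

```python
from email.utils import parsedate_to_datetime

# Hypothetical sample rows in the same [date, text] layout as the input data.
rows = [
    ["Thu, 18 May 2017 02:24:34 +0000", "This is mail 1"],
    ["Wed, 13 Sep 2017 22:46:44 -0700", "This is mail 4"],
]

# Split the two columns into parallel lists.
# parsedate_to_datetime keeps the UTC offset from the header.
parsed_dates = [parsedate_to_datetime(row[0]) for row in rows]
sample_texts = [row[1] for row in rows]

print(parsed_dates[1].isoformat())  # 2017-09-13T22:46:44-07:00
```

Unlike strptime with %a and %b, this function does not depend on the current locale for day and month names.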

 

Then, determine the ISO week number for each date, and create keys from the year, the month, and that week number. These keys are used for a dictionary (hash).

 

Each entry in the dictionary has two sub-keys, 'dates' and 'texts', and each sub-key holds a list.
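
As a small illustration of this structure, here is a minimal sketch that builds the key with datetime.isocalendar() and stores the two lists per key. It uses collections.defaultdict in place of an explicit "key not in dict" check, which is my substitution for brevity; the sample values are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical dates chosen to show how the key is built.
sample = [
    (datetime(2017, 9, 12, tzinfo=timezone.utc), "a"),
    (datetime(2017, 9, 13, tzinfo=timezone.utc), "b"),
    (datetime(2017, 12, 25, tzinfo=timezone.utc), "c"),
]

# Each new key starts with empty 'dates' and 'texts' lists.
groups = defaultdict(lambda: {"dates": [], "texts": []})
for date, text in sample:
    # isocalendar() returns (ISO year, ISO week number, ISO weekday).
    iso_year, iso_week, iso_weekday = date.isocalendar()
    key = (date.year, date.month, iso_week)
    groups[key]["dates"].append(date)
    groups[key]["texts"].append(text)

print(sorted(groups))  # [(2017, 9, 37), (2017, 12, 52)]
```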

 

#!/usr/bin/env python3
# -*- encoding: utf-8 -*-
from datetime import datetime

data = [
    ["Thu, 18 May 2017 02:24:34 +0000","This is mail 1"],
    ["Thu, 08 Jun 2017 23:13:22 +0000","This is mail 2"],
    ["Mon, 25 Dec 2017 03:12:33 +0000","This is mail 3"],
    ["Wed, 13 Sep 2017 22:46:44 -0700","This is mail 4"],
    ["Wed, 13 Sep 2017 20:11:18 -0700","This is mail 5"],
    ["Wed, 13 Sep 2017 10:19:19 -0700","This is mail 6"],
    ["Tue, 12 Sep 2017 22:59:06 -0700","This is mail 7"]
]

# Parse the email-style date strings into timezone-aware datetime objects.
# Note: %a and %b are locale-dependent; this assumes English day and month names.
datetime_dates = [datetime.strptime(datum[0], "%a, %d %b %Y %H:%M:%S %z") for datum in data]
texts = [datum[1] for datum in data]

# Group by (year, month, ISO week number); isocalendar()[1] is the week number.
week_groups = {}
for date, text in zip(datetime_dates, texts):
    date_key = (date.year, date.month, date.isocalendar()[1])
    if date_key not in week_groups:
        week_groups[date_key] = {'dates': [], 'texts': []}
    week_groups[date_key]['dates'].append(date)
    week_groups[date_key]['texts'].append(text)

# Print each group in key order.
for key in sorted(week_groups):
    print(key)
    group = week_groups[key]
    for date, text in zip(group['dates'], group['texts']):
        print(date, text)

 

Here's the result. Each key is (year, month, ISO week number).

(base) C:\pytest>python test6.py
(2017, 5, 20)
2017-05-18 02:24:34+00:00 This is mail 1
(2017, 6, 23)
2017-06-08 23:13:22+00:00 This is mail 2
(2017, 9, 37)
2017-09-13 22:46:44-07:00 This is mail 4
2017-09-13 20:11:18-07:00 This is mail 5
2017-09-13 10:19:19-07:00 This is mail 6
2017-09-12 22:59:06-07:00 This is mail 7
(2017, 12, 52)
2017-12-25 03:12:33+00:00 This is mail 3
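
One caveat with a (year, month, week number) key: a week that spans two months is split into two groups, and around New Year the ISO week can belong to a different year than the calendar date. The sketch below compares that key with an alternative based on the ISO year and ISO week after converting to UTC. The helper names month_week_key and iso_week_key are hypothetical, and whether UTC is the right reference time zone depends on the analysis.

```python
from datetime import datetime, timezone

def month_week_key(d):
    # Key in the style used above: (year, month, ISO week number).
    return (d.year, d.month, d.isocalendar()[1])

def iso_week_key(d):
    # Alternative (an assumption, not the memo's method):
    # normalize to UTC, then use (ISO year, ISO week number).
    d = d.astimezone(timezone.utc)
    iso_year, iso_week, _ = d.isocalendar()
    return (iso_year, iso_week)

# Monday and Wednesday of the same ISO week, in different months.
mon = datetime(2017, 10, 30, 12, 0, tzinfo=timezone.utc)
wed = datetime(2017, 11, 1, 12, 0, tzinfo=timezone.utc)
# Sunday 1 January 2017 belongs to ISO week 52 of 2016.
new_year = datetime(2017, 1, 1, 12, 0, tzinfo=timezone.utc)

print(month_week_key(mon), month_week_key(wed))  # (2017, 10, 44) (2017, 11, 44)
print(iso_week_key(mon), iso_week_key(wed))      # (2017, 44) (2017, 44)
print(month_week_key(new_year), iso_week_key(new_year))  # (2017, 1, 52) (2016, 52)
```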