Wednesday, November 23, 2011

iMapBox

I've always been the type who, when confronted with a one-hour task, will instead take two hours to automate it. Here's an example.

VCdelta is my bot that tracks additions to VC portfolio pages. It has its own Twitter feed. Its Twitter feed is about to surpass my Twitter feed in number of followers. It seems my bot is more interesting than I am. I thought it would be interesting to graph the number of people who have followed me versus the number of people who have followed VCdelta over time. Twitter does not provide stats like that, but whenever I get a follow email from Twitter, I hit archive, not delete. So all I needed to do was count the follow emails by month.

Turns out Python doesn't have a very good library for using a mailbox as a data source. The Python email libraries assume you are planning on writing an email client. So I wrote an abstraction layer for the Python IMAP library. Code is here*.

Here's the code to collect the dates of my Twitter follow emails:

import IMapBox 

me=IMapBox.IMapBox("imap.gmail.com",my_acct,my_pwd)
mymail=me["[Gmail]/All Mail"]

myfollows=mymail.frm("twitter").subject("following")

mydates=[myfollows[x]['date'] for x in myfollows]

The 'me=' and 'mymail=' open a connection to my email account and select a mailbox, in this case the All Mail mailbox. (The command 'me.list()' lists all the mailboxes for the account.)

The next line filters mymail so myfollows is only emails from Twitter that have 'following' in the subject line**. iMapBox is lazy--it doesn't fetch the emails itself until it has to--so this is pretty fast. myfollows acts like a dictionary, so you can len() it, ask for the keys()--these would be the message IDs--or the items(), iterate over it, or get items.
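To make the laziness concrete, here is a minimal sketch of how a chained, dictionary-like filter could defer its work. This is not iMapBox's actual code: LazyFilter, its search callback, and the fake server are all made-up names for illustration.

```python
class LazyFilter(object):
    """Hypothetical stand-in for a lazily evaluated mailbox filter."""

    def __init__(self, search, criteria=()):
        self._search = search            # callable: criteria tuple -> message IDs
        self._criteria = tuple(criteria)
        self._ids = None                 # filled in on first real use

    def frm(self, who):
        # Chaining only records the criterion; nothing is sent to the server.
        return LazyFilter(self._search, self._criteria + (("FROM", who),))

    def subject(self, text):
        return LazyFilter(self._search, self._criteria + (("SUBJECT", text),))

    def keys(self):
        if self._ids is None:            # first use triggers the one search
            self._ids = list(self._search(self._criteria))
        return self._ids

    def __len__(self):
        return len(self.keys())

    def __iter__(self):
        return iter(self.keys())


calls = []

def fake_search(criteria):
    # Pretend server: logs each search and returns canned message IDs.
    calls.append(criteria)
    return [101, 102, 103]

follows = LazyFilter(fake_search).frm("twitter").subject("following")
searches_before_use = len(calls)   # chaining alone did no searching
count = len(follows)               # this is the first real search
```

Each chained call returns a new object carrying the accumulated criteria, so the search only happens once something asks for the actual message IDs.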

Each of the items in the dictionary is an email message. These also act like dictionaries, with keys like 'from','to','subject','date', and 'text'. The next line creates a list called mydates of the date each follow email was sent. It does this by iterating over each item in myfollows and pulling its date out. This is the slower part: when you set up an iterator, iMapBox gets all the headers***.

The part about counting follows per date I will leave as an exercise for the reader. Here's the graph of my follows and VCdelta's follows. I've been tweeting for some three years, VCdelta for six months.
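For anyone who would rather skip the exercise, here is one way to do the counting. It assumes the 'date' values are raw RFC 2822 strings as found in a Date header, which may not match what iMapBox returns; count_by_month is a made-up helper and the sample dates are invented.

```python
from collections import Counter
from email.utils import parsedate_tz

def count_by_month(date_strings):
    """Return a Counter mapping (year, month) to the number of dates in it."""
    counts = Counter()
    for s in date_strings:
        parsed = parsedate_tz(s)      # None if the string can't be parsed
        if parsed is None:
            continue
        counts[(parsed[0], parsed[1])] += 1
    return counts

# Invented sample data standing in for mydates.
sample_dates = [
    "Tue, 01 Nov 2011 09:15:00 -0500",
    "Wed, 23 Nov 2011 18:30:00 -0500",
    "Sat, 03 Dec 2011 11:00:00 -0500",
]
monthly = count_by_month(sample_dates)
for (yr, mo) in sorted(monthly):
    print("%d/%d\t%d" % (mo, yr, monthly[(yr, mo)]))
```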


On a side note, this is a logarithmic scale. The green line is my trend. This is odd, no? I mean, I'm not getting exponentially more popular, so this argues that a lot of follow behavior is algorithmic in some way. I had expected more linear growth. I also expect VCdelta to level out soon, as it reaches the limits of its natural audience.

Another example, email volume over time:



You can see where I started using my current email account full-time, in September 2006. And you can see when I started investing full-time, in mid-2009. And you can see why my email response time has slowed dramatically.

The code:

from datetime import date, timedelta
import IMapBox

me=IMapBox.IMapBox("imap.gmail.com",my_acct,my_pwd)
mymail=me["[Gmail]/All Mail"]

for yr in range(2006,2012):
 for mo in range(1,13):
  beg_month = date(yr,mo,1)
  end_month = date(yr+mo//12,mo%12+1,1)-timedelta(days=1)
  print mo,"/",yr,"\t",len(mymail.dates(beg_month,end_month))

This is an alternative way to count emails per month, filtering by date instead of collecting dates. The 'dates(x,y)' method filters the emails for only those that were received between date x and date y (inclusive). This is faster because even the headers are never fetched.
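One plausible reason a date filter needs no fetching at all is that IMAP servers can search by received date themselves. The sketch below shows how an inclusive date range might translate into IMAP SEARCH criteria. Whether iMapBox builds its queries this way is an assumption, and imap_date and date_criteria are made-up helper names.

```python
from datetime import date, timedelta

# IMAP wants English month abbreviations regardless of the local locale.
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def imap_date(d):
    """Format a date the way IMAP SEARCH expects, e.g. 1-Nov-2011."""
    return "%d-%s-%d" % (d.day, MONTHS[d.month - 1], d.year)

def date_criteria(first, last):
    """Criteria matching messages received from first through last, inclusive."""
    # SINCE includes its date; BEFORE excludes its date, so step one day past.
    return "SINCE %s BEFORE %s" % (imap_date(first),
                                   imap_date(last + timedelta(days=1)))

november = date_criteria(date(2011, 11, 1), date(2011, 11, 30))
december = date_criteria(date(2011, 12, 1), date(2011, 12, 31))
```

With the standard imaplib module, a string like this could be handed to the connection's search method, and the server would reply with matching message numbers only.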

Some other ways to use it:

c=mymail.frm('josh')+mymail.frm('matt')
d=mymail.frm('josh')-mymail.to('matt')
e=mymail.today()
f=-mymail.today()

The first is all messages from either Josh or Matt. The second is all messages from Josh that aren't also to Matt. The third is all of today's messages, and the fourth is all messages except today's.
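The set-style operators map naturally onto IMAP's own OR and NOT search keys. Here is a rough sketch of how such overloading could work. Query is a hypothetical class, not the one in iMapBox, and it only builds the search string rather than talking to a server.

```python
class Query(object):
    """Hypothetical wrapper around one IMAP SEARCH expression."""

    def __init__(self, expr):
        self.expr = expr

    def __add__(self, other):
        # Union: messages matching either side.
        return Query("OR (%s) (%s)" % (self.expr, other.expr))

    def __sub__(self, other):
        # Difference: messages matching self but not other.
        return Query("(%s) NOT (%s)" % (self.expr, other.expr))

    def __neg__(self):
        # Complement: everything that does not match.
        return Query("NOT (%s)" % self.expr)


def frm(who):
    return Query('FROM "%s"' % who)

def to(who):
    return Query('TO "%s"' % who)

c = frm("josh") + frm("matt")
d = frm("josh") - to("matt")
f = -frm("josh")
```

Because each operator returns another Query, the results can be combined further before anything is sent to the server.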

 ----- 
 * I'm an electrical engineer, not a computer scientist. So I can build a waveguide to your specifications, but I'm not entirely sure that this code is all that good. Please, feel free to fork, suggest improvements, make improvements, tutor me on garbage collection or unit testing, whatever. 
 ** I like object chaining. I know it's not Pythonic, but I'm not sure why. Since I don't really understand too deeply how Python garbage collects, it strikes me that this may be creating extraneous intermediate objects. If you plan to use this in any sort of real code, you might want to figure that out. I did notice that if I object-chain the IMAP connection ('me' in this example), it gets dereferenced and gc'd, which invokes the very polite __del__ method, closing the connection. I'm not sure how to avoid that, so I just commented out the __del__ method, leaving a messy open connection to the server. 
*** My thinking is to only go do the time-consuming fetching of messages when needed: when an email message object is referenced or when an iterator is set up (on the assumption that when you set up an iterator, you plan to consume the whole set of messages). The latter is because fetching 100 messages in a single fetch is far faster than 100 single-message fetches. The default is to only fetch the headers, except when the text itself is explicitly asked for. This default can be changed by setting priority='both' or priority='text' when you call iMapBox to open a connection to the server.
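
The batching idea in the last footnote can be sketched with the standard imaplib module, whose fetch call accepts a comma-separated set of message numbers. chunked_sets, fetch_headers, and the chunk size are assumptions for illustration, not part of iMapBox, and the IDs are assumed to be plain integers.

```python
def chunked_sets(ids, size=100):
    """Yield comma-separated message sets of at most `size` IDs each."""
    ids = [str(i) for i in ids]
    for start in range(0, len(ids), size):
        yield ",".join(ids[start:start + size])

def fetch_headers(conn, ids, size=100):
    """Fetch headers for many messages in a few round trips.

    `conn` is assumed to be a logged-in imaplib.IMAP4 connection with a
    mailbox already selected.
    """
    results = []
    for message_set in chunked_sets(ids, size):
        # BODY.PEEK reads the headers without marking messages as seen.
        status, data = conn.fetch(message_set, "(BODY.PEEK[HEADER])")
        if status == "OK":
            results.extend(data)
    return results

# Only the chunking helper is exercised here; it needs no server.
sets = list(chunked_sets(range(1, 8), size=3))
```

Seven messages with a chunk size of three would take three round trips instead of seven, and the saving grows with the size of the mailbox.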