Correlating emails and football matches


As those of you who know me even a little will know, I’m a football (soccer to you Americans) fan. I subscribe to a mailing list on which fans of my team debate our team’s performances and, generally, how we’ve botched up another season by not buying the correct players, persisting with deadwood etc.

I wanted to do some simple analysis of whether more of us are likely to email the list when we lose (presumably to complain) rather than when we win. Obviously, F# to the rescue! So, I’ll show you some of the tools I used to get this data analysis going, and what we can do with the data once we have it.

Sourcing the data

You can think of this problem as having two data sources: –

  • Emails – How many emails are sent on a given day (or hour)
  • Football results – What were the results of my team, classified into Win / Lose / Draw, and what date did the match happen on

Getting football results is easy – this site hosts them all in CSV format, and with a small amount of effort (to make sure the schema is consistent across the different CSV files), we can use the CSV Type Provider from FSharp.Data to read them in, and then simplify them into our cut-down result model.
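A minimal sketch of that simplification step might look like the following. To keep it dependency-free it splits the CSV lines by hand rather than using the CSV Type Provider, and the column layout, the team names and the `MatchResult` record are all assumptions made for illustration, not the schema of the real files.

```fsharp
open System
open System.Globalization

// Cut-down result model: only what the analysis needs.
type Outcome = Win | Lose | Draw
type MatchResult = { Date : DateTime; Opponent : string; Outcome : Outcome }

// Hypothetical team name and column layout:
// Date,HomeTeam,AwayTeam,HomeGoals,AwayGoals
let myTeam = "My Team"

let sampleCsv =
    [ "Date,HomeTeam,AwayTeam,HomeGoals,AwayGoals"
      "2014-08-16,My Team,Rovers,2,1"
      "2014-08-23,United,My Team,2,2"
      "2014-08-30,City,Rovers,1,0"
      "2014-09-13,My Team,City,0,5" ]

// Turn one CSV row into a result, skipping matches that don't involve our team.
let toResult (line : string) =
    match line.Split ',' with
    | [| date; home; away; homeGoals; awayGoals |] when home = myTeam || away = myTeam ->
        let ours, theirs, opponent =
            if home = myTeam then int homeGoals, int awayGoals, away
            else int awayGoals, int homeGoals, home
        let outcome =
            if ours > theirs then Win
            elif ours < theirs then Lose
            else Draw
        Some { Date = DateTime.ParseExact(date, "yyyy-MM-dd", CultureInfo.InvariantCulture)
               Opponent = opponent
               Outcome = outcome }
    | _ -> None

// Drop the header row, then keep only the rows that produced a result.
let results = sampleCsv |> List.skip 1 |> List.choose toResult
```

With the type provider in place, the manual `Split` and `int` conversions would go away, since the columns arrive as typed properties; the classification into Win / Lose / Draw would stay much the same.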

Easy! What about emails? Well, to start with, we need to read them from our mail server so that we can do some more analysis in F#. The simplest way I found was to download them from my IMAP mail server and then persist the results to disk in one big file. Of course, if you had lots and lots of results, you might want to use something called a “data base” which can apparently store lots of data and allow you to “query” it. But I didn’t want to learn about all that stuff, so a single flat file was fine for me.

So, we use S22.Imap (available on NuGet) to get to our emails. It’s actually pretty easy to do (especially when you can use F# scripts to explore the API step-by-step) – just three lines of code. As I download the messages (which can be pretty slow – maybe 10 messages / sec) I then push them to an agent that in the background serializes the data using FsPickler and writes to disk.
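The agent half of that pipeline can be sketched with F#’s built-in `MailboxProcessor`, assuming that is the kind of agent meant here. To keep the sketch runnable without a mail server or extra packages, the IMAP download is replaced by a stand-in `download` function and the FsPickler serialisation by a plain tab-separated line; the `Email` record and the message type are hypothetical too.

```fsharp
open System
open System.IO

// Hypothetical cut-down email record; the real one would be filled from each IMAP message.
type Email = { Sent : DateTime; From : string; Subject : string }

type AgentMessage =
    | Store of Email
    | Flush of AsyncReplyChannel<int>

// The agent owns the file, so writes never overlap however quickly messages are posted.
let createWriter (path : string) =
    MailboxProcessor.Start(fun inbox ->
        let rec loop written = async {
            let! msg = inbox.Receive()
            match msg with
            | Store email ->
                // Stand-in for FsPickler: one tab-separated line per email.
                let line = sprintf "%s\t%s\t%s" (email.Sent.ToString "o") email.From email.Subject
                File.AppendAllLines(path, [ line ])
                return! loop (written + 1)
            | Flush reply ->
                // Everything posted before this message has been written by now.
                reply.Reply written
                return! loop written }
        loop 0)

// Stand-in for the slow IMAP download.
let download () =
    seq {
        for i in 1 .. 5 do
            yield { Sent = DateTime(2014, 11, 2).AddHours(float i)
                    From = sprintf "fan%d@example.com" i
                    Subject = "Not again" }
    }

let path = Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString("N") + ".txt")
let writer = createWriter path

// Posting returns immediately; the agent writes in the background.
download () |> Seq.iter (fun email -> writer.Post(Store email))

// Block until the queue has drained and get the number of emails stored.
let stored = writer.PostAndReply Flush
```

Because `Post` returns straight away, the slow download loop never waits on the disk, and the `Flush` message gives the caller a way to know when everything queued so far has been written.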

And that’s all that’s needed to get our data out. Now that we have our data, we can start to join the two together and visualise it. This is really where the power of F# comes in: it is a language that not only lets us easily source data from multiple, disparate sources, using either built-in .NET classes or F#-specific features like Type Providers, but also lets us manipulate that data, explore it in a REPL and finally come up with some sort of cogent analysis.

Exploring data in F#

The first thing I wanted to do was to overlay the two data sources on top of one another to see how peaks and troughs of email traffic related to football results. This is pretty simple: –

  • Get the number of emails per day for a given season
  • Get the dates of individual results for a given season
  • Overlay one on top of the other in a chart
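The first two steps can be sketched as below, leaving the actual charting to whichever library you prefer. The sample timestamps, the match date and the season boundaries (1 August to 31 May) are assumptions for illustration.

```fsharp
open System

// Hypothetical inputs: when each email was sent, and the match dates for the season.
let emailTimes =
    [ DateTime(2014, 11, 1, 9, 30, 0)
      DateTime(2014, 11, 2, 18, 5, 0)
      DateTime(2014, 11, 2, 18, 40, 0)
      DateTime(2014, 11, 2, 22, 15, 0)
      DateTime(2015, 8, 9, 12, 0, 0) ]   // falls in the following season

let matchDates = [ DateTime(2014, 11, 2) ]

// Assumption: a season runs from 1 August to 31 May of the following year.
let inSeason startYear (d : DateTime) =
    d >= DateTime(startYear, 8, 1) && d < DateTime(startYear + 1, 6, 1)

// Number of emails per calendar day for the 2014 season.
let emailsPerDay =
    emailTimes
    |> List.filter (inSeason 2014)
    |> List.countBy (fun sent -> sent.Date)
    |> List.sortBy fst

// Each day paired with its email count and whether a match was played that day.
let overlay =
    emailsPerDay
    |> List.map (fun (day, count) -> day, count, List.contains day matchDates)
```

The `overlay` list holds everything a chart needs: the daily counts for a line series and a flag marking the days to highlight as match days.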

FootballChart1

It’s easy to see that peaks in email traffic coincide with the dates of football matches. We can then enhance this based on whether the team won, drew or lost, to see if people are more likely to write to the list if, say, your team was hammered 5-0 at home on a wet November Sunday evening, or if you won 1-0 against a mid-table side through a late penalty.

FootballChart2

What about identifying the most “popular” matches – the ones that caused the most feedback? For this, I wanted to count not just the emails sent on the day itself, but also those sent over the next 48 hours. So we judge a match’s “popularity” by how many emails are sent on the day of the match plus the following two days. This is also straightforward in F# using the Seq.windowed function.
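As a rough sketch of the windowing step, `Seq.windowed 3` slides a three-element window over the daily counts, and each window is summed. The sample counts are made up, and the sketch assumes the input has one entry per calendar day, with zero-email days included, so that three consecutive elements really are three consecutive days.

```fsharp
open System

// Hypothetical daily email counts. Days with no email must be present as zero
// so that each window covers three consecutive calendar days.
let dailyCounts =
    [ DateTime(2014, 11, 1), 4
      DateTime(2014, 11, 2), 31
      DateTime(2014, 11, 3), 18
      DateTime(2014, 11, 4), 6
      DateTime(2014, 11, 5), 0 ]

// For each day, the total emails for that day plus the following two.
let threeDayTotals =
    dailyCounts
    |> Seq.windowed 3
    |> Seq.map (fun window -> fst window.[0], Array.sumBy snd window)
    |> Seq.toList
```

Note that the last two days of the input get no total of their own, since there is no full three-day window starting on them.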

So now we can see for any given day how many emails were generated for that day and the next two, combined. Then it’s just a simple matter of joining the two datasets together, sorting the data in descending order, and charting it.
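The join and sort might be sketched like this, assuming the three-day totals are held in a `Map` keyed by date. The results and totals shown are invented sample data, and the tuple shapes are just one convenient choice.

```fsharp
open System

type Outcome = Win | Lose | Draw

// Hypothetical inputs: match results, and three-day email totals keyed by date.
let results =
    [ DateTime(2014, 11, 2), "City", Lose
      DateTime(2014, 11, 8), "Rovers", Win
      DateTime(2014, 11, 22), "United", Draw ]

let threeDayTotals =
    Map.ofList
        [ DateTime(2014, 11, 2), 55
          DateTime(2014, 11, 8), 12
          DateTime(2014, 11, 22), 27 ]

// Join on the match date, then rank matches by the email volume they triggered.
let mostDiscussed =
    results
    |> List.choose (fun (date, opponent, outcome) ->
        threeDayTotals
        |> Map.tryFind date
        |> Option.map (fun emails -> opponent, outcome, emails))
    |> List.sortByDescending (fun (_, _, emails) -> emails)
```

Using `List.choose` with `Map.tryFind` quietly drops any match that has no email total, which is a reasonable default for exploration but worth revisiting if the two datasets are expected to line up exactly.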

FootballChart3

Conclusion

We started with an IMAP mailbox and a few CSV files of football results; we were able to download emails into an F# record using just a few lines of code, and then persist them to the file system easily. Then we were able to merge these two disparate datasets into a single coherent form and, by visualising it, see a clear correlation between football matches and emails sent on the same day.

It’s important to understand that a lot of the power of this comes from the REPL in F#: the ability to quickly try out different data manipulations, change the structure of your output, and rapidly visualise it. In addition, the language and collection libraries work in tandem with the REPL because they, too, are lightweight – there’s no boilerplate syntax, classes, curly braces etc. I can simply start exploring the data and coming up with conclusions, and then, when I’m done, perhaps push this into a fully-featured DLL that I run once a day to pull down the latest emails and update my charts.
