As those of you who know me even a little, you’ll know I’m a football (soccer to you Americans) fan. I subscribe to a mailing list in where fans of my team debate our team’s performances and generally how we’ve botched up another season by not buying the correct players, persisting with deadwood etc. etc..
I wanted to do some simple analysis of whether more of us are likely to email the list when we lose (presumably to complain) rather than when we win. Obviously, F# to the rescue! So, I’ll show you some of the tools I used to get this data analysis going, and what we can do with the data once we have it.
Sourcing the data
You can think of this problem as having two data sources: –
- Emails – How many emails are sent on a given day (or hour)
- Football results – What were the results of my team, classified into Win / Lose / Draw, and what date did the match happen on
Getting football results are easy – this site hosts them all in CSV format, and with a small amount of effort (to make sure the schema across different csv files are consistent), we can use the CSV Type Provider from FSharp.Data to read them in, and then simplify them into our cut down result model: –
Easy! What about emails? Well, to start with, we need to read them from our mail server so that we can do some more analysis in F#. The simplest way that I did that was to download them from my IMAP mail server, and then persist the results to disk in a big file. Of course, if you had lots and lots of results, you might want to use something called a “data base” which can apparently store lots of data and allow you to “query” it. But I didn’t want to learn about all that stuff, so a single flat file was fine for me.
So, we use S22.Imap (available on NuGet) to get to our emails. It’s actually pretty easy to do (especially when you can use F# scripts to explore the API step-by-step) – just three lines of code. As I download the messages (which can be pretty slow – maybe 10 messages / sec) I then push them to an agent that in the background serializes the data using FsPickler and writes to disk.
And that’s all that’s needed to get our data out. Now that we have our data, we can start to join the two together and visualise it. This is really where the power of F# comes in as a language that not only allows us to easily source data from multiple, disparate sources, using either built-in .NET classes, or F#-specific features like Type Providers, but now to start manipulating it, exploring it in a REPL and finally coming up with some sort of cogent analysis.
Exploring data in F#
The first thing I wanted to do was to overlay the two data sources on top of one another to see how peaks and troughs of email traffic related to football results. This is pretty simple: –
- Get the number of emails per day for a given season
- Get the dates of individual results for a given season
- Overlay one on top of the other in a chart
It’s easy to see that peaks in the amount of email traffic coincides with dates of football matches. We can then enhance this based on whether the team won, drew or lost to try to see if people are more likely to write an email to the list if e.g. your team were hammered 5-0 at home on a wet November Sunday evening, or if you won 1-0 against a mid-table side through a late penalty.
What about identifying the most “popular” matches – ones that caused the most feedback. For this, I wanted to not just use emails on that day, but for emails for the next 48 hours. So we judge a match’s “popularity” by how many emails are sent for the day of the match plus the following two days. This is also rudimentary in F# using the seq.Window function: –
So now we can see for any given day how many emails were generated for that day and the next two, combined. Then it’s just a simple matter of joining the two datasets together, sorting the data in descending order, and charting it.
We started with an IMAP mailbox and a few CSV files of football results; we were able to download emails into an F# record using just a few lines of code, and then persisted it to the file system easily. Then we were able to merge these two disparate datasets into a single coherent form through visualisation to see a clear correlation between football matches and emails sent on the same day.
It’s important to understand that a lot of the power of this comes from the REPL in F#; the ability to quickly try out some different data manipulation, change the structure of your output quickly, and rapidly visualise it. In addition, the language and collection libraries work in tandem with the REPL because they, too, are lightweight – there’s no boilerplate syntax, classes, curly braces etc. – I can simply start exploring the data, coming up with conclusions, and then when I’m done, perhaps I might push this into a fully-featured DLL that I run once a day to pull down the latest emails and update my charts with.