Introducing Type Providers

abraham_fsharp_hiresmeap

Note: This article has been excerpted from my upcoming book, Learn F# . Save 37% off with code fccabraham!

Welcome to the world of data! This article will:

  • Gently introduce us to what type providers are
  • Get us up to speed with the most popular type provider, FSharp.Data.

What Are Type Providers?

Type Providers are a language feature first introduced in F#3.0:

An F# type provider is a component that provides types, properties, and methods for use in your program. Type providers are a significant part of F# 3.0 support for information-rich programming.

https://docs.microsoft.com/en-us/dotnet/articles/fsharp/tutorials/type-providers/index

At first glance, this sounds a bit fluffy – we already know what types, properties and methods are. And what does “information-rich programming” mean? Well, the short answer is to think of Type Providers as T4 Templates on steroids – that is, a form of code generation, but one that lives inside the F# compiler. Confused? Read on.

Understanding Type Providers

Let’s look at a somewhat holistic view of type providers first, before diving in and working with one to actually see what the fuss is all about. You’re already familiar with the notion of a compiler that parses C# (or F#) code and builds MSIL from which we can run applications, and if you’ve ever used Entity Framework (particularly the earlier versions) – or old-school SOAP web services in Visual Studio – you’ll be familiar with the idea of code generation tools such as T4 Templates or the like. These are tools which can generate C# code from another language or data source.

figure1
Figure 1 – Entity Framework Database-First code generation process

Ultimately, T4 Templates and the like, whilst useful, are somewhat awkward to use. For example, you need to attach them into the build system to get them up and running, and they use a custom markup language with C# embedded in them – they’re not great to work with or distribute.

At their most basic, Type Providers are themselves just F# assemblies (that anyone can write) that can be plugged into the F# compiler – and can then be used at edit-time to generate entire type systems for you to work with as you type. In a sense, Type Providers serve a similar purpose to T4 Templates, except they are much more powerful, more extensible, more lightweight to use, and are extremely flexible – they can be used with what I call “live” data sources, as well as offering a gateway not just to data sources but also to other programming languages.

figure2
Figure 2 – A set of F# Type Providers with supported data sources

Unlike T4 Templates, Type Providers can affect type systems without re-building the project, since they run in the background as you write code. There are dozens, if not hundreds of Type Providers out there, from working with simple flat files such as CSV, to SQL, to cloud-based data storage repositories such as Microsoft Azure Storage or Amazon Web Services S3. The term “information-rich programming” refers to the concept of bringing disparate data sources into the F# programming language in an extensible way.

Don’t worry if that sounds a little confusing – we’ll take a look at our first Type Provider in just a second.

Quick Check

  1. What is a Type Provider?
  2. How do type providers differ from T4 templates?
  3. Is the number of type providers fixed?

Working with our first Type Provider

Let’s look at a simple example of a data access challenge – working with some soccer results, except rather than work with an in-memory dataset, we’ll work with a larger, external data source – a CSV file that you can download at https://raw.githubusercontent.com/isaacabraham/learnfsharp/master/data/FootballResults.csv. You need to answer the following question: which three teams won at home the most over the whole season.

Working with CSV files today

Let’s first think about the typical process that you might use to answer this question: –

figure3
Figure 3 – Steps to parse a CSV file in order to perform a calculation on it.

Before we can even begin to perform the calculation, we first need to understand the data. This normally means looking at the source CSV file in something like Excel, and then designing a C# type to “match” the data in the CSV. Then, we do all of the usual boilerplate parsing – opening a handle to the file, skipping the header row, splitting on commas, pulling out the correct columns and parsing into the correct data types, etc. Only after doing all of that can you actually start to work with the data and produce something actually valuable. Most likely, you’ll use a console application to get the results, too. This process is more like typical software engineering – not a great fit when we want to explore some data quickly and easily.

Introducing FSharp.Data

We could quite happily do the above in F#; at least using the REPL affords us a more “exploratory” way of development. However, it wouldn’t remove the whole boilerplate element of parsing the file – and this is where our first Type Provider comes in – FSharp.Data.

FSharp.Data is an open source, freely distributable NuGet package which is designed to provide generated types when working with data in CSV, JSON, or XML formats. Let’s try it out with our CSV file.

Scripts for the win

At this point, I’m going to advise that you move away from heavyweight solutions and start to work exclusively with standalone scripts – this fits much better with what we’re going to be doing. You’ll notice a build.cmd file in the learnfsharp code repository (https://github.com/isaacabraham/learnfsharp). Run it – it uses Paket to download a number of NuGet packages into the packages folder, which you can reference directly from your scripts. This means we don’t need a project or solution to start coding – we can just create scripts and jump straight in. I’d recommend creating your scripts in the src/code-listings/ folder (or another folder at the same level, e.g. src/learning/) so that the package references shown in the listings here work without needing changes.

Now you try

  1. Create a new standalone script in Visual Studio using File -> New. You don’t need a solution here – remember that a script can work standalone.
  2. Save the newly created file into an appropriate location as described in “Scripts for the win”.
  3. Enter the following code from listing 1:
// Referencing the FSharp.Data assembly
#r @"..\..\packages\FSharp.Data\lib\net40\FSharp.Data.dll"
open FSharp.Data

// Connecting to the CSV file to provide types based on the supplied file
type Football = CsvProvider< @"..\..\data\FootballResults.csv">

// Loading in all data from the supplied CSV file
let data = Football.GetSample().Rows |> Seq.toArray

That’s it. You’ve now parsed the data, converted it into a type that you can consume from F# and loaded it into memory. Don’t believe me? Check this out: –

figure4
Figure 4 – Accessing a Provided Type from FSharp.Data

You now have full intellisense to the dataset – that’s it! You don’t have to manually parse the data set – that’s been done for you. You also don’t need to “figure out” the types – the Type Provider will scan through the first few rows and infer the types based on the contents of the file! In effect, this means that rather than using a tool such as Excel to “understand” the data, you can now begin to use F# as a tool to both understand and explore your data.

Backtick members

You’ll see from the screenshot above, as well as from the code when you try it out yourself, that the fields listed have spaces in them! It turns out that this isn’t actually a type provider feature, but one that’s available throughout F# called backtick members. Just place a double backtick (“) at the beginning and end of the member definition and you can put spaces, numbers or other characters in the member definition. Note that Visual Studio doesn’t correctly provide intellisense for these in all cases, e.g. let-bound members on modules, but it works fine on classes and records.

Whilst we’re at it, we’ll also bring down an easy-to-use F#-friendly charting library, XPlot. This library gives us access to charts available in Google Charts as well as Plotly. We’ll use the Google Charts API here, which means adding dependencies to XPlot.GoogleCharts (which also brings down the Google.DataTable.Net.Wrapper package).

  1. Add references to both the GoogleCharts and Google.DataTable.Net.Wrapper assemblies. If you’re using standalone scripts, both packages will be in the packages folder after running build.cmd – just use #r to reference the assembly inside one of the lib/net folders.
  2. Open up the GoogleCharts namespace.
  3. Execute the following code to calculate the result and plot them as a chart.
data
|> Seq.filter(fun row ->
    row.``Full Time Home Goals`` > row.``Full Time Away Goals``)
|> Seq.countBy(fun row -> row.``Home Team``)
|> Seq.sortByDescending snd
|> Seq.take 10
|> Chart.Column
|> Chart.Show
// countBy generates a sequence of tuples (team vs number of wins)
// Chart.Column converts the sequence of tuples into an XPlot Column Chart
// Chart.Show displays the chart in a browser window
figure5
Figure 5 – Visualising data sourced from the CSV Type Provider

In just a few lines of code, we were able to open up a CSV file we’ve never seen, explore the schema of it, perform some operations on it, and then chart it in less than 20 lines of code – not bad! This ability to rapidly work with and explore datasets that we’ve not even seen before, whilst still allowing us to interact with the full breadth of .NET libraries that are out there gives F# unparalleled abilities for bringing in disparate data sources to full-blown applications.

Type Erasure

The vast majority of type providers fall into the category of erasing type providers. The upshot of this is that the types generated by the provider exist only at compile time. At runtime, the types are erased and usually compile down to plain objects; if you try to use reflection over them, you won’t see the fields that you get in the code editor.

One of the downsides is that this makes them extremely difficult (if not impossible) to work with in C#. On the flip side, they are extremely efficient – you can use erasing type providers to create type systems with thousands of types without any runtime overhead, since at runtime they’re just of type Object.

Generative type providers allow for run-time reflection, but are much less commonly used (and from what I understand, much harder to develop).

If you want to know more, download the free first chapter of Learn F# and see this Slideshare presentation. Don’t forget to save 37% with code fccabraham.

Correlating emails and football matches


As those of you who know me even a little, you’ll know I’m a football (soccer to you Americans) fan. I subscribe to a mailing list in where fans of my team debate our team’s performances and generally how we’ve botched up another season by not buying the correct players, persisting with deadwood etc. etc..

I wanted to do some simple analysis of whether more of us are likely to email the list when we lose (presumably to complain) rather than when we win. Obviously, F# to the rescue! So, I’ll show you some of the tools I used to get this data analysis going, and what we can do with the data once we have it.

Sourcing the data

You can think of this problem as having two data sources: –

  • Emails – How many emails are sent on a given day (or hour)
  • Football results – What were the results of my team, classified into Win / Lose / Draw, and what date did the match happen on

Getting football results are easy – this site hosts them all in CSV format, and with a small amount of effort (to make sure the schema across different csv files are consistent), we can use the CSV Type Provider from FSharp.Data to read them in, and then simplify them into our cut down result model: –

Easy! What about emails? Well, to start with, we need to read them from our mail server so that we can do some more analysis in F#. The simplest way that I did that was to download them from my IMAP mail server, and then persist the results to disk in a big file. Of course, if you had lots and lots of results, you might want to use something called a “data base” which can apparently store lots of data and allow you to “query” it. But I didn’t want to learn about all that stuff, so a single flat file was fine for me.

So, we use S22.Imap (available on NuGet) to get to our emails. It’s actually pretty easy to do (especially when you can use F# scripts to explore the API step-by-step) – just three lines of code. As I download the messages (which can be pretty slow – maybe 10 messages / sec) I then push them to an agent that in the background serializes the data using FsPickler and writes to disk.

And that’s all that’s needed to get our data out. Now that we have our data, we can start to join the two together and visualise it. This is really where the power of F# comes in as a language that not only allows us to easily source data from multiple, disparate sources, using either built-in .NET classes, or F#-specific features like Type Providers, but now to start manipulating it, exploring it in a REPL and finally coming up with some sort of cogent analysis.

Exploring data in F#

The first thing I wanted to do was to overlay the two data sources on top of one another to see how peaks and troughs of email traffic related to football results. This is pretty simple: –

  • Get the number of emails per day for a given season
  • Get the dates of individual results for a given season
  • Overlay one on top of the other in a chart

FootballChart1

It’s easy to see that peaks in the amount of email traffic coincides with dates of football matches. We can then enhance this based on whether the team won, drew or lost to try to see if people are more likely to write an email to the list if e.g. your team were hammered 5-0 at home on a wet November Sunday evening, or if you won 1-0 against a mid-table side through a late penalty.

FootballChart2

What about identifying the most “popular” matches – ones that caused the most feedback. For this, I wanted to not just use emails on that day, but for emails for the next 48 hours. So we judge a match’s “popularity” by how many emails are sent for the day of the match plus the following two days. This is also rudimentary in F# using the seq.Window function: –

So now we can see for any given day how many emails were generated for that day and the next two, combined. Then it’s just a simple matter of joining the two datasets together, sorting the data in descending order, and charting it.

FootballChart3

Conclusion

We started with an IMAP mailbox and a few CSV files of football results; we were able to download emails into an F# record using just a few lines of code, and then persisted it to the file system easily. Then we were able to merge these two disparate datasets into a single coherent form through visualisation to see a clear correlation between football matches and emails sent on the same day.

It’s important to understand that a lot of the power of this comes from the REPL in F#; the ability to quickly try out some different data manipulation, change the structure of your output quickly, and rapidly visualise it. In addition, the language and collection libraries work in tandem with the REPL because they, too, are lightweight – there’s no boilerplate syntax, classes, curly braces etc. – I can simply start exploring the data, coming up with conclusions, and then when I’m done, perhaps I might push this into a fully-featured DLL that I run once a day to pull down the latest emails and update my charts with.