Having spent a while using Hadoop on HDInsight now, I wanted to look at writing Hadoop mappers and reducers in F#. There are several reasons for choosing it over other languages such as Java, Python and C#. I’m not going to go into all of the usual features of F# over other languages, but the main reason is that F# lets you just “get on” with dealing with data. That, in my opinion, is one of its main strengths, and it’s what most map-reduce jobs are about.
There’s already a .NET SDK for Hadoop that Microsoft have released. However, it does have some issues, not just in terms of functionality but also in terms of how well it maps to F#. The main problem I have with it is that you write your code in an object hierarchy, inheriting from MapperBase or ReducerCombinerBase. You then have to mutate the Context that’s passed in with any outputs from your Mapper or Reducer.
I wanted something that was a bit more lightweight, and also allowed me to explore creating a parser for the Streaming Hadoop inputs. So, I’ve now put HadoopFs on GitHub, with the intention to put it on NuGet in the near future. The main thing it gives you is the ability to write mappers and reducers very easily without the need to inherit from any classes, plus a flexible IO mechanism, so you can pipe data in or out from the “real” console (for use with the real Hadoop), the file system, in-memory lists etc. (essentially anything that can be used to generate a sequence of strings). So the prototypical wordcount map / reduce looks like this: -
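A sketch of what that wordcount might look like in this style – plain functions, no base classes (the exact signatures HadoopFs expects may differ slightly):

```fsharp
// Mapper: one input row in, a sequence of key/value pairs out.
let mapper (row : string) =
    row.Split ' ' |> Seq.map (fun word -> word, 1)

// Reducer: a key and its (string-typed) values in, a single output value out.
let reducer (key : string) (values : seq<string>) =
    values |> Seq.sumBy int |> string
```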
Three lines for the mapper (including the function declaration) and four lines for the reducer. Nice. Notice that you do not need any dependency on HadoopFs to write your map / reduce code. It’s just a couple of arbitrary functions, which has several benefits. Firstly, it’s more accessible than having to understand a “framework” – all you have to do is understand the Hadoop MR paradigm and you’re good to go. Secondly, it’s easier to test – it’s always far easier to test a pure function than something which involves e.g. mutating the state of some “context” object that you need to create and provide.
The only time you use the HadoopFs types and functions is when plugging your MR code into an executable for use with Hadoop:-
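To sketch the idea without the library itself – `runMapper` below is an illustrative stand-in for the real HadoopFs entry point – the IO flexibility boils down to the fact that everything just works over a sequence of strings:

```fsharp
/// Runs a mapper over any sequence of input lines and yields Hadoop
/// streaming-style tab-separated output.
let runMapper (mapper : string -> seq<string * int>) (input : seq<string>) =
    input
    |> Seq.collect mapper
    |> Seq.map (fun (key, value) -> sprintf "%s\t%d" key value)

// Against the real Hadoop you'd feed in stdin and write to stdout; in a
// unit test you just pass an in-memory list and inspect the output.
let output =
    [ "the quick fox"; "the lazy dog" ]
    |> runMapper (fun row -> row.Split ' ' |> Seq.map (fun word -> word, 1))
    |> List.ofSeq
```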
You can see from the last example how you can essentially plug in any input / output source e.g. file system or console etc.. This is very useful for e.g. unit testing as you can simply provide an in-memory list of strings and get back the output from a full map-reduce.
I still have some more work to do on it – some cleaning up of the function signatures for consistency etc., and there’s no doubt some extra corner cases to deal with, but as an experiment in doing this in a day or so, it was a good learning exercise in Hadoop streaming. Indeed, the hardest part was actually in generating a lazy group of key/values for the reduce from a flat list of sorted input rows. I’d also like to write a generic MapReduce executable that can be parameterised for the mapper or reducer that you need.
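To give a flavour of that grouping problem: Hadoop streaming hands the reducer a flat, key-sorted stream of tab-separated rows, which must be regrouped by key. HadoopFs does this lazily in a single pass; the eager `Seq.groupBy` version below is just a sketch of the shape of the problem:

```fsharp
/// Turns sorted "key<TAB>value" rows into (key, values) groups for a reducer.
/// Note this version consumes the whole input up front – the lazy, streaming
/// equivalent is the hard part.
let toReducerGroups (rows : seq<string>) =
    rows
    |> Seq.map (fun row ->
        let parts = row.Split '\t'
        parts.[0], parts.[1])
    |> Seq.groupBy fst
    |> Seq.map (fun (key, pairs) -> key, pairs |> Seq.map snd |> List.ofSeq)
    |> List.ofSeq
```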
All said though, considering the entire framework including test helper classes is less than 150 lines of code, it’s quite nice I think.
In the words of Professor Farnsworth – good news, everyone! I’ve finally gotten around to looking at adding some basic Azure Table Storage support to the Azure Type Provider.
Why Table Storage?
There are some difficulties with interacting with Azure Table Storage through the native .NET API, some of which impacts how useful (or not) the Type Provider can be, and some of which the Type Provider can help with: -
- The basic API gives you back an IQueryable, but you can only use Where, Take and First. Any other calls will give a runtime exception
- You can write arbitrary queries against a table with the above restriction, but this will invoke a full table scan
- The quickest way of getting an entity is by the Partition and Row keys; otherwise you’ll effectively initiate a full (or at best, a partial) table scan
- You can’t get the number of rows in a table without iterating through every row
- You can’t get a complete list of partitions in a table without iterating through every row
- There’s no fixed schema. You can create your own types, but these need to inherit from TableEntity. Alternatively, you can use the DynamicTableEntity to give you key/value pair access to every row; however, accessing values of an entity is a pain, as you must pick a specific “getter” e.g. ValueAsBoolean or ValueAsString.
So, how does the Type Provider help you?
Well, first, you’ll automatically get back the list of tables in your storage account, for free. On dotting into a table, the provider will infer the schema based upon the first n rows (currently set to 20) and will automatically generate the entity type.
How do we do this? Well, a table doesn’t have a schema that all rows must conform to, but each cell of each entity returned carries metadata, including an EDM type which can be mapped to a regular .NET type; this is made easier when using the DynamicTableEntity. The generated properties in the Type Provider use the EDM data from the row to get the data back as the correct type e.g. String, Int32 etc., and different entities in the same table are collated into a single merged entity which is the sum of both shapes.
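As a toy illustration of the merging idea (this is nothing like the real implementation – entities are modelled here as simple maps of property name to EDM type name):

```fsharp
/// The inferred "schema" is just the union of all (name, EDM type)
/// pairs seen across the sampled entities.
let inferSchema (sample : Map<string, string> list) =
    sample
    |> List.collect Map.toList
    |> List.distinct
    |> List.sort

// Two entities with different shapes in the same table...
let merged =
    inferSchema
        [ Map [ "Name", "Edm.String"; "Cost", "Edm.Double" ]
          Map [ "Name", "Edm.String"; "Team", "Edm.String" ] ]
// ...merge into a single entity covering Name, Cost and Team.
```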
Once this is done, you can pull back all the rows from a specific table partition into memory and then query it to your heart’s content. Here’s a little sample to get you started – imagine a table as follows: -
Then with the Azure Type Provider you can do as follows: -
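Something along these lines – the provider’s type name, the table and partition names, and the member names are all illustrative and may differ from the released version:

```fsharp
// Hypothetical names throughout – adjust for your own storage account.
type Azure = AzureTypeProvider<"accountName", "accountKey">

// The partition key is supplied as plain text; all rows for that
// partition are pulled back into memory.
let players = Azure.Tables.Players.GetPartition "Midfielders"
let player = players |> Seq.head

let cost : float option = player.Cost  // strongly typed from the EDM metadata
```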
- The good: player is strongly typed, down to the fact that the Cost property is a float option (not a string or object).
- The ugly: You have to explicitly supply the Partition key as plain text. There’s no easy way to get a complete list of all partitions, although I am hoping to at least suggest some partition keys based on e.g. first 100 rows.
What doesn’t it do (yet)?
- You currently can’t write arbitrary queries to execute on the server. You can pull back all the entities for a particular partition key, but that’s it; nor can you specify a limit on how many entities to bring back. I want to look at ways that you can create query expressions over these provided types, or at least ways you can create “weak” queries (off the standard CreateQuery() call) and then pipe them into the provider
- All properties of all entities are option types. This is not so different from the underlying Table Storage fields in a DynamicTableEntity, which are returned as nullables for all value types (the only EDM reference type is String). It’s partly because there’s no easy way to know whether any column is optional or not, but I would like to let the user say that all fields should not be option types, and instead e.g. return default(T) or throw an exception
- You can’t retrieve an individual entity by Row Key (yet)
- You can’t download an entire table as a CSV yet – but you will be able to shortly
- No write support
- No async support (yet)
As those of you who know me even a little will know, I’m a football (soccer to you Americans) fan. I subscribe to a mailing list where fans of my team debate our team’s performances and, generally, how we’ve botched up another season by not buying the correct players, persisting with deadwood and so on.
I wanted to do some simple analysis of whether more of us are likely to email the list when we lose (presumably to complain) rather than when we win. Obviously, F# to the rescue! So, I’ll show you some of the tools I used to get this data analysis going, and what we can do with the data once we have it.
Sourcing the data
You can think of this problem as having two data sources: -
- Emails – How many emails are sent on a given day (or hour)
- Football results – What were the results of my team, classified into Win / Lose / Draw, and what date did the match happen on
Getting football results is easy – this site hosts them all in CSV format, and with a small amount of effort (to make sure the schema across different CSV files is consistent), we can use the CSV Type Provider from FSharp.Data to read them in, and then simplify them into our cut-down result model: -
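A sketch of that model – the CSV column names below (HomeGoals, AwayGoals) are placeholders for whatever the source site’s schema actually uses:

```fsharp
open System
open FSharp.Data

// "results.csv" is a local sample file that gives the provider its schema.
type ResultsFile = CsvProvider<"results.csv">

type Outcome = Win | Lose | Draw
type Result = { Date : DateTime; Outcome : Outcome }

// Collapse a full fixture row down to the two things we care about.
let toResult (row : ResultsFile.Row) =
    { Date = row.Date
      Outcome =
        if row.HomeGoals > row.AwayGoals then Win
        elif row.HomeGoals < row.AwayGoals then Lose
        else Draw }
```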
Easy! What about emails? Well, to start with, we need to read them from our mail server so that we can do some more analysis in F#. The simplest way that I did that was to download them from my IMAP mail server, and then persist the results to disk in a big file. Of course, if you had lots and lots of results, you might want to use something called a “data base” which can apparently store lots of data and allow you to “query” it. But I didn’t want to learn about all that stuff, so a single flat file was fine for me.
So, we use S22.Imap (available on NuGet) to get to our emails. It’s actually pretty easy to do (especially when you can use F# scripts to explore the API step-by-step) – just three lines of code. As I download the messages (which can be pretty slow – maybe 10 messages / sec) I then push them to an agent that in the background serializes the data using FsPickler and writes to disk.
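Roughly like this – the server details are placeholders, and the exact S22.Imap and FsPickler signatures may vary between versions:

```fsharp
open System.IO
open S22.Imap
open Nessos.FsPickler   // FsPickler.CreateBinary() in some older versions

let downloadEmails () =
    use client =
        new ImapClient("imap.example.com", 993, "user", "password", AuthMethod.Login, true)

    // An agent serialises messages to disk in the background while the
    // (slow) download continues.
    let writer = MailboxProcessor.Start(fun inbox -> async {
        let pickler = FsPickler.CreateBinarySerializer()
        use stream = File.OpenWrite "emails.bin"
        while true do
            let! message = inbox.Receive()
            pickler.Serialize(stream, message) })

    let uids = client.Search(SearchCondition.All())
    for message in client.GetMessages uids do
        writer.Post message
```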
And that’s all that’s needed to get our data out. Now that we have our data, we can start to join the two together and visualise it. This is really where the power of F# comes in as a language that not only allows us to easily source data from multiple, disparate sources, using either built-in .NET classes, or F#-specific features like Type Providers, but now to start manipulating it, exploring it in a REPL and finally coming up with some sort of cogent analysis.
Exploring data in F#
The first thing I wanted to do was to overlay the two data sources on top of one another to see how peaks and troughs of email traffic related to football results. This is pretty simple: -
- Get the number of emails per day for a given season
- Get the dates of individual results for a given season
- Overlay one on top of the other in a chart
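The three steps above can be sketched with FSharp.Charting – the two input lists are assumed to come from the sources described earlier, and the names are mine:

```fsharp
open System
open FSharp.Charting

// emailDates: when each email was sent; matchDates: when each match was played.
let overlay (emailDates : DateTime list) (matchDates : DateTime list) =
    let emailsPerDay =
        emailDates
        |> List.countBy (fun d -> d.Date)
        |> List.sortBy fst
    Chart.Combine
        [ Chart.Line emailsPerDay
          // match days plotted as points along the x-axis
          Chart.Point(matchDates |> List.map (fun d -> d, 0)) ]
```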
It’s easy to see that peaks in the amount of email traffic coincide with dates of football matches. We can then enhance this based on whether the team won, drew or lost, to try to see if people are more likely to write an email to the list if e.g. your team were hammered 5-0 at home on a wet November Sunday evening, or if you won 1-0 against a mid-table side through a late penalty.
What about identifying the most “popular” matches – the ones that generated the most feedback? For this, I wanted to use not just the emails on the day itself, but the emails for the next 48 hours too. So we judge a match’s “popularity” by how many emails are sent on the day of the match plus the following two days. This is also straightforward in F# using the Seq.windowed function: -
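Assuming a (date, email count) list with one entry per day, the windowing looks something like this:

```fsharp
open System

/// For each day, the emails sent on that day plus the following two days.
/// Assumes one entry per consecutive day (any gaps would need filling first).
let popularity (emailsPerDay : (DateTime * int) list) =
    emailsPerDay
    |> Seq.windowed 3
    |> Seq.map (fun window -> fst window.[0], window |> Seq.sumBy snd)
    |> List.ofSeq
```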
So now we can see for any given day how many emails were generated for that day and the next two, combined. Then it’s just a simple matter of joining the two datasets together, sorting the data in descending order, and charting it.
We started with an IMAP mailbox and a few CSV files of football results; we were able to download emails into an F# record using just a few lines of code, and then persist them to the file system easily. Then we were able to merge these two disparate datasets into a single coherent form through visualisation, to see a clear correlation between football matches and emails sent on the same day.
It’s important to understand that a lot of the power of this comes from the REPL in F#; the ability to quickly try out some different data manipulation, change the structure of your output quickly, and rapidly visualise it. In addition, the language and collection libraries work in tandem with the REPL because they, too, are lightweight – there’s no boilerplate syntax, classes, curly braces etc. – I can simply start exploring the data, coming up with conclusions, and then when I’m done, perhaps I might push this into a fully-featured DLL that I run once a day to pull down the latest emails and update my charts with.
There seem to be a number of posts out there on how to use SignalR with an IoC container e.g. MS Unity. Nearly all of them seem to take a sledgehammer approach to what most people generally want to do, which is create their Hubs with an IoC container. They generally don’t want to replace all of SignalR’s internal dependencies.
The easiest way to get dependencies injected into SignalR hubs is not by creating your own DefaultDependencyResolver – doing that will hand over control to you for creating not just Hubs, but basically all the different components within the SignalR pipeline. Worse still, for an IoC container like Unity, which can create concrete types that have not explicitly been registered, it can make life much more complicated.
A simpler approach is simply to register an implementation of the IHubActivator, as below: -
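A minimal sketch of that registration (shown in F#, though the C# translation is direct; `UnityHubActivator` is my own name for the class):

```fsharp
open Microsoft.AspNet.SignalR
open Microsoft.AspNet.SignalR.Hubs
open Microsoft.Practices.Unity

/// Delegates hub creation to the Unity container; everything else in the
/// SignalR pipeline is left untouched.
type UnityHubActivator(container : IUnityContainer) =
    interface IHubActivator with
        member __.Create descriptor =
            container.Resolve descriptor.HubType :?> IHub

// At application startup, before MapSignalR:
let registerHubActivator (container : IUnityContainer) =
    let activator = UnityHubActivator container :> IHubActivator
    GlobalHost.DependencyResolver.Register(typeof<IHubActivator>, fun () -> box activator)
```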
The HubActivator will get called only when SignalR needs to create a Hub specifically; you hand control over to your IoC container to create it, along with any of its dependencies. Much easier than the other approach, and easier to reason about.
I plan on blogging a bit more about my experiences with writing Type Providers in general, as there’s a dearth of material readily available online. At any rate, after several false starts, I now have a moderately usable version of an Azure Blob Storage type provider on GitHub!
It’s easy to consume – just create a specific type of account, passing in your account name and key, and away you go!
Using Type Providers to improve the developer experience
It’s important to notice the developer experience as you dot your way through. First you’ll get a live list of containers: -
Then, on dotting into a container, you’ll get a lazily-loaded list of files in that container: -
Finally, on dotting into a file, you’ll get some details on the last Copy of that file as well as the option to download the file: -
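In rough terms, it looks like this – the provider’s type name here is illustrative, and your own containers and files will obviously differ:

```fsharp
// Hypothetical account details and provider name.
type Storage = AzureTypeProvider<"accountName", "accountKey">

let container = Storage.Containers.``my-container``
let file = container.``report.csv``

// .txt, .xml and .csv files give you a strongly-typed textual download;
// any other extension would give Async<byte[]> instead.
let contents : Async<string> = file.Download()
```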
It’s important to note that the signature of the Download() function will change from file to file, depending on the extension of the file. If it’s a file that ends in .txt, .xml or .csv, it will return Async<string>; otherwise it’s Async<byte[]>. This is completely strongly typed – there’s no casting or dynamic typing involved, and you don’t have to worry about picking the correct overload for the file :-). This, for me, is a big value add over a static API, which cannot respond in this manner to the contents of the data that it operates over – and yet with a type provider it still retains static typing!
I think that this type provider is somewhat different to others like the excellent FSharp.Data ones, which are geared towards programmability etc. – this one is (currently) more suited to scripting and exploration of a Blob Storage account. I still need to make the provider more stable, and add some more creature comforts to it, but I’m hoping that this will make people’s lives a bit easier when you need to quickly and easily get some data out of (and in the future, into) a Blob store.
Since starting to deliver my “Power of F#” talk to user groups and companies (generally well received – I hope), and getting involved in a few Twitter debates on F#, I’ve noticed a few common themes regarding why .NET teams aren’t considering trying out F# to add to their development stack. Part of this is the usual spiel of misinformation about what F# is and is not (“it’s not a general purpose language”), but another part of it comes from a conservatism that really surprised me. That is, either: -
- There’s no / limited Resharper / CodeRush support for F#. Ergo, I won’t be able to develop “effectively” in it
- If I start learning it, no-one else in my team will know what I’m doing and so we can’t start using it
Now allow me to attempt to debunk these two statements.
Resharper (or CR) support is a non-issue for me. Personally, I use CodeRush over Resharper, but let’s be honest about what both of these are: Tools to speed up development time. Now, some of the issues that they solve aren’t as big an issue in F# as in C#. Perhaps the one I do miss the most is rename symbol, but others like Extract to Method aren’t as big a problem in F# due to the extremely lightweight syntax, type inference and the fact that it’s an expression-oriented language. So, it’d be nice to have some support in the language for refactorings, for sure – but it should absolutely not be a barrier to entry.
The “team training” issue is a more serious one to my mind, primarily because it’s about the individual’s perception of learning a new language rather than some arbitrary piece of tooling. Trust me when I say you can be productive in F# in just a few days if you’re coming from C# or VB .NET – particularly if you’ve used LINQ and have a reasonable understanding of it (and its constituent parts in C# 3). Cast your mind back to when people started adopting .NET from, say, VB6. Was there no training issue then? Or from C++? Learning F# is far easier – the whole framework is the same as C# – it’s just the plumbing that orchestrates your calls to the framework that looks a little different.
There are certainly a number of features in F# which don’t really have direct equivalents in C#, and to get the most out of the language you’ll need to do a bit of learning to understand new concepts – just like, say, moving from C# 2 to C# 3. I would say that F# is in a sense a higher-level superset of C# – you can do pretty much everything you would in C#, albeit in a different, terser syntax, plus a whole load of other bits and pieces which you can opt into as you get more comfortable with the language.
As developers, we need to remain open-minded about the art of programming. It’s what allows us to use things like LINQ rather than simple foreach loops. It’s also what allows us to say “yes, using white space to indicate scope instead of curly braces isn’t necessarily a bad thing” or “yes, the keywords ‘let’ and ‘do’ aren’t inherently evil in a programming language”. Keep an open mind and try something new – you might actually be pleasantly surprised!
Just a quick post to update my review of JustMockLite from earlier this year. I originally had a few comments on some features, which I’m pleased to say have now been rectified.
Support (or the lack of it) for recursive mocks was one of my main criticisms of earlier versions of JML. For example, if you had a mock which itself had a method that needed to return another mock – or worse still, you needed to mock the result of a method on that child mock – it was a bit of a pain; you had to manually construct the child mock, and then arrange the top-level call to return it.
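Something along these lines (the interfaces are stand-ins for your own abstractions, and you should check the JustMock Lite docs for the exact Arrange overloads):

```fsharp
open Telerik.JustMock

// Illustrative interfaces only.
type IOrders = abstract Count : int
type ICustomer = abstract Orders : IOrders
type IRepository = abstract GetCustomer : int -> ICustomer

let repo = Mock.Create<IRepository>()

// The child ICustomer and IOrders mocks are created automatically – no
// manual construction and wiring of each level any more.
Mock.Arrange(fun () -> repo.GetCustomer(Arg.AnyInt).Orders.Count).Returns(5) |> ignore
```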
This simple code sample illustrates how recursive mocks are now extremely simple to do in JML. Child mocks are now automatically created without the need to explicitly create one, and you can chain a method call expression when arranging the result of a nested mock. Very nice.
This is a small but important feature for getting up to speed quicker – JML now includes comments on methods etc., which should aid in getting up and running without having to resort to the documentation.
This is all really good. I’d still love to see ignoring of arguments by default on method call arrangement, but overall JML continues to improve – definitely recommended.