Introducing HadoopFs – F# for Hadoop


Having spent a while using Hadoop on HDInsight, I wanted to look at writing Hadoop mappers and reducers in F#. There are several reasons for preferring it over other languages such as Java, Python and C#. I’m not going to go into all of the usual features of F# over other languages, but the main reason is that F# lets you just “get on” with dealing with data. That’s one of its main strengths, in my opinion, and dealing with data is what most map-reduce jobs are about.

There’s already a .NET SDK for Hadoop that Microsoft have released. However, it has some issues, not just in terms of functionality but also in terms of how well it maps to F#. The main problem I have with it is that you write your code in an object hierarchy, inheriting from MapperBase or ReducerCombinerBase. You then have to mutate the Context that’s passed in with any outputs from your Mapper or Reducer.

I wanted something that was a bit more lightweight, and that also allowed me to explore creating a parser for the streaming Hadoop inputs. So, I’ve now put HadoopFs on GitHub, with the intention of putting it on NuGet in the near future. The main thing it gives you is the ability to write mappers and reducers very easily, without the need to “inherit” from any classes or anything, as well as a flexible IO mechanism, so you can pipe data in or out from the “real” console (for use with the real Hadoop), the file system, in-memory lists etc. (essentially anything that can be used to generate a sequence of strings). So the prototypical wordcount map / reduce looks like this: –
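(This is a sketch from memory – the exact signatures in the repository may differ slightly.)

    // Mapper: split each input row into words and emit a (word, 1) pair for each.
    let wordCountMapper (row : string) =
        row.Split ' '
        |> Seq.map (fun word -> word, 1)

    // Reducer: given a word and all the values emitted for it, output the total count.
    let wordCountReducer (key : string) (values : string seq) =
        values
        |> Seq.length
        |> string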

Three lines for the mapper (including the function declaration) and four lines for the reducer. Nice. Notice that you don’t need any dependency on HadoopFs to write your map / reduce code. It’s just a couple of arbitrary functions, which has several benefits. Firstly, it’s more accessible than having to understand a “framework” – all you have to do is understand the Hadoop MR paradigm and you’re good to go. Secondly, it’s easier to test – it’s always far easier to test a pure function than something that involves e.g. mutating the state of some “context” object that you have to create and provide.

The only time you use HadoopFs types and functions is when plugging your MR code into an executable for use with Hadoop: –
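(Again, this is illustrative rather than verbatim – doMap and doReduce here stand in for whatever the entry-point helpers in Core.fs are actually called.)

    open HadoopFs

    [<EntryPoint>]
    let main argv =
        // Hadoop streaming invokes the same executable for both phases,
        // so dispatch on a command-line argument.
        match argv with
        | [| "map" |] -> doMap wordCountMapper        // hypothetical helper: stdin rows in, key/value pairs out
        | [| "reduce" |] -> doReduce wordCountReducer // hypothetical helper: sorted key/value rows in, reduced output out
        | _ -> failwith "expected 'map' or 'reduce' as the first argument"
        0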

You can see from the last example how you can essentially plug in any input / output source, e.g. the file system or the console. This is very useful for unit testing, as you can simply provide an in-memory list of strings and get back the output of a full map-reduce.
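(runJob below is again a stand-in name for whatever function wires a reader and writer around your mapper and reducer.)

    // Run a complete map / sort / reduce pass over an in-memory input,
    // collecting the output in memory too - no console or file system involved.
    let input = [ "the quick brown fox"; "the quick grey dog" ]
    let output = runJob wordCountMapper wordCountReducer input
    // output should contain ("the", "2") and ("quick", "2") amongst its results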

I still have some more work to do on it – some cleaning up of the function signatures for consistency etc., and there are no doubt some extra corner cases to deal with, but as an experiment carried out in a day or so, it was a good learning exercise in Hadoop streaming. Indeed, the hardest part was actually generating a lazy group of key/values for the reduce phase from a flat list of sorted input rows. I’d also like to write a generic MapReduce executable that can be parameterised with whichever mapper and reducer you need.
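To see why that was the tricky bit: the obvious implementation is a couple of lines with Seq.groupBy, as in the naive sketch below, but Seq.groupBy materialises the entire input before yielding anything, which defeats the point of a streaming reducer. The lazy, single-pass version has to exploit the fact that Hadoop pre-sorts the rows by key.

    // Naive grouping of sorted "key<TAB>value" rows into (key, values) pairs.
    // NB this consumes the whole input up front - fine for tests, no good for streaming.
    let naiveGroup (rows : string seq) =
        rows
        |> Seq.map (fun row ->
            match row.Split '\t' with
            | [| key; value |] -> key, value
            | _ -> failwithf "unexpected row: %s" row)
        |> Seq.groupBy fst
        |> Seq.map (fun (key, pairs) -> key, pairs |> Seq.map snd)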

All said though, considering that the entire framework, including test helpers, is less than 150 lines of code, I think it’s quite nice.


7 thoughts on “Introducing HadoopFs – F# for Hadoop”

  1. Hi Isaac – this bite-size library is awesome. Do you have any examples of use on HDInsight? I’m running HBase on a Windows cluster and trying to figure out how I can 1) deploy MR in F# and 2) access HBase table MR utilities in .NET. Any help would be tremendously appreciated.

    1. Deploying MR jobs on HDI isn’t hard, but it is fiddly. You need to copy all DLLs (including dependencies) into the Azure Blob Storage account that you have bound the HDI cluster to. Then, when you run a job, ensure that you pass the full path to the executable that will handle the MR data stream, using a standard Azure Storage WASB path. This post is somewhat old but is probably still a useful starting point: http://blogs.msdn.com/b/cindygross/archive/2013/04/25/access-azure-blob-stores-from-hdinsight.aspx. I’d also like to see a set of FAKE tasks written to make the job of uploading code to blob storage for HDI easier!

  2. Thanks Isaac! I took a closer look at Core.fs and saw that one can define custom readers and writers. I’d like to see if I can access the HBase TableMapper somehow… will keep you posted.

  3. Hey Isaac. Have you done any more work with F# on HDInsight? Have you seen other developments on this topic? I’ve searched, but haven’t found anything. Thanks.

  4. Hey Nelson. Not really – I did it more as a proof of concept, to prove that one could adopt a purely functional pipeline and get it working without much fuss. You can of course use the “full” C# SDK on HDInsight, but I’d probably not bother. Frankly, these days if someone wanted to do Hadoop in F#, I’d point them firmly in the direction of MBrace first anyway 🙂
