Having spent a while using Hadoop on HDInsight now, I wanted to look at writing Hadoop mapper and reducers in F#. There are several reasons for this as opposed to other languages such as Java, Python and C#. I’m not going to go into all of the usual features of F# over other languages, but the main reason is because F# lets you just “get on” with dealing with data. That’s what one of it’s main strengths, in my opinion, and is what most map-reduce jobs are about.
There’s already a .NET SDK for Hadoop that Microsoft have released. However, it does have some issues with it, not just in terms of functionality but also in terms of how well it maps with F#. The main problem that I have with it is that you write your code in an object hierarchy, inheriting from MapperBase or ReducerCombinerBase. You then have to mutate the Context that’s passed in with any outputs from your Mapper or Reducer.
I wanted something that was a bit more lightweight, and also allowed me to explore creating a parser from the Streaming Hadoop inputs. So, I’ve now put HadoopFs on GitHub, with the intention to put it on NuGet in the short term future. The main things is gives you is the ability to write mapper and reducers very easily without the need to “inherit” from any classes or anything, and also a flexible IO mechanism, so you can pipe data in or out from the “real” console (for use with the real Hadoop), file system or in-memory lists etc. (essentially anything that can be used to generate a sequence of strings). So the prototypical wordcount map / reduce looks like this: –
|/// A sample map reducer that counts words in a document. Notice how the outputs of both map and reduce do not have to be strings.|
|/// A sample mapper that splits lines based on spaces into words and counts the number of occurences within a line.|
|let Mapper (row : string) =|
|row.Split([| ' ' |], StringSplitOptions.RemoveEmptyEntries)|
||> Seq.countBy id|
|/// A sample reducer that counts words supplied from the word count mapper.|
|let Reducer (key, values) =|
|Some (key, values|
||> Seq.map Int32.Parse|
Three lines for the mapper (including function declaration) and four lines for the reducer. Nice. Notice that you do not need to have any dependency on HadoopFs to write your map / reduce code. It’s just a couple of arbitrary functions, which has several benefits. Firstly, it’s more accessible than having to understand a “framework” – all you have to do is understand the Hadoop MR paradigm and you’re good to go. Secondly, it’s easier to test – you can always much more easily test a pure function than something which involves e.g. mutating state of some “context” object that you need to create and provide.
The only times you use the HadoopFs types and functions is when plugging in your MR code into an executable for use with Hadoop:-
|/// A mapper exe|
|let mainMap argv =|
|doMap <| ManyOutputs WordCount.Mapper|
|/// A reducer exe|
|let mainReduce argv =|
|doReduce <| SingleOutput WordCount.Reducer|
|/// Testing a full map reduce, reading inputs from the file system and writing output to the console.|
|let mainMap argv =|
|doMapReduce <| ManyOutputs WordCount.Mapper <| SingleOutput WordCount.Reducer <| Readers.FileSystem(@"C:\Input.txt") <| Writers.Console|
You can see from the last example how you can essentially plug in any input / output source e.g. file system or console etc.. This is very useful for e.g. unit testing as you can simply provide an in-memory list of strings and get back the output from a full map-reduce.
I still have some more work to do on it – some cleaning up of the function signatures for consistency etc., and there’s no doubt some extra corner cases to deal with, but as an experiment in doing this in a day or so, it was a good learning exercise in Hadoop streaming. Indeed, the hardest part was actually in generating a lazy group of key/values for the reduce from a flat list of sorted input rows. I’d also like to write a generic MapReduce executable that can be parameterised for the mapper or reducer that you need.
All said though, considering the entire framework including test helper classes is less than 150 lines of code, it’s quite nice I think.