MBrace, CloudFlows and FSharp.Data – data analysis made easy


In case you’ve not seen it before, MBrace is a simple programming model for scalable cloud data scripting and programming with .NET. It’s written in F#, but has growing support for C# and VB .NET. Over the past year or so, I worked closely with the MBrace team to help get it working smoothly on Microsoft Azure, using features such as Service Bus and Storage to provide an excellent development and deployment experience. As MBrace gears up for a v1 release, the design of the API is looking extremely positive.

I'm going to demonstrate a simple example here that illustrates how easy it is to start working with a large CSV file from the internet in an MBrace cluster, parsing and querying the data as we go – we're going to analyse UK house prices over the past year (the file is freely available on the gov.uk website).

I’m going to assume that you have an MBrace cluster up and running – if you don’t, you can either use a local development cluster or download the latest source code and deploy a full cluster onto Azure using the example MBrace Worker Role supplied in the MBrace Azure source code.

Type Providers on MBrace

We’ll start by generating a schema for our data using FSharp.Data and its CSV Type Provider. Usually the type provider can infer all data types and columns but in this case the file does not include headers, so we’ll supply them ourselves. I’m also using a local version of the CSV file which contains a subset of the data (the live dataset even for a single month is > 10MB): –
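The provider declaration looks something like this – a minimal sketch, where the sample file name and the column names are illustrative rather than the exact gov.uk schema: –

```fsharp
open FSharp.Data

// Point the provider at a local sample of the data; since the real file has no header
// row, we supply column names (and types) ourselves via the Schema parameter.
type HousePrices = CsvProvider<"SampleHousePrices.csv", HasHeaders = false, Schema = "TransactionId, Price (int), DateOfTransfer (date), Postcode, PropertyType, Street, Town, County">
```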

In that single line, we now have a strongly-typed way to parse CSV data. Now, let’s move onto the MBrace side of things. I want to start with something simple – let’s get the average sale price of a property, by month, and chart it.

A CloudFlow is an MBrace primitive that allows a distributed set of transformations to be chained together, just as you would with the Seq module in F# (or LINQ’s IEnumerable operators for the rest of the .NET world). The difference is that in MBrace, a CloudFlow pipeline is partitioned across the cluster, making full use of the resources available; only when the pipelines have completed in each partition are the results aggregated together again.
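Here's a sketch of what the pipeline might look like. I'm assuming a connected MBrace client called cluster and a pricesUrl string pointing at the CSV data; the exact CloudFlow operator names may vary slightly between MBrace versions: –

```fsharp
// Calculate the average sale price per month, distributed across the cluster.
let averagePriceByMonth =
    CloudFlow.OfHttpFileByLine pricesUrl                          // stream the CSV line-by-line across the cluster
    |> CloudFlow.map (HousePrices.ParseRows >> Seq.head)          // strongly-typed rows via the CSV type provider
    |> CloudFlow.groupBy (fun row -> row.DateOfTransfer.Month)    // group sales by month of transfer
    |> CloudFlow.map (fun (month, sales) ->
        month, sales |> Seq.averageBy (fun sale -> float sale.Price))
    |> CloudFlow.toArray
    |> cluster.Run                                                // an array of (month, average price)
```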

Also notice that we’re using type providers in tandem with the distributed computation. Once we call the ParseRows function, in the next call in the pipeline, we’re working with a strongly-typed object model – so DateOfTransfer is a proper DateTime etc. All dependent assemblies have automatically been shipped with MBrace; it wasn’t explicitly designed to work with FSharp.Data – it just works. So now that we have an array of int * float i.e. month * price, we can easily map it on a chart: –

(Chart: average sale price by month.)

Easy.

Persisted Cloud Flows

Even better, MBrace supports something called Persisted Cloud Flows (known in the Spark world as RDDs). These are flows whose results are partitioned and cached across the cluster, ready to be re-used again and again. This is particularly useful if you have an intermediary result set that you wish to query multiple times. In our case, we might persist the first few lines of the computation (which involves downloading the data from source and parsing with the CSV Type Provider), ready to be used for any number of strongly-typed queries we might have: –
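Something along these lines – again a sketch, reusing the assumed cluster and pricesUrl values from above; the exact caching function may differ between MBrace versions: –

```fsharp
// Download and parse the data once, then persist the typed rows across the cluster.
let persistedPrices =
    CloudFlow.OfHttpFileByLine pricesUrl
    |> CloudFlow.map (HousePrices.ParseRows >> Seq.head)
    |> CloudFlow.cache                      // a PersistedCloudFlow, partitioned and cached in memory
    |> cluster.Run

// Subsequent queries work against the cached, partitioned data - no re-download, no re-parse.
let mostExpensiveSale =
    persistedPrices
    |> CloudFlow.maxBy (fun sale -> sale.Price)
    |> cluster.Run
```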

So notice that the first query takes 45 seconds to execute, which involves downloading the data and parsing it via the CSV type provider. Once we’ve done that, we persist it across the cluster in memory – then we can re-use that persisted flow in all subsequent queries, each of which just takes a few seconds to run.

Conclusion

MBrace is on the cusp of a 1.0 release – it’s ready for you to start using now, and it offers a powerful, flexible set of abstractions for distributed computation. As you can see from the above, if you’ve used the collection libraries in F# before, the leap to distributed collection queries is a very smooth transition to make. In less than ten lines of code, you can start writing distributed queries against live datasets with the minimum of effort.

F# Azure Storage Type Provider v1.0 released!


So, last week I finally released the F# Azure Storage Type Provider as v1! I’ve learned a hell of a lot about writing Type Providers in F# over the last few months as a result… Anyway – v1.0 deals with Blobs and Tables; I’m hoping to integrate Queues and possibly Files in the future (the former is particularly powerful from a scripting point of view). You can get it on NuGet or download the source (and add issues etc.) through GitHub.

Working with Blobs

Here’s a sample set of containers and blobs in my local development storage account displayed through Visual Studio’s Server Explorer and some files in the “tp-test” container: –

(Screenshots: the storage account in Server Explorer, and the files in the “tp-test” container.)

You can get to a particular blob in two lines of code: –

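A sketch of what that looks like – the container and file names mirror the screenshots above, and the exact static parameters and member names are assumptions from memory rather than gospel: –

```fsharp
open FSharp.Azure.StorageTypeProvider

// Line 1: connect to a storage account - here the local development emulator.
// For a live account, supply the account name and key instead.
type Azure = AzureTypeProvider<"UseDevelopmentStorage=true">

// Line 2: navigate container -> blob with intellisense and read its contents as a string.
let contents = Azure.Containers.``tp-test``.``file1.txt``.Read()
```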

The first line connects to the storage account – in this example I’m connecting to the local storage emulator, but to connect to a live account, just provide the account name and storage key. Once you navigate to a blob, you can download the file to the local file system, read it as a string, or stream it line-by-line (useful for dealing with large files). Of course you get full intellisense for the containers and folders automatically – this makes navigating through a storage account extremely easy to do: –

(Screenshots: intellisense over the containers and folders of the storage account.)

Working with Azure Tables

The Table section of the type provider gives you quick access to tables, does a lot of the heavy lifting for doing bulk inserts (automatically batching up based on partition and maximum batch size), and gives you a schema for free. This last part means that you can literally go to any pre-existing Azure Table that you might have and start working with it for CRUD purposes without any predefined data model.

Tables automatically represent themselves through intellisense, and give you a simple API to work with: –

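A sketch of the sort of code this gives you (the employee table, its keys and the member names here are illustrative): –

```fsharp
// The table itself is a provided property - no data model needs to be defined up front.
let employees = Azure.Tables.employee

// Retrieve a single, strongly-typed row by partition and row key.
let isaac = employees.Get(Row "1", Partition "men")

// Pull back a whole partition in one call.
let allMen = employees.GetPartition "men"
```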

Results are essentially DTOs that represent the table schema. Whilst Tables have no enforced schema, individual rows themselves do have one, and we can interrogate Azure to understand that schema and build a strongly-typed data model over the top of it. So the following schema in an Azure table:

(Screenshot: an example table’s schema as it appears in Azure.)

becomes this in the type provider: –

(Screenshot: the same schema surfaced as provided, typed properties.)

All the properties are strongly typed based on their EDM type, e.g. string, float and so on. We can also execute arbitrary plain-text queries, or use a strongly-typed query builder to chain clauses and ultimately execute a query remotely: –

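A query built with the strongly-typed query builder might look something like this sketch – the back-ticked members are generated from the table’s actual columns, so the property names below are illustrative: –

```fsharp
// Chain clauses together and execute the composed query remotely against table storage.
let longServingIsaacs =
    Azure.Tables.employee.Query()
        .``Where Name Is``.``Equal To``("Isaac")
        .``Where Years Working Is``.``Greater Than``(5)
        .Execute()
```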

Whilst this is not quite LINQ over Azure, there’s a reason for that. Ironically, the Azure SDK does support IQueryable over table storage. But because Table Storage is computationally weak, there are severe restrictions on what you can do with LINQ – basically just Where and Take. The benefit of the more restrictive query set that the Type Provider delivers is that it is guaranteed at compile time to generate a query that will be accepted by Azure Tables, whereas IQueryable over Tables gives no such guarantee.

The provided types also expose the raw set of values for each entity as key/value pairs, so that you can easily push this data into other formats, e.g. Deedle, if you want.

Future Plans

I’d like to add some other features to the Storage Type Provider going forward, such as: –

  • Azure Storage Queue Support
  • “Individuals” support (a la the SQL Type Provider) for Tables
  • Support for data binding on generated entities for e.g. WPF integration
  • Potentially removing the dependency on the Azure Storage SDK
  • Option types on Tables (either through schema inference or provided by configuration)
  • Connection string through configuration
  • More Async support

If you are looking to get involved with working on a Type Provider – or have some Azure experience and want to learn more about F# – this would be a good project to cut your teeth on 🙂

Debunking the LINQ “magic” myth again


I’ve blogged before about how LINQ-to-Objects, at its most basic, is just about enumerating over collections one item at a time via MoveNext(). It wraps this up in a beautiful API, but it’s still generally crawling through collections. I wanted to give an example of this in more depth, and show how the C# team introduced some smart optimisations where possible.

I’m sure that Jon Skeet’s fantastic Edulinq blog series goes into far more detail than I do here, but anyway…

LINQ’s Any() vs Count()

I know a lot of people who tend to check whether a collection is empty with a collection.Count() == 0 check (or > 0 for a non-empty one). This obviously works, but there’s a better alternative: collection.Any(). There are two reasons why you should favour this over Count: –

Readability

Compare the two examples below: –

  • if (employees.Count() > 0)
  • if (employees.Any())

The latter reads much better. The former invariably means that in your head you’re doing a “conversion” to “if there are any items”.

Performance

How does .Count() work? It simply iterates over every item in the collection to determine how many there are. In this situation, .Any() will be much, much quicker because it simply tests if there is at least one item in the collection i.e. does MoveNext() return true or not.

Decompiling LINQ

There’s one situation where the above is not true. If the underlying collection implements either version of ICollection, and you’re calling the parameter-less version of Count(), that extension method is smart enough to simply delegate to the already-calculated Count property. Again – this only applies to the parameter-less version! So for the version of Count that takes in a predicate, the first of the next two samples will generally be much quicker than the second: –

  • employees.Any(e => e.Age < 25);
  • employees.Count(e => e.Age < 25) > 0;

How can I prove this? Well, newer versions of CodeRush have a built-in decompiler so you can look at the source code of these methods (like Reflector does). I’m sure there are other tools out there that do the same… anyway, here’s an (ever so slightly simplified) sample of the implementations of the Count() and Any() methods. First, the parameter-less versions: –

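Here’s a paraphrased sketch of what those implementations boil down to (simplified, not verbatim BCL source): –

```csharp
using System.Collections;
using System.Collections.Generic;

public static class EnumerableSketch
{
    public static int Count<TSource>(this IEnumerable<TSource> source)
    {
        // Optimisation: if the source already knows its own size, just ask it.
        var genericCollection = source as ICollection<TSource>;
        if (genericCollection != null) return genericCollection.Count;

        var nonGenericCollection = source as ICollection;
        if (nonGenericCollection != null) return nonGenericCollection.Count;

        // Otherwise there is no choice but to walk the entire sequence.
        var count = 0;
        using (var enumerator = source.GetEnumerator())
            while (enumerator.MoveNext()) count++;
        return count;
    }

    public static bool Any<TSource>(this IEnumerable<TSource> source)
    {
        // No counting needed - a single successful MoveNext() answers the question.
        using (var enumerator = source.GetEnumerator())
            return enumerator.MoveNext();
    }
}
```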

The former optimises where possible, but if it can’t, it has to fall back to iterating over the entire collection. The latter simply returns as soon as it knows whether there is at least one element.

Now here are the predicated versions of those methods: –

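Again, a paraphrased sketch rather than the verbatim decompiled output: –

```csharp
using System;
using System.Collections.Generic;

public static class PredicatedEnumerableSketch
{
    public static int Count<TSource>(this IEnumerable<TSource> source, Func<TSource, bool> predicate)
    {
        // Every item must be tested against the predicate - no shortcut is possible.
        var count = 0;
        foreach (var item in source)
            if (predicate(item)) count++;
        return count;
    }

    public static bool Any<TSource>(this IEnumerable<TSource> source, Func<TSource, bool> predicate)
    {
        // Stops as soon as any item satisfies the predicate.
        foreach (var item in source)
            if (predicate(item)) return true;
        return false;
    }
}
```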

The former has to iterate over every item in the collection to determine how many passed the predicate – there’s no optimisation possible here. The latter simply finds the first item in the collection that passes the predicate and breaks there.

Conclusion

Having read through those code samples, re-read the initial example of Any versus Count and ask yourself why you would use Count 🙂

Again, just to drill home the point – there is no magic to LINQ. The C# developers did a great job optimising code where possible, but at the end of the day, when you have several options open to you regarding queries, make sure you choose the correct option for the right reason to ensure your code performs as well as possible.

Using Aggregate in LINQ


The System.Linq namespace has a load of useful extension methods like Where, Select and so on that allow us to chain up bits of code that operate over sequences of data, allowing us to apply functional-style programming to our data.

There is one method which is often overlooked, yet which is probably the one that lends itself best to functional programming: the Aggregate() method. Unlike methods such as Select, which, given a set of n items, projects a set of n other items, Aggregate can be used as a way of merging a collection of items into a different number of items. Indeed, some LINQ methods can be implemented easily with Aggregate, such as Sum: –

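For example, something like this sketch (the numbers are illustrative): –

```csharp
using System;
using System.Linq;

var numbers = Enumerable.Range(1, 10);

var total = numbers.Sum();                                        // 55
var aggregated = numbers.Aggregate((acc, value) => acc + value);  // also 55
```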

The syntax looks a bit bizarre, especially when you look at the function signature of the method (including the overloads), but essentially the method takes in a function which itself takes in two values and returns another: –

  • accumulator, which is an arbitrary object that is threaded through every item in the collection. In the default overload of Aggregate, this is the same type as the elements of the source collection e.g. Int32.
  • value, which is the next value in the chain e.g. 1, then 2, then 3 etc.
  • result, which will become the accumulator in the next iteration

So, if we were to expand the above bit of code with debugging statements, it would look something like this: –
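Here is an illustrative expansion, reusing the numbers sequence from the sketch above: –

```csharp
var aggregated = numbers.Aggregate((accumulator, value) =>
{
    var result = accumulator + value;
    Console.WriteLine("accumulator: {0}, value: {1} -> result: {2}", accumulator, value, result);
    return result;
});

// accumulator: 1, value: 2 -> result: 3
// accumulator: 3, value: 3 -> result: 6
// ...
// accumulator: 45, value: 10 -> result: 55
```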

Note that with the default function overload, the initial value of the accumulator is the first value in the collection (1), aka the “seed” value.

More complex uses of Aggregate

Let’s say we wanted to print out a single string containing all of the numbers separated by a space, e.g. “1 2 3 4 5 6 7 8 9 10”. Common LINQ methods wouldn’t be appropriate for this. You could use Select to get the string representation of each number, but you would get a sequence of 10 strings rather than a single one. You might now fall back to foreach loops etc., but this is where Aggregate is useful: –

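A sketch of how that might look (the Trim call just tidies up the leading space): –

```csharp
var numbersAsString =
    Enumerable.Range(1, 10)
              .Aggregate("", (acc, value) => (acc + " " + value).Trim());

Console.WriteLine(numbersAsString);   // 1 2 3 4 5 6 7 8 9 10
```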

This overload of Aggregate takes in two arguments – the first is a “seed value” which will be the initial value of the accumulator, in our case an empty String. Every iteration takes the accumulator, appends the next number to it and returns the resultant String as the next accumulator, which gives us the following (debug statements added): –


Simples! (Obviously in a real-world example you might use a StringBuilder as your accumulator instead.)

Notice how in this example we didn’t return the same type as the collection that we operated over (i.e. Int32). We can use this technique to do all sorts of funky things to collections that you might not have considered before.

Conclusion

Aggregate is a rarely used but extremely powerful LINQ method. In my next post, I’ll build on this showing some more powerful (and perhaps useful!) examples of Aggregate.

LINQ in C#2


Introduction

Continuing my series of posts on LINQ, I wanted to give a simple example of how one can get the same sort of functionality in terms of query composition and lazy evaluation by using the yield keyword, without using any of C#3’s features. Bear in mind that LINQ was introduced as part of .NET 3.5, which itself runs on the same CLR as .NET 2. So everything that happens with LINQ is “just” a set of compiler tricks and syntactic sugar – at runtime there’s nothing that happens that can’t be done manually with C#2.

Here’s the task we’ll tackle: Get the next 5 dates that fall on a weekend.

Streaming data

In purely non-LINQ terms, we could easily carry out this operation as a while loop, bespoke for the problem at hand. However, this wouldn’t offer any of the benefits that an API like LINQ offers, e.g. composability and reusability of operations, which is what we’re trying to achieve – so let’s assume we’re trying to use a query-style mechanism; also, we want to try to create something more like a Date query API that we could use to write other, similar queries in future.

In C#3 using LINQ we might use Where() to filter out non-weekend days and then Take() to retrieve five items. But there’s an initial challenge that we encounter when trying to do this query with LINQ – what exactly do we query over – what set of data do we operate on? There’s no static “All Dates” property in .NET, and we don’t know in advance the set of dates to query over. This is where yield comes in. It allows us to easily create sequences of data that can be generated at run-time and queried over.

Take a look at this: –

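Here’s a sketch of the sort of code being described (the console messages and colours are illustrative): –

```csharp
using System;
using System.Collections.Generic;

// Temporarily changes the console foreground colour for the lifetime of a using block.
public sealed class DisposableColour : IDisposable
{
    private readonly ConsoleColor previous;

    public DisposableColour(ConsoleColor colour)
    {
        previous = Console.ForegroundColor;
        Console.ForegroundColor = colour;
    }

    public void Dispose()
    {
        Console.ForegroundColor = previous;
    }
}

public static class DateQueries
{
    // Lazily yields an endless stream of consecutive dates, starting from the supplied date.
    public static IEnumerable<DateTime> CreateDateStream(DateTime startDate)
    {
        DateTime current = startDate;
        while (true)
        {
            using (new DisposableColour(ConsoleColor.Green))
                Console.WriteLine("Generating {0:d}", current);

            yield return current;
            current = current.AddDays(1);
        }
    }
}
```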

Ignore the DisposableColour class – it just temporarily changes the colour of the console foreground. What’s more important is that this method returns something that masquerades as a collection of DateTimes – when in reality we’re generating an infinite stream of DateTimes, starting at the date argument provided. This collection has no end and you can never ToList() it to fully execute it. Well, you could try, but it would keep going until DateTime.MaxValue is reached. It simply generates dates on demand starting from the provided date.

Implementing composable query methods

Given this stream, we can write two other methods: one which filters out dates that do not fall on a Saturday or Sunday, and another which will only “take” a given number of items from the sequence and then stop: –

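A sketch of the filtering method, living alongside CreateDateStream above (the method name and signature are illustrative): –

```csharp
// Yields only those dates whose day-of-week appears in the supplied list.
public static IEnumerable<DateTime> GetDatesForDays(IEnumerable<DateTime> dates, params DayOfWeek[] requiredDays)
{
    List<DayOfWeek> days = new List<DayOfWeek>(requiredDays);
    foreach (DateTime date in dates)
    {
        if (days.Contains(date.DayOfWeek))
            yield return date;
    }
}
```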

Notice that with the above method, we only yield dates that match the required days; otherwise we give nothing back and implicitly move on to the next item. Next, here’s a generic implementation of Take. It returns the next item in the collection, and then, once it has returned the required number of items, breaks out of the foreach loop.

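And a sketch of the generic Take (again, the name is illustrative): –

```csharp
// Returns items from the source until the requested number have been yielded,
// then breaks the loop so nothing more is pulled from the source.
public static IEnumerable<T> TakeItems<T>(IEnumerable<T> source, int itemsToTake)
{
    int itemsTaken = 0;
    foreach (T item in source)
    {
        yield return item;
        itemsTaken++;

        if (itemsTaken == itemsToTake)
            break;
    }
}
```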

Consuming composable methods

Imagine that the methods above lived in a self-contained API that allowed us to easily query DateTimes – here’s how we could use it to answer our original question: –

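A sketch of the consuming code, using the illustrative names from the sketches above: –

```csharp
IEnumerable<DateTime> dateStream = DateQueries.CreateDateStream(DateTime.Today);
IEnumerable<DateTime> weekendDays = DateQueries.GetDatesForDays(dateStream, DayOfWeek.Saturday, DayOfWeek.Sunday);
IEnumerable<DateTime> nextUpcomingWeekendDays = DateQueries.TakeItems(weekendDays, 5);

foreach (DateTime date in nextUpcomingWeekendDays)
    Console.WriteLine("Next weekend day: {0:d}", date);
```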

All we do is generate the stream of DateTimes and then pipe them through the two other methods. The beauty of this is that because we’re yielding everything from the first method to the last, we only generate DateTimes that are required.

The key is the CreateDateStream method i.e. the stream of Dates. We cannot generate "every" date in advance – that would be grossly inefficient; it’s much better to dynamically create a stream as required.

  • dateStream is a stream of all dates starting from DateTime.Today.
  • weekendDays is the filtered stream of dates from dateStream that fall on a Saturday or Sunday
  • nextUpcomingWeekendDays is the stream of the first 5 items from weekendDays

If we run the code above, we get the following output: –

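The output looks along these lines (the dates themselves are illustrative and depend on the day you run it): –

```text
Generating 23/07/2015
Generating 24/07/2015
Generating 25/07/2015
Next weekend day: 25/07/2015
Generating 26/07/2015
Next weekend day: 26/07/2015
Generating 27/07/2015
...
Generating 01/08/2015
Next weekend day: 01/08/2015
Generating 02/08/2015
Next weekend day: 02/08/2015
...
Generating 08/08/2015
Next weekend day: 08/08/2015
```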

Look at the messages in more detail. We only created as many dates as were needed to match 5 weekend days. Only those dates that pass the required filter criteria get streamed into the Take() method, and only those fall out into Main. When we’ve taken enough, Take() breaks the loop, which ends the foreach.

Conclusion

Yield is one of the key enablers for writing lazily-evaluated queries and collections. Without it, your queries would be less composable as well as less efficient; streaming out data as required allows us to only generate that part of the data that we still require.

In our example, we could get the next ten upcoming days without filtering, because Take also operates on IEnumerable<DateTime> – we can simply chain up our methods as and where needed.

Lastly – we could easily change the signatures of the three API methods to make them Extension Methods to give us a more LINQ-style DSL. It looks more like LINQ, but it’s still exactly the same code: –

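A sketch of the same pipeline with the methods reworked as extension methods (names still illustrative): –

```csharp
IEnumerable<DateTime> nextUpcomingWeekendDays =
    DateTime.Today
            .CreateDateStream()
            .GetDatesForDays(DayOfWeek.Saturday, DayOfWeek.Sunday)
            .TakeItems(5);
```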

If you’re struggling with this, it might help you to write out the code yourself and step through it with the debugger to see the actual flow of messages, or try creating some simple yield-style collections yourself.

Psychic LINQ


A relatively short post on cargo cult programming, particularly related to LINQ.

LINQ is a fantastic technology. The idea of a platform-agnostic query language is a great one. You can write the same query, in C#, over an in-memory list or a database and, from the client point of view, treat it in the same way. Isn’t it wonderful!

I’ve recently carried out a number of job interviews where candidates had to answer the following question:

If you wanted to find all Customers whose name was “Isaac”, why would you use a .Where() clause over a collection of T rather than using a foreach loop and manually construct the result set?

The results were varied. What I was looking for was a discussion of the benefits of declarative versus imperative code – what versus how – and the composability of queries.

Strangely enough, the most common answer I got was "LINQ is faster than a foreach loop". Why? Either because LINQ somehow "optimises" code to make it faster, or because it "doesn’t need to loop through a collection – it just ‘does’ a where clause over the collection so that it instantly finds it". Almost as if C# is doing magic! In both cases the candidates could not justify their beliefs with evidence – it was just their feeling that that “must” be the case.

Now, let’s talk about the reality. I would direct everyone to Jon Skeet’s fantastic Edulinq blog series to get an in-depth understanding of how LINQ-to-Objects works, but always remember this simplification:

The only methods and properties that LINQ has available for all of its query methods are sourced from IEnumerator<T>:

  • Boolean MoveNext();
  • T Current { get; }

That’s it. Think about that. There is no magic involved. If you do a Where() clause over a collection, you will enumerate the entire source. There is no “pixie dust” that will give LINQ the answer quicker than a foreach loop and an if / then statement – and bear in mind, foreach loops are just syntactic sugar over IEnumerator, just like the using statement wraps IDisposable.

Let there be LINQ


Just a quick post regarding use of the let keyword in LINQ, which I find to be somewhat under-used by many people. Whilst one benefit of it can be readability (i.e. aliasing sections of complex queries to aid understanding of the query), the other benefit can be performance.

There is indeed a cost associated with using it, i.e. every time you use it, you’re effectively creating a new anonymous type to hold that value plus whatever the previous result in your query pipeline was. So if you chain up lots of lets in a query, that’ll have an impact on the query. However, there is a case where let can give large performance benefits: –

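Here’s an illustrative pair of queries in the same spirit as the original samples (the data source below is made up; the timings later in this post relate to the original samples): –

```csharp
using System;
using System.Linq;

// An illustrative source: a large array of numbers held as strings.
string[] rawNumbers = Enumerable.Range(1, 1000000).Select(i => i.ToString()).ToArray();

// Without let: Convert.ToInt32 is evaluated twice for every element that passes the filter.
var withoutLet =
    (from raw in rawNumbers
     where Convert.ToInt32(raw) % 2 == 0
     select Convert.ToInt32(raw) * 2).ToArray();
```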

Compare that code with the following: –

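The equivalent query using let: –

```csharp
// With let: the conversion happens exactly once per element and the result is reused
// in both the where and select clauses.
var withLet =
    (from raw in rawNumbers
     let number = Convert.ToInt32(raw)
     where number % 2 == 0
     select number * 2).ToArray();
```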

This will eliminate a massive number of calls to Convert.ToInt32() and reduce the time taken to process that query by around 40%; the former sample took ~1400ms to run whereas the latter took only around 800ms.