Determining graph sizes efficiently with F#


Introduction

Continuing on from my gaming-in-F# post, this week’s post is derived from this challenge. The initial challenge is to determine the “efficiency” of each node within a network – that is, to calculate the maximum number of hops it takes to reach all edges of the graph starting from that node. So given the graph below, starting from node 3, it’s 2 hops (in this case, to all edges i.e. 1, 5, 6 and 8). However, if you were to start at, say, node 2, it’d be 3 hops (1 hop to node 1, but 3 hops to nodes 5, 6 and 8). Node 3 is therefore considered the more “efficient” node. Secondly, you should determine the most efficient node in the whole network – the objective being to find the most effective starting spot in the graph from which to reach every other node.

[Figure: the example network – node 3 connects to nodes 2, 4 and 7, which in turn connect to the outermost nodes 1, 5, 6 and 8]

Disclaimer: I should mention that, whilst I’ve been writing code for a number of years, the degree I took was not heavily computer-science related – so things like big O notation, complexity theory etc. are all things I’ve picked up over the years but never formally studied. Some of the things below may therefore seem second nature to you if you come from more of a maths background!

I identified three main elements in solving this challenge.

Representing the data

Relations are provided as a simple int * int tuple, i.e. (1,3) means that there’s a (bi-directional) connection between nodes 1 and 3. So I built a simple Map<int, int []>, which is essentially a lookup that says “for a given node, give me the list of connected nodes”. Note that I decided not to use a “proper” representation of the graph here – a more idiomatic approach might have been a discriminated union with Node and Leaf cases etc. – but a simple Map proved enough for me.
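As a rough sketch of that lookup (not the original code – the connection list below just reproduces the example graph described above):

    // Build the adjacency lookup from raw (int * int) connection tuples.
    let connections = [ (1, 2); (2, 3); (3, 4); (4, 5); (4, 6); (3, 7); (7, 8) ]

    /// For each node, the array of nodes reachable in a single hop. Connections
    /// are bi-directional, so each tuple contributes an entry to both of its nodes.
    let graph : Map<int, int[]> =
        connections
        |> List.collect (fun (a, b) -> [ a, b; b, a ])
        |> List.groupBy fst
        |> List.map (fun (node, links) -> node, links |> List.map snd |> List.toArray)
        |> Map.ofList

    // e.g. graph.[3] = [| 2; 4; 7 |]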

Implementing the algorithm

Firstly, implement the logic to calculate the maximum distance to all edges, for a given starting point in the graph. I solved this with what is essentially (I’m told) a recursive depth-first search. In other words, navigate from the starting point as far outwards as you can; then walk back in until you find another branch, and go out along that. Once you start walking back in, count how many steps it takes to reach a branch point. Once you have calculated the distance of all the branches, take the largest one. Repeat this until you have exhausted all points in the graph and walked back to the starting point.
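A minimal sketch of that traversal (again, not the original code – it assumes the graph Map built above, and that the network has no cycles, which is true of the example):

    /// Walk outwards from `node`, never stepping back onto the node we came from,
    /// and return the number of hops to the furthest edge of the graph.
    let rec maxDistanceFrom (graph : Map<int, int[]>) cameFrom node =
        let children = graph.[node] |> Array.filter (fun next -> next <> cameFrom)
        if Array.isEmpty children then 0      // reached an edge - start walking back in
        else 1 + (children |> Array.map (maxDistanceFrom graph node) |> Array.max)

    // Passing the starting node as its own "came from" means nothing is filtered
    // out on the first step. For node 3 in the example graph this returns 2.
    let efficiencyOf node = maxDistanceFrom graph node node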

It is then a simple task to apply this algorithm to every node in the graph and take the smallest result – this is the most efficient point in the graph from which to start.
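In code, something along these lines (building on the sketch above):

    // Score every node and pick the one with the smallest maximum distance -
    // the most "efficient" starting point in the network.
    let mostEfficientNode =
        graph
        |> Map.toArray
        |> Array.map (fun (node, _) -> node, efficiencyOf node)
        |> Array.minBy snd
    // For the example graph this gives (3, 2) - node 3 reaches everything within 2 hops.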

Improving efficiency

Note that this wasn’t my first solution! The biggest challenge came when one of the exercises on the website provided a large set of connections in the graph – say, 30,000. At this point my algorithm simply didn’t perform, so I had to look at ways to improve it. I tried a load of different things, each of which yielded some performance improvement, but not enough: –

  1. Moving from Sequences to Arrays. Sequences are flexible and sometimes very useful, but Arrays will generally outperform them for maps, filters etc., particularly if you are repeating an operation over the same sequence many times (although there is Seq.cache).
  2. Adding state tracking. For each starting point, I would record the efficiency and then provide that number to the next starting point. Whenever I encountered a graph whose size was at least equal to the score of the “most efficient node” found so far, I would immediately stop looking at that node and backtrack all the way out. This provided a good boost in performance, but not enough.
  3. Experimenting with other minor algorithmic improvements, such as prematurely exiting a particular route of the graph if any single route exceeded the current best size, rather than iterating over all child nodes and evaluating them together.

None of these gave an optimal solution – all they did was increase the complexity of the code in exchange for moderate performance gains. I knew there must be another approach that I was missing. Eventually I realised how I could substantially improve efficiency: when you start from two separate nodes, there’s usually a large amount of repeated traversal across them both. Take the above graph – you can view the network efficiency of e.g. node 3 either as described above, or as (the highest efficiency of all adjacent nodes for their subset of the graph) + 1.

In the image below, we know that Node 3 has an efficiency of 2 because the efficiency of Node 4 is 1, Node 7 is 1 and Node 2 is 1. Take the maximum of these (1), add 1, and we get 2.

[Figure: the example graph annotated with the efficiency of each of node 3’s neighbours]

So, given this, why not simply cache the result of every backtrack score from each pass? We can modify our traversal code above with some memoization backed by a simple Dictionary (apologies for using a mutable Dictionary – you could probably use an immutable Map if you wanted, although performance would probably suffer a bit). Then, before making any outward movement in the graph, we check whether that movement has already been made – if so, we can just reuse the cached result. This is why the final algorithm counts inwards from the edges rather than outwards – it allows the distances to be cached.
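Here’s a sketch of what that might look like, bolted onto the traversal from earlier (not the original code): the cache is keyed on the directed step into a node, so any later walk that enters the same node from the same direction is answered straight from the Dictionary.

    open System.Collections.Generic

    let maxDistanceMemoized (graph : Map<int, int[]>) =
        let cache = Dictionary()                 // key and value types are inferred from usage
        let rec walk cameFrom node =
            let key = cameFrom, node             // a directed "movement" into this node
            match cache.TryGetValue key with
            | true, cached -> cached             // this movement has been made before - reuse it
            | false, _ ->
                let children = graph.[node] |> Array.filter (fun next -> next <> cameFrom)
                let distance =
                    if Array.isEmpty children then 0
                    else 1 + (children |> Array.map (walk node) |> Array.max)
                cache.[key] <- distance
                distance
        fun node -> walk node node

    // Build the memoized function once, then score every node cheaply.
    let efficiency = maxDistanceMemoized graph
    let best =
        graph |> Map.toArray |> Array.map (fun (node, _) -> node, efficiency node) |> Array.minBy snd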

You can see from the logging messages that although it should take 7 steps to calculate the efficiency of any given starting node, it only takes two calculations plus one cache hit when calculating efficiency the second time, and two calculations plus three cache hits the third time. Notice also the simplicity of the caching layer – types are (as usual) inferred by the compiler (e.g. the Dictionary never has its type arguments supplied explicitly; they’re inferred from usage) – it just works.

In terms of performance, you can observe the effect that the caching has for varying network sizes below. Notice that the graph is scaled logarithmically – for more than around 20 relationships, the cost of not caching becomes enormous as a single cache hit can save literally thousands of steps walking the graph.

[Figure: time taken to compute network efficiency with and without caching, for varying network sizes (logarithmic scale)]

Conclusion

I found this a stimulating challenge to solve – not so much because of any specific domain problem, but because of the need to optimise the solution in a particular way, i.e. through caching. What I particularly liked was that F# allowed me to retain the overall feel of the algorithm while adding the caching layer very easily. In addition, adding the caching allowed me to simplify the overall algorithm – I didn’t need to worry about making four or five small, specific improvements; instead, a single optimisation combined with a simpler algorithm still gave a massive performance boost.

Tips on writing an EF-based ETL


I’ve been working on a relatively small ETL (that’s Extract-Transform-Load) process recently. It’s written in C# using EF4 and is designed to migrate some data – perhaps a million rows in total – from one database schema to another as a one-off job. Nothing particularly out of the ordinary there; the object mapping is somewhat tricky but nothing too difficult. There are a few things I’ve been doing to improve its performance, and I thought I’d share some of them with you.

Separate out query and command. This is probably the hardest thing to change after you get started, so think about it up front. What I mean is this: let’s say you write some code that navigates through a one-to-many object graph and, for each child, fires off a query to the DB to retrieve some other data, then acts on that retrieved data. You then discover that sometimes you’ll have 5,000 children in the graph, which equates to 5,000 queries being fired off to the DB. Instead of this, write a single set-based “where in”-style query which retrieves all the data you’ll need in one go. Then you can iterate over each item in memory, one at a time. This will give you a big performance boost. In order to do this, you need to construct your code so that you decouple the part that queries the data to retrieve a collection of objects from the part that operates on each single object one at a time, and then have a controlling method which orchestrates the two together. Doing this design up front is much, much easier than trying to retrofit it afterwards. In essence, if you know up front what data you need to operate on, try to pull it all in together rather than doing it bit by bit.
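The shape of that split looks something like the sketch below (in F# rather than the C# of the actual ETL, and with Child, Detail, loadDetails and processChild as invented stand-ins):

    type Detail = { Payload : string }       // hypothetical types standing in for the
    type Child  = { ChildId : int }          // real EF entities / DTOs

    /// `loadDetails` represents the single set-based "where in" query;
    /// `processChild` represents the command that acts on one child at a time.
    let migrateChildren (loadDetails : int[] -> (int * Detail)[])
                        (processChild : Child -> Detail -> unit)
                        (children : Child[]) =
        // one round trip: fetch every detail row we will need, keyed by child id...
        let details = children |> Array.map (fun c -> c.ChildId) |> loadDetails |> dict
        // ...then operate on each child purely in memory, one at a time
        children |> Array.iter (fun child -> processChild child details.[child.ChildId])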

Keep your object graphs as small as possible. By this I mean do not read in fields or elements of the graph that you do not need. If necessary, construct DTOs and use EF’s projection capabilities to construct them, rather than reading back an entire EF-generated object graph when you only need 5 properties out of the 150 available.

Use the VS performance profiler. Or ANTS etc. The best part of the VS perf tool is the Tier Interaction Profiler (TIP). This monitors all the queries you fire off to the DB and shows you stats like how many of them there were, how long they took etc. – great for finding bottlenecks.

Avoid lazy loading. It’s seductive but will again negatively impact performance by quietly hammering the database without you realising it.

Use compiled EF queries. For queries that you are repeatedly calling – especially complex ones – compiling them will give you a nice boost in performance.

Keep the “destination” EF context load as low as possible. In the context of EF, this means NOT doing things like using a single session for the entire process – it’ll just get bigger and bigger as you add stuff to it. Instead, try to keep contexts relatively short-lived – perhaps one per x source aggregates that you process.

Use No-Tracking for read-only (source) EF contexts. This means you can essentially just re-use the same context across the whole application as the context just becomes a gateway to the DB, but nothing more.

Do not batch up too much in one go. The ObjectStateManager that tracks changes behind the scenes of an EF context is a glutton for memory. If you have a large graph to save, at even 500 or 1,000 insertions you’ll see your memory footprint creeping up and up; calling SaveChanges() on a regular basis can alleviate this (at least, that’s the behaviour I’ve observed).
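A rough sketch of the batching idea (IDestinationContext, createContext and copyRow below are invented stand-ins for the EF context and the per-row mapping code):

    open System

    /// Just enough shape to illustrate the pattern - in reality this would be
    /// your EF-generated context type.
    type IDestinationContext =
        inherit IDisposable
        abstract member SaveChanges : unit -> int

    let migrateInBatches batchSize (createContext : unit -> IDestinationContext) copyRow rows =
        rows
        |> Seq.chunkBySize batchSize
        |> Seq.iter (fun batch ->
            use ctx = createContext ()          // fresh, empty change tracker per batch
            batch |> Array.iter (copyRow ctx)   // map and attach this batch's rows
            ctx.SaveChanges() |> ignore)        // flush the batch, then dispose the context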

Separate out writing reference data and entity data. If you are inserting reference data, create an entirely separate context for it. Do NOT share those entities with your main “destination” entity model; instead, just refer to reference data items by ID. The benefit of doing this is that you can much more easily cache your reference data without falling foul of things like attaching the same object to multiple contexts.

Ask yourself if you really need to use EF. There are many forward-only data-access mechanisms available in .NET that out-perform EF. For reading from the source database, you could use something like Dapper or Massive instead of EF. I can’t comment on the performance of Massive, but Dapper is certainly extremely quick. You will lose the ability to write LINQ queries, though, and will have to construct your entire source database domain model manually. Again, that may not be such a bad thing if you design DTOs that work well in a set-based query environment.
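As an illustration of the kind of thing that’s possible with Dapper (a hypothetical sketch – the type, SQL and table name are invented):

    open System.Data.SqlClient
    open Dapper

    [<CLIMutable>]                      // parameterless constructor + settable properties
    type SourceCustomer =               // so that Dapper can materialise the record
        { Id : int; Name : string }

    /// Read the source rows with a plain SQL query via Dapper rather than EF.
    let loadSourceCustomers (connectionString : string) =
        use conn = new SqlConnection(connectionString)
        conn.Query<SourceCustomer>("select Id, Name from dbo.Customer")
        |> Seq.toArray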

Let there be LINQ


Just a quick post regarding use of the let keyword in LINQ, which I find to be somewhat under-used by many people. Whilst one benefit of it is readability (i.e. aliasing sections of complex queries to aid understanding), another benefit can be performance.

There is indeed a cost associated with using it: every time you use it, you’re effectively creating a new anonymous type to hold that value plus whatever the previous result in your query pipeline was. So if you chain up lots of lets in a query, that will have an impact on the query. However, there is a case where let can give large performance benefits: –

[Code sample: a LINQ query that calls Convert.ToInt32() on the same value multiple times within its clauses]

Compare that code with the following: –

[Code sample: the same query with the conversion hoisted into a single let clause and reused]

This will eliminate a massive number of calls to Convert.ToInt32() and reduce the time taken to process that query by around 40%; the former sample took ~1400ms to run whereas the latter took only around 800ms.
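The original samples were screenshots, so they aren’t reproduced here, but the shape of the change is roughly the following (illustrated as an F# query expression, which supports the same let clause – the data and filter are invented):

    open System

    let values = [ for i in 1 .. 100000 -> string (i % 500) ]

    // Without let: Convert.ToInt32 runs several times per element.
    let withoutLet =
        query {
            for v in values do
            where (Convert.ToInt32 v > 250 && Convert.ToInt32 v % 2 = 0)
            select (Convert.ToInt32 v)
        } |> Seq.toArray

    // With let: convert once per element and reuse the result.
    let withLet =
        query {
            for v in values do
            let n = Convert.ToInt32 v
            where (n > 250 && n % 2 = 0)
            select n
        } |> Seq.toArray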