Determining graph sizes efficiently with F#

Introduction

Continuing on with my gaming-in-F# post, this week’s post is derived from this challenge. The initial challenge is, for each node, to determine the “efficiency” of a node within a network, that is, to calculate the maximum number of hops it takes to reach all edges of the graph starting from that node. So given the graph below, starting from node 3, it’s 2 hops (in this case, to all edges i.e. 1, 5, 6 and 8). However, if you were to start at, say, node 2, it’d be 3 hops (1 hop to node 1, but 3 hops to nodes 5, 6 and 8). Therefore, node 3 is considered a more “efficient” node. Secondly, you should determine the most efficient node in the whole network – the objective being to calculate the most effective starting spot in a graph from which to reach all other nodes in the entire network. Disclaimer: I should mention that, whilst I’ve been writing coding for a number of years, the degree I took was not heavily computer-science related – so things like big O notation, complexity theory etc. – all these are things I’ve learned over the year but never formally studied. So some of things below may seem second nature to you if you come from more of a maths background!

I identified three main elements in solving this challenge.

Representing the data

Relations are provided as as a simple int * int tuple i.e. (1,3) means that there’s a (bi-directional) connection between nodes 1 and 3. So I build a simple Map<int, int []>, which is essentially a lookup that says “for a given node, give me the list of connected nodes”. Note that I decided not to use a “proper” representation of the graph here – an idiomatic way might have been a discriminated union with Node and Leaf etc. etc… – a simple Map proved enough for me.

Implementing the algorithm

Firstly, implement the logic to calculate the maximum distance to all edges, for every starting point in the graph. I solved this with (allegedly) what is essentially a recursive depth-first search. In other words, navigate from the starting point as far outwards as you can; then, walk back in until you find another branch, and go out there. Once you start walking back in, start counting how many steps it is until you reach a branch point. Once you have calculated the distance all of branches, take the largest one.  Repeat this until you have exhausted all points in the graph and walked back to the starting point.

It should be now a simple task to simply apply this algorithm to all nodes in the graph, and then take the smallest one – this is the most efficient point in the graph from where to start.

Improving efficiency

Note that this wasn’t my first solution! The biggest challenge came when one of the exercises in the website provided a large set of connections in the graph – say, 30,000. At this point, my algorithm simply didn’t perform, so I had to look at some ways to improve performance. I tried a load of different things, each of which yielded some performance improvement, but not enough: –

1. Moving from Sequences to Arrays. Sequences are flexible and sometimes very useful, but Arrays generally will outperform it for maps and filters etc., particularly if you are repeating an operation over the same sequence many times (although there is Seq.cache).
2. Added state tracking. For each starting point, I would record the efficiency, and then provide that number to the next starting point. Whenever I encountered a graph that had a size at least equal to the score of the “most efficient node” found so far, I would immediately stop looking at that node and backtrack all the way out. This provided a good boost in performance, but not enough.
3. I also experimented with either minor algorithmic improvements, such as prematurely exiting a particular route of the graph if we identified any single route that exceeded the current best size rather than iterating all children nodes and evaluating them together.

None of these solutions gave an optimal solution – all they did was increase the complexity of the solution at the cost of moderate performance gains. I realised that there must be another approach that I was missing that would provide the solution. Eventually I realised how I could probably largely improve efficiency – because when you start from two separate nodes, there’s usually a large amount of repeated traversals across them both. Take the above graph – you can view the network efficiency of e.g. node 3 as either described above, or as (the highest efficiency of all adjacent nodes for that subset of the graph) + 1.

In the image below, we know that Node 3 has an efficiency of 2 because the efficiency of Node 4 is 1, Node 7 is 1 and Node 2 is 1. Take the maximum of these (1), add 1, and we get 2. So, given this, why not simply cache the results for every backtrack score from each pass? We can modify our above traversal code with some memoization backed by a simple Dictionary (apologies for using a mutable Dictionary – you could probably use an immutable Map if you wanted, although performance would probably suffer a bit) and then before making any outward facing movement in the graph, we check if that movement has already been made – if so, we can just use that result. This is why the final algorithm counts inwards from the edges rather than outwards – in order to allow caching of distances.

You can see from the logging messages above that although it should take 7 steps to calculate the efficiency of any given starting node, it only takes two calculations + one cache hit when calculating efficiency the second time, and two calcs + three cache hits the third time. Notice also the simplicity of the caching layer – types are (as usual) inferred by the compiler, and automatic generalization is used here too (e.g. notice that the Dictionary never has type arguments supplied) – it just works.

In terms of performance, you can observe the effect that the caching has for varying network sizes below. Notice that the graph is scaled logarithmically – for more than around 20 relationships, the cost of not caching becomes enormous as a single cache hit can save literally thousands of steps walking the graph. Conclusion

I found this to be a stimulating challenge to solve. It wasn’t necessarily because of the challenge of some specific domain solving issue but rather one regarding optimising a solution in a specific manner i.e. caching. What I particularly liked was that F# allowed us to retain the overall feel of the algorithm, but add in the caching layer very easily. In addition, adding in caching allowed me to simplify the overall algorithm – I didn’t need to worry about making four or five small, specific improvements – instead, a single optimisation allowed me to combine it with a simpler algorithm, yet still get a massive performance boost.