This sounds like a fairly mundane task – one which we probably do all the time. Yet, particularly with C#3 and the System.Enumerable namespace, there are a number of ways of achieving the following task: –
Find a single item in a collection based on its ID
How often do we do something like this? Perhaps you have a collection of employees and you wish to find one based on their payroll number. Or resolve a lookup from another entity which isn’t directly linked (e.g. some reference data).
A colleague of mine recently had an issue where they were performing some code on loading a Form which, under certain conditions, was taking ridiculous amounts of time to complete – I’m talking five minutes or more. The code looked something like this: –
We traced the problem to repeatedly calling the Single () method over a large collection of object – there were a collection of several thousand objects, and for each one, the code needed to populate a field on that object based on a lookup value contained in another collection. This second collection held perhaps 50K items. We made massive time savings by pushing the lookup collection into a Dictionary (keyed on the ID) and using that instead – I’m talking huge savings.
So, I figured that I’d try to write a simple tool to compare a few common methods for finding a single item in an IEnumerable<T> collection, and seeing how those search methods performed differently. There is only so far I took it, but the upshot is that I wrote an application that creates a collection of x objects with a single property, ID (which is an int) rising sequentially. So a collection of 10,000 would look something like this: –
I then generate a second collection container just integers, which is my query data. So if my sample data is 10,000 objects long, my query data would contain perhaps 1,000 integers randomly allocated between 0 – 9,999.
Finally, I push those two collections into a variety of sort algorithms built into some of the .NET collections, and then record how they perform. The main collections & algorithms that I tried were: –
Enumerable.First: returns the first item in the collection matching the predicate provided.
Enumerable.Single: returns the single item in the collection matching the predicate provided. If more than one item is found, an exception is thrown.
List.Find: returns the first item in the collection matching the predicate provided.
List.BinarySearch: returns the first item in the List<T> matching the example object T provided. The List<T> must be sorted using an IComparer.
Dictionary[key]: returns the item T in the Dictionary keyed on some unique value Q.
Obviously I expected Dictionary to be the quickest – and it was – but I didn’t realise the differences in implementation between some of the other above methods which led to very different search results.
For small-ish collections (say, 1,000 items and 500 queries), search times are fairly similar. There’s obviously a cost associated with creation of a Dictionary or List (which is at least equal to the cost of iterating over the entire collection), as well as sorting the list for BinarySearch, plus with small collections there’s probably little difference anyway.
However, here are some other sample sizes:-
2000 items, 1000 queries
5000 items, 2500 queries
10,000 items, 5000 queries
25,000 items, 10,000 queries
50,000 items, 25,000 queries
These are also linear scaled, so at greater values you lose sight in the progress bars of just how much quicker the bottom two search methods are – just look at the ratio figures on the right. Some conclusions to be had from these results…
Use Single sparingly
I was initially surprised at just how much slower IEnumerable.Single is than IEnumerable.First until I actually thought about it (and looked at the code in Reflector 🙂 – Single has to iterate through every item in the collection, even after finding the item, in order to confirm that there are no other instances. So the time taken is probably roughly equal to the cost of iterating through the entire collection plus executing your predicate.
So whilst (from a business logic point-of-view) it might make sense to use this, unless there is a risk that there is ever going to be more than one item based on your predicate (or it doesn’t matter), use First – it only has to traverse the collection until it has found the item you’re looking for, at which point it exits. This leads it to be, on average, almost twice as first as Single.
Use BinarySearch for sorted collections
The cost of creating the list and sorting it, again assuming that you have a integer-based sort key, is very low. Even for a string-based sort key, once you have sorted the collection, the search algorithm is so much faster that only over large collections does it start to perform slower than a Dictionary. In fact, for small collections, it’s faster. The only “cost” from a development point of view is the need to create your own IComparer – not a massive overhead, but slightly more work from a coding point of view.
Use a Dictionary for keyed collections
As long as you can generate a decent key – and in examples such as the one above, there’s one ready-made for you – Dictionary is a quick and easy way of searching repeatedly through a collection of data for specific items. The CPU cost of generating the Dictionary is very low, and the benefits are seen within just a few queries afterwards. The speed of lookups does not vary greatly with larger collections, and it’s easy from a coding point of view to access them.