Archive
Debunking the LINQ “magic” myth again
I’ve blogged before about how LINQ-to-Objects, at its most basic, is just built on top of enumerating over collections, one item at a time, via MoveNext(). It wraps this up in a beautiful API, but it’s still generally crawling through collections. I wanted to give an example of this in more depth and show how the C# team introduced some smart optimisations where possible.
And I’m sure that Jon Skeet’s fantastic EduLinq blog post goes into more detail than here, but anyway…
LINQ’s Any() vs Count()
I know a lot of people who tend to check whether a collection is empty with a collection.Count() == 0 check (or > 0 for non-empty ones). This obviously works, but there’s a better alternative, i.e. collection.Any(). There are two reasons why you should favour this over Count: -
Readability
Compare the two examples below: -
- if (employees.Count() > 0)
- if (employees.Any())
The latter reads much better. The former invariably means that in your head you’re doing a “conversion” to “if there are any items”.
Performance
How does .Count() work? It simply iterates over every item in the collection to determine how many there are. In this situation, .Any() will be much, much quicker because it simply tests if there is at least one item in the collection i.e. does MoveNext() return true or not.
Decompiling LINQ
There’s one situation where the above is not true. If the underlying enumerable collection implements either version of ICollection, and you’re calling the parameter-less version of Count(), that extension method is smart enough to simply delegate to the already-calculated Count property. Again – this only applies to the parameter-less version! So for the version of Count that takes in a predicate, the first of the next two samples will generally be much quicker than the latter: -
- employees.Any(e => e.Age < 25);
- employees.Count(e => e.Age < 25) > 0;
How can I prove this? Well, newer versions of CodeRush have a built-in decompiler so you can look at the source code of these methods (like Reflector does). I’m sure there are other tools out there that do the same… anyway, here’s an (ever so slightly simplified) sample of the implementations of the Count() and Any() methods. First, the parameter-less versions: -
The former optimises where possible, but if it can’t, has to fall back to iterating over the entire collection. The latter simply falls out if it succeeds in finding the first element.
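As a rough illustration of the behaviour just described, here is a simplified sketch of the two parameter-less methods. It is not the decompiled framework source: the class and method names (LinqSketch, CountSketch, AnySketch) are made up for this example, and they are written as plain static methods rather than extension methods.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

public static class LinqSketch
{
    // Parameter-less Count: reuse the stored Count if the source is a collection,
    // otherwise walk the entire sequence with MoveNext().
    public static int CountSketch<T>(IEnumerable<T> source)
    {
        if (source == null) throw new ArgumentNullException("source");

        ICollection<T> genericCollection = source as ICollection<T>;
        if (genericCollection != null) return genericCollection.Count;

        ICollection collection = source as ICollection;
        if (collection != null) return collection.Count;

        int count = 0;
        using (IEnumerator<T> e = source.GetEnumerator())
        {
            while (e.MoveNext()) count++;
        }
        return count;
    }

    // Parameter-less Any: a single MoveNext() call answers the question.
    public static bool AnySketch<T>(IEnumerable<T> source)
    {
        if (source == null) throw new ArgumentNullException("source");

        using (IEnumerator<T> e = source.GetEnumerator())
        {
            return e.MoveNext();
        }
    }
}
```

Calling LinqSketch.CountSketch on a List<T> takes the first branch and never enumerates, whereas calling it on the result of an iterator method falls through to the loop.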
Now here’s the predicated versions of those methods: -
The former has to iterate over every item in the collection to determine how many passed the predicate – there’s no optimisation possible here. The latter simply finds the first item in the collection that passes the predicate and breaks there.
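Again only as a sketch under the same assumptions (hypothetical names, plain static methods, not the real framework code), the predicated versions might look like this:

```csharp
using System;
using System.Collections.Generic;

public static class PredicateSketch
{
    // Count with a predicate: every item must be tested, so the whole source is enumerated.
    public static int CountSketch<T>(IEnumerable<T> source, Func<T, bool> predicate)
    {
        int count = 0;
        foreach (T item in source)
        {
            if (predicate(item)) count++;
        }
        return count;
    }

    // Any with a predicate: stop as soon as one item passes.
    public static bool AnySketch<T>(IEnumerable<T> source, Func<T, bool> predicate)
    {
        foreach (T item in source)
        {
            if (predicate(item)) return true;
        }
        return false;
    }
}
```

With ages of 30, 20, 40 and 22 and a predicate of age < 25, AnySketch evaluates the predicate twice and returns, while CountSketch evaluates it four times.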
Conclusion
Having read through those code samples, re-read the initial example of Any versus Count and ask yourself why you would use Count.
Again, just to drill home the point – there is no magic to LINQ. The C# developers did a great job optimising code where possible, but at the end of the day, when you have several options open to you regarding queries, make sure you choose the correct option for the right reason to ensure your code performs as well as possible.
Using Aggregate in LINQ
The System.Linq namespace has a load of useful extension methods like Where, Select etc. etc. that allow us to chain up bits of code that operate over sequences of data, allowing us to apply functional-style programming to our data.
There is one method which is often overlooked, yet it is probably the one that lends itself best to functional programming: the Aggregate() method. Unlike methods such as Select, which, given a set of n items, projects a set of n other items, Aggregate can be used as a way of merging a collection of items into a different number of items (most commonly a single value). Indeed, some LINQ methods can be implemented easily with Aggregate, such as Sum: -
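A minimal sketch of Sum written in terms of Aggregate, assuming a sequence of integers; the class and method names are invented for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class AggregateSum
{
    public static int SumWithAggregate(IEnumerable<int> numbers)
    {
        // No seed is supplied, so the first element becomes the initial accumulator.
        // This overload throws InvalidOperationException for an empty sequence.
        return numbers.Aggregate((accumulator, value) => accumulator + value);
    }
}
```

For the numbers 1 to 10, AggregateSum.SumWithAggregate(Enumerable.Range(1, 10)) gives the same answer as Sum(), i.e. 55.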
The syntax looks a bit bizarre, especially when you look at the function signature of the method (including the overloads), but essentially the method takes in a function which itself takes in two values and returns another: -
- accumulator, which is an arbitrary object which is passed through every item in the collection. In the default overload of Aggregate, this is the same type as the source collection e.g. Int32.
- value, which is the next value in the chain e.g. 1, then 2, then 3 etc.
- result, which will become the accumulator in the next iteration
So, if we were to expand the above bit of code with debugging statements etc., it would look something like this: -
Note that with the default function overload, the initial value of the accumulator is the first value in the collection (1), aka the “seed” value.
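To make the accumulator / value / result flow concrete, here is a hand-written equivalent of the seedless overload for a sum. This is a sketch, not the framework implementation: the names are hypothetical and the trace list simply stands in for debugging statements.

```csharp
using System;
using System.Collections.Generic;

public static class AggregateTrace
{
    public static int SumWithTrace(IEnumerable<int> numbers, List<string> trace)
    {
        using (IEnumerator<int> e = numbers.GetEnumerator())
        {
            if (!e.MoveNext())
                throw new InvalidOperationException("Sequence contains no elements");

            int accumulator = e.Current; // the first element acts as the seed

            while (e.MoveNext())
            {
                int value = e.Current;
                int result = accumulator + value;
                trace.Add(string.Format("accumulator={0}, value={1}, result={2}",
                                        accumulator, value, result));
                accumulator = result; // the result becomes the next accumulator
            }
            return accumulator;
        }
    }
}
```

For the input 1, 2, 3 the trace contains two entries (the function is only invoked from the second element onwards) and the method returns 6.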
More complex uses of Aggregate
Let’s say we wanted to print a single string out which is all of the numbers separated by a space e.g. “1 2 3 4 5 6 7 8 9 10”. Common LINQ methods wouldn’t be appropriate for this. You could use Select to get the string representation, but would get a sequence of 10 strings rather than a single one. You might now fall back to foreach loops etc., but this is where Aggregate is useful: -
This overload of Aggregate takes in two arguments – the first is a “seed value” which will be the initial value of the accumulator, in our case an empty String. Every iteration takes the accumulator, appends the next number to it and returns the resultant String as the next accumulator, which gives us the following (debug statements added): -
Simples! (obviously in a real world example you might use a StringBuilder as your Accumulator instead).
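A sketch of the seeded overload building the space-separated string, together with a StringBuilder variant along the lines suggested above. The method names are assumptions, and the check for an empty accumulator is one way (of several) to avoid a stray leading space.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class AggregateJoin
{
    // Seeded overload: the accumulator is a string, the source items are ints.
    public static string JoinWithSpaces(IEnumerable<int> numbers)
    {
        return numbers.Aggregate(
            string.Empty,
            (accumulator, value) => accumulator.Length == 0
                ? value.ToString()
                : accumulator + " " + value);
    }

    // Same idea with a StringBuilder accumulator and a final result selector.
    public static string JoinWithStringBuilder(IEnumerable<int> numbers)
    {
        return numbers.Aggregate(
            new StringBuilder(),
            (builder, value) => builder.Length == 0
                ? builder.Append(value)
                : builder.Append(' ').Append(value),
            builder => builder.ToString());
    }
}
```

The StringBuilder version avoids allocating a new string on every iteration, which matters more as the sequence grows.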
Notice how in this example we didn’t return the same type as the collection that we operated over (i.e. Int32). We can use this technique to do all sorts of funky things to collections that you might not have considered before.
Conclusion
Aggregate is a rarely used but extremely powerful LINQ method. In my next post, I’ll build on this showing some more powerful (and perhaps useful!) examples of Aggregate.
LINQ in C#2
Introduction
Continuing my series of posts on LINQ, I wanted to give a simple example as to how one can get the same sort of functionality in terms of query composition and lazy evaluation by using the yield keyword and without using any of C#3’s features. Bear in mind that LINQ was introduced as part of .NET 3.5, which itself runs on the same CLR as .NET 2. So everything that happens with LINQ is “just” a set of compiler tricks and syntactic sugar etc. – at runtime there’s nothing that happens that can’t be done manually with C#2.
Here’s the task we’ll tackle: Get the next 5 dates that fall on a weekend.
Streaming data
In purely non-LINQ terms, we could easily carry out this operation as a while loop, bespoke for the problem at hand. However, this wouldn’t offer any of the benefits that an API like LINQ offers e.g. composability and reusability of operations, which is what we’re trying to achieve – so let’s assume we’re trying to use a query-style mechanism; also, we want to try to create something more like a Date query API that we could use to write other, similar queries in future.
In C#3 using LINQ we might use Where() to filter out non-weekend days and then Take() to retrieve five items. But there’s an initial challenge that we encounter when trying to do this query with LINQ – what exactly do we query over – what set of data do we operate over? There’s no static “All Dates” property in .NET, and we don’t know in advance the set of dates to query over. This is where yield comes in. It allows us to easily create sequences of data that can be generated at run-time and queried over.
Take a look at this: -
Ignore the DisposableColour class – it just temporarily changes the colour of the console foreground. What’s more important is that this method returns something that masquerades as a collection of DateTimes – when in reality we’re generating an infinite stream of DateTimes, starting at the date argument provided. This collection has no end and you can never ToList() it to fully execute it. Well, you could try, but it would keep going until DateTime.MaxValue is reached. It simply generates dates on demand starting from the provided date.
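For reference, a minimal C#2-style sketch of such a stream. It leaves out the console tracing and the DisposableColour helper, and the FirstFew method is a hypothetical addition purely to show that the caller decides when to stop.

```csharp
using System;
using System.Collections.Generic;

public static class DateStreamSketch
{
    // Iterator block: nothing runs until a caller starts enumerating,
    // and each date is only produced when the caller asks for the next item.
    public static IEnumerable<DateTime> CreateDateStream(DateTime start)
    {
        DateTime current = start.Date;
        while (true)
        {
            yield return current;
            current = current.AddDays(1); // throws once the result would pass DateTime.MaxValue
        }
    }

    // Pulls only as many dates from the stream as requested.
    public static List<DateTime> FirstFew(DateTime start, int count)
    {
        List<DateTime> result = new List<DateTime>();
        foreach (DateTime date in CreateDateStream(start))
        {
            if (result.Count == count) break;
            result.Add(date);
        }
        return result;
    }
}
```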
Implementing composable query methods
Given this stream, we can write two other methods which firstly filter out dates that do not match Saturday or Sunday and another one which will only “take” a number of items from the sequence and then stop: -
Notice with the above method, we only yield out dates that match the provided days required, otherwise it doesn’t give back anything and implicitly iterates to the next item. Next, here’s a generic implementation of Take. It returns the next item in the collection, and then when it has returned the required number of items, breaks the foreach loop.
Consuming composable methods
Imagine that the methods above lived in a self-contained API that allowed us to easily query DateTimes – here’s how we could use it to answer our original question: -
All we do is generate the stream of DateTimes and then pipe them through the two other methods. The beauty of this is that because we’re yielding everything from the first method to the last, we only generate DateTimes that are required.
The key is the CreateDateStream method i.e. the stream of Dates. We cannot generate “every” date in advance – that would be grossly inefficient; it’s much better to dynamically create a stream as required.
- dateStream is a stream of all dates starting from DateTime.Today.
- weekendDays is the filtered stream of dates from dateStream that fall on Saturday or Sunday.
- nextUpcomingWeekendDays is the stream of the first 5 items from weekendDays.
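Putting the pieces together, here is one possible C#2-only sketch of the whole pipeline, reusing the three variable names described above. The method names FilterByDays and NextWeekendDays are assumptions, the tracing output is omitted, and the start date is passed in rather than taken from DateTime.Today so the result is repeatable.

```csharp
using System;
using System.Collections.Generic;

public static class DateQuerySketch
{
    // Endless stream of dates, generated on demand.
    public static IEnumerable<DateTime> CreateDateStream(DateTime start)
    {
        for (DateTime current = start.Date; ; current = current.AddDays(1))
            yield return current;
    }

    // Only yields dates that fall on one of the requested days.
    public static IEnumerable<DateTime> FilterByDays(IEnumerable<DateTime> source,
                                                     params DayOfWeek[] days)
    {
        foreach (DateTime date in source)
        {
            if (Array.IndexOf(days, date.DayOfWeek) >= 0)
                yield return date;
        }
    }

    // Generic Take: stops pulling from the source once enough items were returned.
    public static IEnumerable<T> Take<T>(IEnumerable<T> source, int count)
    {
        if (count <= 0) yield break;
        int taken = 0;
        foreach (T item in source)
        {
            yield return item;
            if (++taken == count) break;
        }
    }

    public static List<DateTime> NextWeekendDays(DateTime start, int count)
    {
        IEnumerable<DateTime> dateStream = CreateDateStream(start);
        IEnumerable<DateTime> weekendDays =
            FilterByDays(dateStream, DayOfWeek.Saturday, DayOfWeek.Sunday);
        IEnumerable<DateTime> nextUpcomingWeekendDays = Take(weekendDays, count);

        // Nothing has executed yet; copying into a list is what drives the pipeline.
        return new List<DateTime>(nextUpcomingWeekendDays);
    }
}
```

Starting from Monday 1 January 2024 and asking for five items, the stream is only advanced as far as Saturday 20 January.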
If we run the code above, we get the following output: -
Look at the messages in more detail. We only created enough dates until we matched 5 weekend days. Only those dates that fall into the required filter criteria get streamed into the Take() method, and only those fall out into Main. When we’ve taken enough, Take() breaks the loop which ends the foreach.
Conclusion
Yield is one of the key enablers for writing lazily-evaluated queries and collections. Without it, your queries would be less composable as well as less efficient; streaming out data as required allows us to only generate that part of the data that we still require.
In our example, we could get the next ten upcoming days without filtering, because Take also operates on IEnumerable<DateTime> – we can simply chain up our methods as and where needed.
Lastly – we could easily change the signatures of the three API methods to make them Extension Methods to give us a more LINQ-style DSL. It looks more like LINQ, but it’s still exactly the same code: -
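A sketch of what the extension-method form could look like. The method names (AsDateStream, FallingOn, TakeFirst) are invented here to avoid clashing with the real LINQ operators; the bodies are the same yield-based code, only the signatures gain a this modifier (which does require the C#3 compiler).

```csharp
using System;
using System.Collections.Generic;

public static class DateQueryExtensions
{
    public static IEnumerable<DateTime> AsDateStream(this DateTime start)
    {
        for (DateTime current = start.Date; ; current = current.AddDays(1))
            yield return current;
    }

    public static IEnumerable<DateTime> FallingOn(this IEnumerable<DateTime> source,
                                                  params DayOfWeek[] days)
    {
        foreach (DateTime date in source)
        {
            if (Array.IndexOf(days, date.DayOfWeek) >= 0)
                yield return date;
        }
    }

    public static IEnumerable<T> TakeFirst<T>(this IEnumerable<T> source, int count)
    {
        if (count <= 0) yield break;
        int taken = 0;
        foreach (T item in source)
        {
            yield return item;
            if (++taken == count) break;
        }
    }

    // The same query as before, now written as a fluent chain.
    public static List<DateTime> NextWeekendDays(DateTime start, int count)
    {
        return new List<DateTime>(
            start.AsDateStream()
                 .FallingOn(DayOfWeek.Saturday, DayOfWeek.Sunday)
                 .TakeFirst(count));
    }
}
```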
If you’re struggling with this, it might help you to write out the code yourself and step through it with the debugger to see the actual flow of messages, or try creating some simple yield-style collections yourself.
Psychic LINQ
A relatively short post on cargo cult programming, particularly related to LINQ.
LINQ is a fantastic technology. The idea of making a platform-agnostic query language is a fantastic idea. You can write the same query, in C#, over an in-memory list or a database and from the client point of view treat it in the same way. Isn’t it wonderful!
I’ve recently carried out a number of job interviews where candidates had to answer the following question:
If you wanted to find all Customers whose name was “Isaac”, why would you use a .Where() clause over a collection of T rather than using a foreach loop and manually construct the result set?
The results were varied. What I was looking for was a discussion of the benefits of declarative versus imperative code; what versus how etc.; composability of queries etc.
Strangely enough, the most common answer I got was "LINQ is faster than a foreach loop". Why? Either because LINQ somehow "optimises" code to make it faster, or because it "doesn’t need to loop through a collection – it just ‘does’ a where clause over the collection so that it instantly finds it". Almost as if C# is doing magic! In both cases the candidates could not justify their beliefs with evidence – it was just their feeling that that “must” be the case.
Now, let’s talk about the reality. I would direct everyone to Jon Skeet’s fantastic EduLinq blog series to get an in-depth understanding of how LINQ over objects works, but always remember this simplification:
The only methods and properties that LINQ has for all its query methods are sourced from IEnumerator<T>:
- Boolean MoveNext();
- T Current { get; }
That’s it. Think about that. There is no magic involved. If you do a Where() clause over a collection, you will enumerate the entire source. There is no “pixie dust” that will give LINQ the answer quicker than a foreach loop and an if / then statement – and bear in mind, foreach loops are just syntactic sugar over IEnumerator, just like the using statement wraps IDisposable.
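To illustrate the point, here is a sketch of a Where-style filter written directly against MoveNext() and Current, which is roughly what a foreach loop with an if statement expands to. The names are hypothetical, and the itemsVisited counter exists only to show that every element is touched.

```csharp
using System;
using System.Collections.Generic;

public static class WhereSketch
{
    public static List<T> FilterWithEnumerator<T>(IEnumerable<T> source,
                                                  Func<T, bool> predicate,
                                                  out int itemsVisited)
    {
        List<T> matches = new List<T>();
        itemsVisited = 0;

        using (IEnumerator<T> e = source.GetEnumerator())
        {
            while (e.MoveNext())          // the only way to advance
            {
                itemsVisited++;
                T item = e.Current;       // the only way to read
                if (predicate(item)) matches.Add(item);
            }
        }
        return matches;
    }
}
```

Filtering four names for “Isaac” visits all four items, exactly as the foreach version would.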
Let there be LINQ
Just a quick post regarding use of the let keyword in LINQ, which I find to be somewhat under-used by many people. Whilst one benefit of it can be readability (i.e. aliasing sections of complex queries to aid understanding of the query), the other benefit can be performance.
There is indeed a cost associated with using it, i.e. every time you use it, you’re effectively creating a new anonymous type to hold the new value plus whatever the previous result in your query pipeline was. So if you chain up lots of lets in a query, that’ll have an impact on the query’s performance. However, there is a case where let can give large performance benefits: -
Compare that code with the following: -
This will eliminate a massive number of calls to Convert.ToInt32() and reduce the time taken to process that query by around 40%; the former sample took ~1400ms to run whereas the latter took only around 800ms.
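The kind of difference being described can be sketched as follows. This is not the original benchmark: the query, the input data and the Conversions counter are all assumptions, chosen only to show how let removes repeated conversions of the same value.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LetSketch
{
    public static int Conversions; // counts calls to the conversion helper

    private static int ToInt(string text)
    {
        Conversions++;
        return Convert.ToInt32(text);
    }

    // Converts the same string again every time the value is needed.
    public static List<int> WithoutLet(IEnumerable<string> inputs)
    {
        var query = from text in inputs
                    where ToInt(text) > 10 && ToInt(text) % 2 == 0
                    select ToInt(text);
        return query.ToList();
    }

    // Converts each string once and reuses the result.
    public static List<int> WithLet(IEnumerable<string> inputs)
    {
        var query = from text in inputs
                    let number = ToInt(text)
                    where number > 10 && number % 2 == 0
                    select number;
        return query.ToList();
    }
}
```

For the inputs "4", "12", "15" and "20", the first query performs nine conversions and the second performs four; both return 12 and 20.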
Trying out RavenDB
I came across this the other day and decided to give it a whirl. My initial experiences have been generally quite positive.
I’m sure that the official site can explain it much better than I can but essentially RavenDB is different from a regular relational database in that there are no “tables” as such. Instead, the database stores entire object graphs as “documents”.
Disclaimer: This is not a conclusive review! Think of it more as my initial impressions and experiences given an evening playing around with it, looking through the online docs and searching through Google a bit.
Getting started
I was really impressed at how quick it was to get going – essentially just download and unzip the package from the website, run the Start batch file, and off you go! You get an HTTP endpoint that will return JSON for your queries (more on this later). You also get a Silverlight website which gives you a nice front end to query the database or just browse your documents (I should also point out that you can run the database “in process” as well). All in all, I had the server and was querying the database (you get a stock DB with some sample data) in under five minutes – really good.
From a coding point of view, it’s also very easy to get going. Add a reference to a couple of RavenDB assemblies (note – the Client folder contains the richest version of the API) and you’re ready to start working with the DB (it’s also on NuGet although I don’t know what version is on there). The API seems reasonably clear – the basic functions are accessible at the top level of the namespace, with more complex features in the .Advanced namespace – another good idea.
Adding documents
Adding data to RavenDB the first time was a painless experience. I wrote a simple routine to pull out all the music files on my hard disk and created a simple Artist/Albums/Tracks type hierarchy (very similar to what the sample Raven database contains, actually…). The code is as follows: -
I’ve elided the GetArtists() method – it just scans my hard disk for certain files and constructs the object graph. The session object is analogous to the ObjectContext for EF people – it follows the unit-of-work pattern etc., and the SaveChanges persists all modifications to the session. Easy.
The nicest thing is that this is literally all the coding required. No database creation scripts are required. No mappings to tables. No nothing – even EF’s code-first approach is more heavy-handed than this. In fact, in some ways this completely removes the impedance mismatch of database / object modelling – there is no ORM as the database stores the entire graph as a document.
Once an object is in the session, Raven change-tracks the objects, just like the EF ObjectContext, so you get updates for free.
This is all great stuff – you can be up and running, inserting stuff into your database on a clean install of Raven within a couple of minutes.
Querying the database
Here’s where things start to get interesting! So we’ve added a load of objects to our database, and now we want to query it. First off, there is a LINQ provider so that you can write simple queries without too much difficulty: -
Notice how the Id is stored as a String (this is the default – I think that you can amend the format though), but it works easily enough. Before someone flames me, I want to point out that there’s actually a Load() method which you should probably use for explicitly loading single entities rather than First().
You can also pretty easily do other basic queries that don’t do projections e.g.
I assume that RavenDB converts the IQueryable into HTTP requests, and in the Raven console you can see the queries coming in. I believe that RavenDB stores all your object graphs as JSON internally, and when you query this HTTP endpoint directly, that’s what you get back as well – nice: -
RavenDB Indexes
I don’t want to get into performance metrics too much here – I don’t know enough about RavenDB to go into depth about it – but I do want to talk about Indexes as they seem to be a key part of Raven.
Whenever you make a LINQ query, RavenDB will try to build an Index to speed up performance. An Index in RavenDB terms is not like a SQL Server Index – as I understand it, it’s more like a cached view of data based on a query – almost like a Stored Procedure which caches the results. The performance benefits are quite large – for example, in the example query above (with the Count() > 10), the first time I ran the query it took around 2800ms; the second time it took just 73ms. Raven will silently create these temporary indexes dynamically and update their results in the background (Raven makes no guarantees that Indexes will be up to date, although you can manually refresh them if required).
Accessing Indexes
There were some issues I had with indexes though. For example, these dynamic indexes get trashed when you stop and start Raven. So I thought “let’s try to save them so when we restart Raven it’s still nice and quick”. I couldn’t get it to work. Let’s say we have that Album Count query from earlier. Raven made an index automatically after the first time I executed the LINQ query. I then renamed it through the website so it got saved as a “permanent” index. When I restarted Raven and ran the same LINQ query, it didn’t know to hit that index so created a brand new temp index from scratch with exactly the same indexing query. If I tried to force Raven to use the saved index when writing the LINQ query on the client, it failed to do the range search and threw an exception. Even when I directly queried the index through the Silverlight UI as a Lucene query, I failed there too – it would treat the Count field as a text field and therefore treated 20 as less than 3. I’m sure that there’s a way to do this, but I couldn’t figure it out from a scan through the documentation.
Projections and MapReduce
Another time I had to get my hands dirty was with projections. Let’s say we want to get a result back from the DB which gives us a summary of all Artists names, the number of albums, and the total number of tracks. Normally in e.g. Entity Framework you can do something like this: -
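As a stand-in for that kind of query, here is a sketch of the projection run over in-memory objects with LINQ-to-Objects rather than through Entity Framework. The Artist / Album / Track classes and their members are assumptions, and a named ArtistSummary type is used instead of an anonymous type so that the result can be returned from a method.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Track { public string Title; }

public class Album
{
    public string Name;
    public List<Track> Tracks = new List<Track>();
}

public class Artist
{
    public string Name;
    public List<Album> Albums = new List<Album>();
}

public class ArtistSummary
{
    public string Name;
    public int AlbumCount;
    public int TrackCount;
}

public static class ProjectionSketch
{
    // One summary row per artist: name, number of albums, total number of tracks.
    public static List<ArtistSummary> Summarise(IEnumerable<Artist> artists)
    {
        var query = from artist in artists
                    select new ArtistSummary
                    {
                        Name = artist.Name,
                        AlbumCount = artist.Albums.Count,
                        TrackCount = artist.Albums.Sum(album => album.Tracks.Count)
                    };
        return query.ToList();
    }
}
```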
It won’t work in Raven. First it will complain because you can’t use an anonymous type in the projection. So you make a proper type – and then you simply get back a set of empty objects! As I understand it, this is the crux of the difference between document and relational databases. With a relational database, you can construct result sets by joining between tables etc., but you cannot do this with document databases (or rather, you don’t want to do this with document databases!).
You could get around this by reading all Artists onto the client and doing the projection there – but this would of course be inefficient (in fact, to discourage you from this sort of sloppiness, by default Raven will only return a maximum of 1024 documents in a single query and 30 queries per session!).
So how do we do projections? With indexes. In RavenDB, we can use MapReduce to construct pre-defined indexes – ironically these are written in LINQ, but are stored on the database rather than executed on the client. I found a few articles, including this great blog post, on writing them, so I won’t reproduce it here. Suffice it to say that you write a couple of LINQ queries to perform your projection and then query that index in code by name (although there is a strongly-typed method for querying indexes, too). It then “just works”, nice and quickly etc.
The biggest “issue” I have with this sort of approach is that your application becomes closely coupled to implementation details of your database. Why should you care that there’s an index on the database in order to retrieve a result set? By putting all of the queries in a repository you can abstract it away I guess – it’s just that I’m used to not having to care about that in EF etc. and suddenly now we have this mix of query code on the DB and queries on the client – it’s like we’re back in the land of stored procedures for CRUD. Perhaps the best way to think of it is as if the IQueryable implementation of RavenDB doesn’t support certain methods e.g. GroupBy, Sum etc. etc.
Another problem I had was with Contains e.g. Where(artist => artist.Name.Contains("van")); This initially did not work; I then discovered that it expects a Lucene-style query to be put in there e.g. "*van*". Then it works just great. But this, to my mind, changes the semantics of Contains – surely Contains should, by default, just do a wildcard search anyway?
Conclusion
I’ve only been playing with RavenDB for a few days, so this is by no means an exhaustive review or anything like that. I just was quite excited when I started using it and wanted to share my initial thoughts. There are probably mistakes in what I’ve written above – and in a sense that’s a good thing – all I’ve done so far is read through the RavenDB website and Google around a bit when I got stuck. And with that I was able, in literally just a few minutes, to get up and running with inserting and querying etc. The main problems I’ve encountered are more to do with the fact that one shouldn’t treat a document database like a relational database – they’re two different beasts that have different features and ways of working.
I’m really interested in using Raven more though – not only does it have some very nice features, and is easy to get up and running, but it’s a different way of looking at something that we often take for granted – I would urge you to give it a go as it might change the way you think about databases. Just be prepared to do a bit of digging around – I think that the documentation could be a bit deeper – a lot of the samples on the website don’t even mention the LINQ provider or how to create MapReduce indexes etc.
Interesting use of declarative coding
Having been using LINQ since it first came out, I feel that I’m only now starting to really appreciate some of the applications for declarative coding. When I first heard about it, I was pretty sceptical about it, but I’m actually becoming a pretty big fan of it now. Here’s an example that I thought about a few days ago.
Someone asked me to write some code that would print out the first five answers of the first five times’ tables e.g.
- 1 x 1 = 1, 1 x 2 = 2, 1 x 3 = 3, 1 x 4 = 4, 1 x 5 = 5
- 2 x 1 = 2, 2 x 2 = 4, 2 x 3 = 6, 2 x 4 = 8, 2 x 5 = 10
etc. etc.
I actually initially thought “let’s do this with LINQ” but on the spur of the moment went back to my imperative coding roots and fumbled around with a couple of nested for loops… it probably ended up looking something like this:
(Imagine that limit is a const int of 5)
for (int outer = 1; outer <= limit; outer++)
{
    for (int inner = 1; inner <= limit; inner++)
    {
        var answer = outer * inner;
        Console.Write (String.Format ("{0} x {1} = {2}, ", outer, inner, answer));
    }
    Console.WriteLine ();
}
(Please ignore the fact that there would be an extra comma at the end of each line…)
Anyway… I got back to thinking about how I could have written this in LINQ and (pretty quickly) ended up with this:
var numbers = from outer in Enumerable.Range (1, limit)
              from inner in Enumerable.Range (1, limit)
              let answer = outer * inner
              select new
              {
                  EndOfLine = inner == limit,
                  Text = String.Format ("{0} x {1} = {2}, ", outer, inner, answer)
              };
foreach (var number in numbers)
{
    Console.Write (number.Text);
    if (number.EndOfLine)
        Console.WriteLine ();
}
If you’re not used to working with LINQ queries and / or projection, this may seem a little strange to you. But it’s pretty simple really – the first two lines just build up a 5 x 5 matrix (remember that limit = 5), and then we simply apply the function to create a new anonymous type over it.
Takes up slightly more code, but to my mind, at least from a readability perspective, offers two main benefits over the imperative version: -
- Clear separation between the logic / construction of the dataset, and the effort of printing them out. You could change the source data without any work required on the bit of code to print it out. Of course you could do that in the first example, but it’s not natural to think of that – when writing with for loops I just seem to naturally mash the two parts together into one.
- Clarity of program flow. In the second example, it’s clear why we’re calling Console.WriteLine (), compared to the previous example where the WriteLine () is there almost implicitly at the end of the outer for loop. You can figure it out easily enough – but the second example, to me at least, conveys the intent better.
I actually also went one step further and took the second example to its logical conclusion and put the number.EndOfLine conditional within the anonymous type itself (as well as the Console.WriteLine bit):
var numbers = from outer in Enumerable.Range (1, limit)
              from inner in Enumerable.Range (1, limit)
              let endOfLine = inner == limit
              let answer = outer * inner
              let text = String.Format ("{0} x {1} = {2}, ", outer, inner, answer)
              select new
              {
                  Print = endOfLine
                      ? (Action) (() => Console.WriteLine (text))
                      : () => Console.Write (text)
              };
foreach (var number in numbers)
    number.Print ();
I’m not sure whether I like this or not – in some ways it’s quite elegant, but the amount of code has risen again, plus I’m not sure whether the practice of creating anonymous methods as the result of a conditional, as a property of an anonymous type, is really that great an idea as far as readability goes.
Using LINQ queries instead of “for” loops
LINQ can easily replace foreach loops, but when it comes to replacing old-style for loops, it’s a bit trickier – what exactly are you enumerating over? Nothing but a fictional set of integers.
Luckily the System.Linq.Enumerable class comes with a handy static method for just such occasions. Consider the following code: -
var list = new List <int> ();
for (int i = 0; i < 10; i++)
    list.Add (i * i);
You can actually easily rewrite this as a LINQ query like so: -
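var list = from i in System.Linq.Enumerable.Range (0, 10) select i * i;
The Range () method generates an IEnumerable <int> of sequential numbers for you to select over. Note that its second parameter is the number of items to generate rather than the upper bound, so Range (0, 10) yields 0 to 9, matching the loop above. The query also produces a lazily-evaluated IEnumerable <int> rather than a List <int>; call ToList () on it if you need an actual list.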
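A small sketch to check that the two forms really are equivalent, with the count passed in as a parameter; the class and method names are made up for the example.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class RangeSketch
{
    // Classic for loop building a list of squares.
    public static List<int> SquaresWithLoop(int count)
    {
        var list = new List<int>();
        for (int i = 0; i < count; i++)
            list.Add(i * i);
        return list;
    }

    // Same result via Enumerable.Range(start, count): the second argument is
    // how many numbers to produce, not the last number in the range.
    public static List<int> SquaresWithRange(int count)
    {
        return (from i in Enumerable.Range(0, count) select i * i).ToList();
    }
}
```

Both methods return ten items for a count of 10, ending with 81.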
I think it’s very nice
More about LINQ
Some more decent LINQ videos, mostly featuring Luca Bolognese, who was/is Lead Programme Manager for the LINQ C# team. Excellent speaker.
- LINQ to SQL Pipeline Video with Luca Bolognese and Matt Warren
- Another TechEd Video (I think!)
- Anders Hejlsberg: The .NET Language Integrated Query (LINQ) Framework Overview
I’ve also found out the answer (or an answer) to my question regarding LINQ to SQL security – first I thought of using Stored Procedures as a way of doing this – but that’s going to be inefficient. Imagine the following scenario:
You want to build some set of complex queries from Northwind. Lots of different varieties of queries, perhaps some dynamic queries. Several options:
- You can write a load of SPs, one for each of them, but that kind of defeats the object of LINQ in a way as you are doing all the queries in SQL as SPs. But it works, I guess.
- You can use a set of core SPs such as CustomerSelAll, CustomerSelByCountry etc. – and from there use LINQ to generate subqueries on that data. Problem with this approach is that you end up bringing large amounts of data back to the client and then only using a small amount of it.
- You can ignore SPs completely and just use LINQ directly to the database. Easiest option but no security on views of data.
- OR – you can use Table-Valued Functions (TVFs). These are – as far as I can see – much the same as SPs which return a result set, except that these return a result set that you can query directly e.g.
SELECT *
FROM MyTvf()
WHERE SomeField = SomeValue
You can’t do this with an SP. So you can use this in conjunction with LINQ so that you can secure the function, and still do a query against that result set on SQL before returning it to the client. So if you have a TVF which returns all Customers by Country, you could do something like this in LINQ:
var Query = from Cust in MyDataContext.CustomerSelByCountry ("UK")
where Cust.Orders.Count > 5
select Cust;
This will end up doing a query in T-SQL which calls the TVF CustomerSelByCountry and does a WHERE clause against it, and only then returns the results to the client. I checked the Query Plan and it does look like two selects take place though i.e. the TVF runs first and then the second query which does the WHERE clause. So it’s probably not as quick to run as a custom function or SP which does the entire query in one go, but I think it’s a decent half-way house. You get the flexibility of using LINQ to write flexible queries on your data, yet you can secure your data through TVFs or SPs.
I think what we’ll end up doing – if possible – is to let an application do selections of data against raw tables, but only allow modifications of data using SPs, which can be secured. Who knows.

