Tips on writing an EF-based ETL

I’ve been working on a relatively small ETL (that’s Extract-Transform-Load) process recently. It’s been written in C# using EF4 and is designed to migrate some data – perhaps a million rows in total – from one database schema to another as a one-off job. Nothing particularly out of the ordinary there; the object mapping is somewhat tricky but nothing too difficult. There were a few things I’ve been doing lately to improve the performance of it, and I thought I’d share some of those with you.

Separate out query and command. This is probably the hardest thing to change after you get started, so think about this up front. What I mean by this is this: let’s say you write some code that navigates through a one-to-many object graph and for each child, fires off a query to the DB to retrieve some other data, and then acts on that bit of retrieved data. You then discover that sometimes, you’ll have 5,000 children in the graph, which equates to 5,000 queries being fired off to the DB. Instead of this, why not just write a single set-based query which performs a “where in”-style query to retrieve all the data you’ll need in one go. Then you can, in memory, iterate over each of them one at a time. This will give you a big performance boost. In order to do this, you need to be careful to construct your code such that you decouple the bit that queries the data to retrieve a collection of object and the part that operates on each single object one a time, and then have a controlling method which orchestrates the two together. Doing this design upfront is much, much easier than trying to do it afterwards. In essence, if you know up front what data you need to operate on, try to pull as much of that in together rather than doing it bit-by-bit.

Keep your object graphs as small as possible. By this I mean do not read in fields or elements of the graph that you do not need. If necessary, construct DTOs and use EF’s projection capabilities to construct them, rather than reading back an entire EF-generated object graph when you only need 5 properties out of the 150 available.

Use the VS performance profiler. Or ANTS etc. The best part of the VS perf tool is the Tier Interaction Profiler (TIP). This monitors all the queries you fire off to the DB and shows you stats like how many of them there were, how long they took etc. – great for finding bottlenecks.

Avoid lazy loading. It’s seductive but will again negatively impact performance by quietly hammering the database without you realising it.

Use compiled EF queries. For queries that you are repeatedly calling – especially complex ones – compiling them will give you a nice boost in performance.

Keep the “destination” EF context load as low as possible. In the context of EF, this means NOT doing things like using a single session for the entire process. It’ll just get bigger and bigger as you add stuff to it. Instead, try to keep them relatively short lived – perhaps one for x number of source aggregate that you process.

Use No-Tracking for read-only (source) EF contexts. This means you can essentially just re-use the same context across the whole application as the context just becomes a gateway to the DB, but nothing more.

Do not batch up too much in one go. The ObjectStateManager that tracks changes behind the scenes of an EF context is a glutton for memory. If you have a large graph to save, top even 500 or 1,000 insertions and you’ll see your memory footprint creeping up and up; calling SubmitChanges() on a regular basis can alleviate this (at least, that’s the behaviour that I’ve observed).

Separate out writing to reference and entity data. If you are inserting reference data, create an entirely separate context for it. Do NOT share the entities with your main “destination” entity model. Instead, just refer to reference data items by ID. The benefits of doing this are that you can much more easily cache your reference data without falling foul of things like attaching the same object into multiple contexts etc.

Ask yourself if you really need to use EF. There are many forward-only data-access mechanisms available in .NET that out-perform EF. For reading from the source database, you could use something like Dapper or Massive intead of EF. I can’t comment on the performance of Massive, but Dapper is certainly extremely quick. You will lose the ability to write LINQ queries though, and will have to manually construct your entire source database domain model. Again though, that may not be such a bad thing if you design DTOs that work well in a set-based query environment.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s