Had an interesting discussion on Twitter regarding storing relational data and how to essentially pre-render joins across data to improve performance at the cost of storage space. To my knowledge this isn’t really possible, and it goes against the ethos of relational databases anyway, i.e. avoid duplicating data, keep one record for any given piece of information, and so on. Anyway, the point being made was that storage space is dirt cheap these days, whereas CPU time isn’t.
A lot of people don’t really think of “data” and “compute” as two sides of the same coin, but they really are. Take the example above: joining two (or more) tables together in SQL will always be slower than reading from a single table where the result has already been pre-calculated. This sort of approach is more common with document databases, where you might have many ways of representing the same data, and indeed the same pieces of data duplicated across many documents, but it’s often frowned upon in the SQL world.
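To make the trade-off concrete, here is a minimal Python sketch of the idea (the table names and records are invented purely for illustration): the normalized shape performs the “join” on every read, while the denormalized shape duplicates the customer name into each order so a read becomes a single lookup.

```python
# Normalized: two "tables"; the join is performed on every read.
customers = {1: {"name": "Alice"}, 2: {"name": "Bob"}}
orders = [{"id": 100, "customer_id": 1, "total": 30}]

def order_with_customer(order_id):
    """Join orders to customers at query time (costs CPU per read)."""
    order = next(o for o in orders if o["id"] == order_id)
    return {**order, "customer_name": customers[order["customer_id"]]["name"]}

# Denormalized: the join is pre-rendered into each record, duplicating
# the customer name (costs storage) but making reads a plain lookup.
orders_denorm = {
    100: {"id": 100, "customer_id": 1, "total": 30, "customer_name": "Alice"},
}

# Both shapes yield the same record; only where the work happens differs.
print(order_with_customer(100))
print(orders_denorm[100])
```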
CPU and Data costs in Azure
So, to illustrate the disparity between data and CPU, I thought I would show what a budget of £100 / month buys you for data and compute in Azure (at the time of writing):
| Category | Tier | What £100 / month buys |
|---|---|---|
| Relational | SQL Web / Business | 1 × 83 GB database |
| Relational | SQL Web / Business | 2 × 26 GB databases |
| Relational | SQL Web / Business | 3 × 13 GB databases |
| CPU | Extra Small (1 GHz, 768 MB) | 10 instances |
| CPU | Small (1.6 GHz, 1.75 GB) | 2 instances |
| CPU | Medium (2 × 1.6 GHz, 3.5 GB) | 1 instance |
Look at those numbers.
You can have a single database up to 83 GB in size for £100, OR you can have 2.3 TB of Blob / Table / Queue storage. That’s an absolutely ridiculous amount in comparison – over 20x as much. Of course, with Table storage you don’t get a schema as such, and you have to worry about partitioning data etc. – it’s nowhere near as easy to get up and running as the SQL databases that we all know (and some of us love). Nor can you do any compute in the table store as you can with SQL, e.g. stored procedures, joins, group by and so on. But for pure data storage, there’s no comparison. Alternatively, compare our 2 TB+ of schema-free data with some CPU prices: two small instance VMs (or web / worker roles) go for the same amount, as does a single medium instance. I couldn’t even include SQL Premium or anything larger than a medium instance VM in the table above, as they clock in at over £100 a month.
Future compute can be traded for upfront data – the latter is relatively cheap compared to the former.
You can think of this in pure functional terms, e.g. a function Factorial which takes in a number and returns its factorial – you could either calculate that on demand, or calculate it once up front and store the results in a table. The cost of hosting those results is tiny compared to the cost of repeatedly calculating them on demand.
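As a sketch of that idea in Python (the range of 0–20 is an arbitrary choice for illustration), the “table” is just a dictionary built once, and every subsequent call is a lookup rather than a computation:

```python
import math

# Pre-calculate factorials 0..20 once, trading a little storage
# for zero per-request CPU cost.
FACTORIALS = {n: math.factorial(n) for n in range(21)}

def factorial(n: int) -> int:
    """Look up the pre-calculated result instead of recomputing it."""
    return FACTORIALS[n]

print(factorial(10))  # 3628800
```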
Now obviously, the above results never change; a more real-world example might be a system where you need to host two worker roles to handle your compute demands, which involve calculations over some data (e.g. website queries to find products that match some filter). Why not consider pre-calculating those results where possible and storing them in an Azure Table instead? A batch process could update the aggregations as required to keep the results as fresh as possible. Even if the table takes up, say, 50 GB, that’s a mere pittance compared to the cost of a single small worker role, which equates to roughly 1 TB of table storage in price terms. The same trade-off also works in reverse: compressing data to “save” on disk space costs you the CPU cycles required to decompress it every time you want to read it back out.
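The batch process described above might look something like this minimal Python sketch (the product catalogue and filter keys are invented for illustration; in the real scenario the lookup would live in an Azure Table rather than an in-memory dict):

```python
from collections import defaultdict

# Hypothetical product catalogue; in the scenario above this would
# live in SQL or Table storage rather than in memory.
PRODUCTS = [
    {"id": 1, "category": "books", "price": 8},
    {"id": 2, "category": "books", "price": 25},
    {"id": 3, "category": "games", "price": 40},
]

def rebuild_lookup(products):
    """Batch job: pre-calculate the result of each filter once, so a
    web query becomes a plain key lookup rather than a full scan."""
    lookup = defaultdict(list)
    for p in products:
        lookup[("category", p["category"])].append(p["id"])
        lookup[("under-10", p["price"] < 10)].append(p["id"])
    return dict(lookup)

# Re-run periodically to keep the pre-calculated results fresh.
LOOKUP = rebuild_lookup(PRODUCTS)
print(LOOKUP[("category", "books")])  # [1, 2]
```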
Makes you think a little bit (at least, it did for me)!