Will Edwards
programming

I have always loved programming - it's like Lego without gravity.

I started with BASIC on my ZX81, graduating to assembler and Turbo Pascal during my teens.

Developed phone OS software - engineer, architect, product manager - but got made irrelevant by the iPhone and redundant by Android.

These days I mostly work with data, big data and fitting big data onto small boxes.

Google: MoreSQL is Real


The post "NoSQL no more: let’s double down with MoreSQL" is said tongue-in-cheek. But it's more true and serious and real than the NoSQL crowd want to admit. And Alex made a sweet logo.

As far as I can determine, NoSQL really means two things - NoJoins and NoORM - usually together.

How did we get into NoSQL anyway?

It's been appreciated for a long time that joins in database queries can be bad for performance. MySQL made its name with an unreliable storage engine called ISAM that didn’t enforce foreign key constraints. The only thing going for MySQL was speed and cost, which was a really good combination. Everybody started using it.

The CAP theorem just confirmed everyone’s suspicions that relational databases don’t scale^[1].

Then Google published some papers on a fancy thing called map-reduce and their distributed data-stores and all the problems of ‘scale’. Amazon joined the scale-talk party. Everyone gasped and aped this, despite there being almost nobody working at web-scale. Ouch, I said that dreaded term, web-scale. Now I need a cup of tea to calm down.

So the rest of us set out to do away with joins, throw out SQL too and head down a new rabbit-hole of non-ORM object-something-mappings. Really the whole ORM vs not-quite-ORM-but-still-a-restrictive-set-of-cursor-based-interfaces thing is not making scaling easier for anyone - it's all about sufficiently smart compilers and just pushing the impedance mismatch between coder and data-store around. But that’s a digression from the main topic.

Everyone is hyping up and jumping on unreliable (but supposedly fast) data-stores with cute trendy names. Names that, at the end of this decade, will sound dated.

The thing is, we’ve been trying to emulate Google. Build like Google, build web-scale, and they will come (customers, that is. Well, strictly, users. Users are the product, but that’s yet another digression).

And what have Google been doing all this time?

Tenzing SQL. Turns out, they are running SQL over their distributed data store, with special handling for tricky things like joins, and their own procedure language as well. Lots of very real new techniques and applied research. Really moving things forward.

And now the F1 distributed RDBMS too:

a novel hybrid system that combines the scalability, fault tolerance, transparent sharding, and cost benefits so far available only in “NoSQL” systems with the usability, familiarity, and transactional guarantees expected from an RDBMS

What will we do now?

We look to Google to lead the web-scale way. It's just a question of time before the mainstream web-scale crowd discover the Tenzing paper and embark on an SQL layer over Hadoop or whatnot.

At the same time, PostgreSQL just keeps performing. (It's widely overlooked that it stores key-values too.)

Google has open-sourced a key-value store called LevelDB. It's a sorted key-value store. It's basically an index - the building block of a relational data-store. I can imagine a new open-source distributed SQL data-store that used LevelDB as the storage engine.
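To make the "sorted key-value store is basically an index" point concrete, here is a minimal sketch in plain Python. It is not LevelDB's API: `SortedKV`, `scan_prefix` and the `users` / `users_by_city` key layout are all hypothetical names invented for illustration, and an in-memory sorted list stands in for the on-disk sorted tables. The only idea it relies on is that keys are kept in order, so a prefix scan over composite keys behaves like a lookup on a secondary index.

```python
import bisect


class SortedKV:
    """Toy in-memory stand-in for a sorted key-value store (hypothetical, not LevelDB)."""

    def __init__(self):
        self._keys = []    # all keys, kept in sorted order
        self._values = {}  # key -> value

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key):
        return self._values.get(key)

    def scan_prefix(self, prefix):
        """Yield (key, value) for every tuple key that starts with `prefix`, in key order."""
        i = bisect.bisect_left(self._keys, prefix)
        while i < len(self._keys) and self._keys[i][:len(prefix)] == prefix:
            key = self._keys[i]
            yield key, self._values[key]
            i += 1


store = SortedKV()

# Each "row" is written twice: once under its primary key, and once as an
# index entry whose key embeds the indexed column followed by the primary key.
for user_id, name, city in [(3, "Cy", "Oslo"), (1, "Ada", "Oslo"), (2, "Bo", "Bergen")]:
    store.put(("users", user_id), {"name": name, "city": city})
    store.put(("users_by_city", city, user_id), None)

# Roughly "SELECT name FROM users WHERE city = 'Oslo'": walk the index range,
# then fetch each row by its primary key.
oslo_names = [
    store.get(("users", key[2]))["name"]
    for key, _ in store.scan_prefix(("users_by_city", "Oslo"))
]
```

A join is then a matter of doing those range scans against two such key spaces, which is why a sorted store looks like a plausible building block for an SQL layer.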

The technology that most excites me is fractal tree databases. Everyone is using B and B+ trees in their databases, and even filesystems. And then some clever chaps in academia invent and apply fractal trees and set up a business called Tokutek selling them. 20x to 80x performance, they claim.

SSTable (e.g. LevelDB) vs Fractal Tree would be an interesting performance comparison.

In my own benchmarking (of mainstream NoSQL and MySQL), every data-store is getting around 5K inserts/sec if you turn on any kind of reliability setting. This is because of my disks. Something that promises 20K or more on the same hardware is going to get me excited.

So let's get back to the ORM thing, shall we? I distrust ORM with a passion. The thickness of books attempting to explain the arcane gotchas of Hibernate can be measured in inches.

I understand PostgreSQL comes with an asynchronous driver, so I guess you might be able to emit a bunch of different queries and PostgreSQL might have an opportunity to reorder or merge them. But I rather doubt it does; that smacks of sufficiently-smart-compiler syndrome to me. (UPDATE: a commenter has confirmed that it doesn’t do pipelining.)

In my own experiments, I found databases to be fastest when my own code buffers DB interaction and reorders things and coalesces things so the DB is getting a few queries each with more payload. ORMs aren’t doing this. To get good performance you have to cut out all the abstractions and layers that are making you unsure what’s going over the wire to the DB and you have to write your own prepared query statements yourself, because you know the architecture of the other end. You are the only compiler sufficiently smart to work all this out.
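A minimal sketch of that kind of application-side buffering, assuming a simple key-value table. The original experiments are not shown here, so everything in this block is illustrative: `BufferedWriter`, the `kv` table and the batch size are hypothetical, and SQLite (via Python's standard `sqlite3` module) stands in for whatever database sits at the other end. Writes are coalesced per key, sorted, and sent as one hand-written parameterised statement inside one transaction per batch.

```python
import sqlite3


class BufferedWriter:
    """Buffers writes in application code and sends them in a few large batches."""

    def __init__(self, conn, batch_size=1000):
        self.conn = conn
        self.batch_size = batch_size
        self.pending = {}  # key -> latest value; repeated writes to a key coalesce
        self.flushes = 0

    def add(self, key, value):
        self.pending[key] = value
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        # Reorder by key so the database sees mostly sequential index writes.
        rows = sorted(self.pending.items())
        with self.conn:  # one transaction, and so one commit, per batch
            self.conn.executemany(
                "INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)", rows
            )
        self.pending = {}
        self.flushes += 1


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT)")

writer = BufferedWriter(conn, batch_size=1000)
for i in range(2500):
    writer.add(f"key{i:05d}", f"value{i}")
writer.flush()  # push the final partial batch

row_count = conn.execute("SELECT COUNT(*) FROM kv").fetchone()[0]
```

Here 2,500 logical writes reach the database as three transactions instead of 2,500. With durability settings turned on, the per-commit cost is usually what the disks charge you for, so fewer, fatter commits is where the win would be expected to come from.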

So I think this decade will be the decade web-scale rediscovers joins. The decade we all join the SQL renaissance. And, hopefully, the decade of the fractal tree.

Of course, mid-decade, Google could go publish yet another paper, this time sending us all off in some different direction. That’s a risk that’s worth hoping for.

^[1] CAP is almost too simple to explain; so many think they grasp it and its implications. That’s a rant for another time, though, trust me :)

Read all the way to the bottom? That’s a vote of confidence. Now go back to the referring site and vote with conscience! :)


posted 2012-01-24
