There is something really wrong with modern programmers. Very wrong
indeed.
Mailinator
creator Paul Tyma has a great
blog post on how he compresses your email by 90%. He keeps a simple LRU cache of lines and
consecutive lines from emails, so emails become lists of line-string pointers, most of them shared.
He also has a background thread doing rather stronger LZMA compression on large emails. He’s
winning.
(It’s well worth reading the Mailinator
blog post again and spending a minute thinking about how to store the individual lines from 3,500
emails per second; how you’d need to fetch them to recover an email if someone wanted to read
it; and so on. It turns out to be quite obvious that a normal Java hash-table and linked list is a just-fine
way of doing it, with excellent performance characteristics ;)
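To make the idea concrete: Mailinator itself is Java (a hash-table plus a linked list), but the same structure can be sketched in a few lines of Python, where OrderedDict happens to combine exactly those two pieces. The names (LineCache, intern) are mine, not Tyma’s, and this is a toy illustration of the scheme rather than his actual design:

```python
from collections import OrderedDict

class LineCache:
    """LRU cache of email lines: a hash table plus a linked list,
    which is exactly what OrderedDict combines."""

    def __init__(self, max_lines):
        self.max_lines = max_lines
        self.lines = OrderedDict()  # line text -> the one shared string object

    def intern(self, line):
        # Already cached? Reuse the shared object and mark it
        # most-recently-used.
        if line in self.lines:
            self.lines.move_to_end(line)
            return self.lines[line]
        # Otherwise cache it, evicting the least-recently-used line.
        self.lines[line] = line
        if len(self.lines) > self.max_lines:
            self.lines.popitem(last=False)
        return line

def store_email(cache, text):
    # An email becomes a list of references to (mostly shared) lines.
    return [cache.intern(line) for line in text.split("\n")]
```

Two emails with the same footer end up pointing at a single string object, which is where the 90% comes from: most lines in most emails have been seen before.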
Thing is, in one of the comments on the Hacker News version of the article, someone wonders why he
didn’t use Redis for the LRU.
I think of the kindest, least shouty way
to hint that that’s not such a good idea; I say it might be higher latency to block on a Redis
LRU than to use an in-process data-structure. Other commenters respond by saying
that Redis is known to be fast, and that they have just chosen node.js and Redis for
their startup because of that stack’s performance.
This is wrong on so many levels. So many levels. I don’t mean to single these people out - they are just a useful illustration. Their mindset is endemic in this industry. All around you, the new generation of programmers are making the same assumptions.
Let’s try to distill this: it’s comparing a hash-table lookup to… a hash-table lookup over IPC!
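You can see the gap on the back of an envelope. This isn’t a benchmark of Redis itself - just of what it costs to cross a socket on loopback versus staying in-process, before any real network, serialization or server-side work is added. A rough Python sketch (the echo server stands in for any out-of-process store):

```python
import socket
import threading
import time

def measure_local(n=1000):
    # The in-process case: a plain hash-table lookup.
    table = {"key": "value"}
    start = time.perf_counter()
    for _ in range(n):
        table["key"]
    return time.perf_counter() - start

def measure_loopback(n=1000):
    # The IPC case: every lookup pays at least one request/response
    # round trip over a loopback TCP socket.
    server = socket.socket()
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    port = server.getsockname()[1]

    def serve():
        conn, _ = server.accept()
        with conn:
            while True:
                data = conn.recv(64)
                if not data:
                    break
                conn.sendall(data)

    threading.Thread(target=serve, daemon=True).start()
    client = socket.create_connection(("127.0.0.1", port))
    client.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    start = time.perf_counter()
    for _ in range(n):
        client.sendall(b"GET key\n")
        client.recv(64)
    elapsed = time.perf_counter() - start
    client.close()
    server.close()
    return elapsed
```

On a typical machine the loopback version comes out orders of magnitude slower per lookup - and loopback is the flattering case, with no real network in the way.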
Increasingly, projects are websites. You don’t have to be a power programmer; you
can cut-n-paste JavaScript and run node.js, then go get MongoDB or Redis as a backend, and you’re
thinking you’re scalable! Get with the movement!
Now I happen to admire
node.js and Redis and their ilk. I’ve looked at their code-bases on occasion; I’ve
even contemplated contributing. This is not about them. It’s about the misusers of them.
The
people who misuse them are unlikely to have looked at their code-base, of course. They are unlikely to
actually understand anything about performance and scaling either.
Paul Tyma
describes Mailinator as being basically one machine with a handful of cores, and those were spare enough
to be used for optional background LZMA compression. The numbers are very impressive; there’s
far more traffic flowing through Mailinator this second than normal projects get in a lifetime!
Just
imagine that Mailinator used Redis (problems with LRU eviction making some emails unreadable and all the
other misfits aside). Or just generally being built in the web-2.0 backend style. Make a guess at how
many components and boxes it would take. Imagine how CAP would make it problematic. Imagine how blogs
could be written about scaling Mailinator 2.0 :)
My point is that Mailinator is
something that can be done on one box using pre-web-2.0 server technology.
There’s
a whole mindset - a modern movement - that solves things by working out how to link together a
constellation of different utility components: noSQL databases, frontends, load balancers, various
scripts and glue and so on. It’s not that one tool fits all; it’s that they want to use all the shiny new
tools. And this is held up as good architecture! Good separation. Good scaling.
It
might start small. Then it grows sideways: struggling with sharding, then chucking in a Redis
server for shared state, and so on. Each step in their scaling story is working out how to make
increasingly painful patches and introduce new cogs to mitigate the cogs they’ve already
added.
In real life these systems take a long time to develop, run slowly and cost
money to host. Luckily, most never actually have to scale. As they go live, the engineers are
increasingly running around wondering why things are breaking, blissfully unaware that TCP is fragile
and that you might get stalls, broken sockets and such in the Elastic Cloud. The client libs never do
seem to include reconnect logic, do they? Race conditions? Are they sure putting a SETNX lock in some
remote data-store is going to solve all the consistency issues?
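That missing reconnect logic is boring, which is presumably why it gets skipped. A generic sketch of what it looks like - names and backoff values are mine, and a real version would also need to worry about idempotency of the retried operation:

```python
import time

def with_reconnect(op, connect, retries=3, backoff=0.01):
    """Run op(conn), reconnecting with exponential backoff when the
    connection dies mid-call - the code the client libs leave out."""
    conn = connect()
    delay = backoff
    for attempt in range(retries + 1):
        try:
            return op(conn)
        except ConnectionError:
            if attempt == retries:
                raise  # out of retries; let the caller see the failure
            time.sleep(delay)
            delay *= 2
            conn = connect()  # the old socket is dead; make a fresh one
```

It only papers over transient stalls, of course; it does nothing for the consistency questions a remote lock raises.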
The first enemy of
performance is multiple machines. The fewer machines that need to talk to perform a transaction, the
quicker it is. If you can use a local in-process data-store (SQLite, LevelDB, BDB etc.) you’ll
be winning massively. You might even fit your whole app onto one box. If you can use a reasonably fast
runtime language (I don’t put the dynamic languages in this pile, sadly; V8 is not fast)
you might also squeeze it onto one box. You might not even need tiers.
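“In-process” is worth spelling out, because it’s the whole point: the database is a library linked into your process, so a query is a function call, not a network hop. Python ships SQLite in its standard library, so the minimal case looks like this (table and data are invented for illustration):

```python
import sqlite3

# An in-process store: no server process, no socket, no serialization
# across a wire - the engine runs inside your own process.
db = sqlite3.connect(":memory:")  # or a file path for a durable store
db.execute("CREATE TABLE emails (id INTEGER PRIMARY KEY, body TEXT)")
db.execute("INSERT INTO emails (body) VALUES (?)", ("hello world",))
db.commit()

row = db.execute("SELECT body FROM emails WHERE id = 1").fetchone()
```

Every step there is a local call; there is nothing to reconnect to, no partial failure between you and your data, and no CAP trade-off to blog about.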
And there’s
nothing slow about developing this way either. There’s nothing slow about breaking it into
tiers and cogs later. Nothing slow about lobotomising and splitting server apps when you have a ‘rich
man’s problem’ and need to scale. The only problem is that so many programmers are so sadly
unequipped to do so.
Rich Hickey has a great talk about this where he resurrects the excellent word complecting :)
I guess my lament is that we have so lost sight of mechanical sympathy.
This avoiding cogs mantra ought to be the default operating mode when considering a new
project. Cogs bad. Cogs bad! There, hopefully I’ve started us
a new movement :)
Footnote: Mailinator creator Paul Tyma has only good
posts, including opinions on asynchronous IO :) Go read his blog. Especially relevant are these
slides.
(As seen on Hacker News; more recently on Reddit; have you seen it anywhere else?)