There is something really wrong with modern programmers. Very wrong
indeed.
Mailinator
creator Paul Tyma has a great
blog post on how he compresses your email by 90%. He keeps a simple LRU cache of lines and
consecutive lines from emails, so emails become lists of line-string pointers, most of them shared.
He also has a background thread doing rather stronger LZMA compression on large emails. He’s
winning.
(It’s well worth reading the Mailinator
blog post again and spending a minute thinking about how to store the individual lines from 3,500
emails per second; how you’d need to fetch them to recover an email if someone wanted to read
it; and so on. It turns out to be quite obvious that a normal Java hash-table and linked list is a just-fine
way of doing it, with excellent performance characteristics ;)
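To make the idea concrete: Mailinator itself is Java (a hash-table plus a linked list), but the same structure can be sketched in a few lines of Python, where OrderedDict happens to combine exactly those two pieces. The names (LineCache, intern) are mine, not Tyma’s, and this is a toy illustration of the scheme rather than his actual design:

```python
from collections import OrderedDict

class LineCache:
    """LRU cache of email lines: a hash table plus a linked list,
    which is exactly what OrderedDict combines."""

    def __init__(self, max_lines):
        self.max_lines = max_lines
        self.lines = OrderedDict()  # line text -> the one shared string object

    def intern(self, line):
        # Already cached? Reuse the shared object and mark it
        # most-recently-used.
        if line in self.lines:
            self.lines.move_to_end(line)
            return self.lines[line]
        # Otherwise cache it, evicting the least-recently-used line.
        self.lines[line] = line
        if len(self.lines) > self.max_lines:
            self.lines.popitem(last=False)
        return line

def store_email(cache, text):
    # An email becomes a list of references to (mostly shared) lines.
    return [cache.intern(line) for line in text.split("\n")]
```

Two emails with the same footer end up pointing at a single string object, which is where the 90% comes from: most lines in most emails have been seen before.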
Thing is, in one of the comments on the Hacker News version of the article, someone wonders why he
didn’t use Redis for the LRU.
I think of the kindest, least shouty way
to hint that that’s not such a good idea; I say it might be higher latency to block on a Redis
LRU than to use an in-process data-structure. Other commenters respond by saying
that Redis is known to be fast, and that they have just chosen node.js and Redis for
their startup because of that stack’s performance.
This is wrong on so many levels. So many levels. I don’t mean to single these people out - they are just a useful illustration. Their mindset is endemic in this industry. All around you, the new generation of programmers are making the same assumptions.
Let’s try to distill this: it’s comparing a hash-table lookup to… a hash-table lookup over IPC!
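You can see the gap on the back of an envelope. This isn’t a benchmark of Redis itself - just of what it costs to cross a socket on loopback versus staying in-process, before any real network, serialization or server-side work is added. A rough Python sketch (the echo server stands in for any out-of-process store):

```python
import socket
import threading
import time

def measure_local(n=1000):
    # The in-process case: a plain hash-table lookup.
    table = {"key": "value"}
    start = time.perf_counter()
    for _ in range(n):
        table["key"]
    return time.perf_counter() - start

def measure_loopback(n=1000):
    # The IPC case: every lookup pays at least one request/response
    # round trip over a loopback TCP socket.
    server = socket.socket()
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    port = server.getsockname()[1]

    def serve():
        conn, _ = server.accept()
        with conn:
            while True:
                data = conn.recv(64)
                if not data:
                    break
                conn.sendall(data)

    threading.Thread(target=serve, daemon=True).start()
    client = socket.create_connection(("127.0.0.1", port))
    client.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    start = time.perf_counter()
    for _ in range(n):
        client.sendall(b"GET key\n")
        client.recv(64)
    elapsed = time.perf_counter() - start
    client.close()
    server.close()
    return elapsed
```

On a typical machine the loopback version comes out orders of magnitude slower per lookup - and loopback is the flattering case, with no real network in the way.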
Increasingly, projects are websites. You don’t have to be a power programmer; you
can cut-n-paste JavaScript and run node.js, then go get MongoDB or Redis as a backend, and you’re
thinking you’re scalable! Get with the movement!
Now I happen to admire
node.js and Redis and their ilk. I’ve looked at their code-bases on occasion; I’ve
even contemplated contributing. This is not about them. It’s about the misusers of them.
The
people who misuse them are unlikely to have looked at their code-base, of course. They are unlikely to
actually understand anything about performance and scaling either.
Paul Tyma
describes Mailinator as being basically one machine with a handful of cores, and those were spare enough
to be used for optional background LZMA compression. The numbers are very impressive; there’s
far more traffic flowing through Mailinator this second than normal projects get in a lifetime!
Just
imagine that Mailinator used Redis (problems with LRU eviction making some emails unreadable and all the
other misfits aside). Or just generally being built in the web-2.0 backend style. Make a guess at how
many components and boxes it would take. Imagine how CAP would make it problematic. Imagine how blogs
could be written about scaling Mailinator 2.0 :)
My point is that Mailinator is
something that can be done on one box using pre-web-2.0 server technology.
There’s
a whole mindset - a modern movement - that solves things by working out how to link together a
constellation of different utility components: noSQL databases, frontends, load balancers, various
scripts and glue and so on. It’s not that one tool fits all; it’s that they want to use all the shiny new
tools. And this is held up as good architecture! Good separation. Good scaling.
It
might start small. Then it grows sideways: struggling with sharding, then chucking in a Redis
server for shared state, and so on. Each step in their scaling story is working out how to make
increasingly painful patches and introduce new cogs to mitigate the cogs they’ve already
added.
In real life these systems take a long time to develop, run slowly and cost
money to host. Luckily, most never actually have to scale. As they go live, the engineers are
increasingly running around wondering why things are breaking, blissfully unaware that TCP is fragile
and that you might get stalls, broken sockets and such in the Elastic Cloud. The client libs never do
seem to include reconnect logic, do they? Race conditions? Are they sure putting a SETNX lock in some
remote data-store is going to solve all the consistency issues?
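That missing reconnect logic is boring, which is presumably why it gets skipped. A generic sketch of what it looks like - names and backoff values are mine, and a real version would also need to worry about idempotency of the retried operation:

```python
import time

def with_reconnect(op, connect, retries=3, backoff=0.01):
    """Run op(conn), reconnecting with exponential backoff when the
    connection dies mid-call - the code the client libs leave out."""
    conn = connect()
    delay = backoff
    for attempt in range(retries + 1):
        try:
            return op(conn)
        except ConnectionError:
            if attempt == retries:
                raise  # out of retries; let the caller see the failure
            time.sleep(delay)
            delay *= 2
            conn = connect()  # the old socket is dead; make a fresh one
```

It only papers over transient stalls, of course; it does nothing for the consistency questions a remote lock raises.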
The first enemy of
performance is multiple machines. The fewer machines that need to talk to perform a transaction, the
quicker it is. If you can use a local in-process data-store (SQLite, LevelDB, BDB etc.) you’ll
be winning massively. You might even fit your whole app onto one box. If you can use a reasonably fast
runtime language (I don’t put the dynamic languages in this pile, sadly; V8 is not fast)
you might also squeeze it onto one box. You might not even need tiers.
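“In-process” is worth spelling out, because it’s the whole point: the database is a library linked into your process, so a query is a function call, not a network hop. Python ships SQLite in its standard library, so the minimal case looks like this (table and data are invented for illustration):

```python
import sqlite3

# An in-process store: no server process, no socket, no serialization
# across a wire - the engine runs inside your own process.
db = sqlite3.connect(":memory:")  # or a file path for a durable store
db.execute("CREATE TABLE emails (id INTEGER PRIMARY KEY, body TEXT)")
db.execute("INSERT INTO emails (body) VALUES (?)", ("hello world",))
db.commit()

row = db.execute("SELECT body FROM emails WHERE id = 1").fetchone()
```

Every step there is a local call; there is nothing to reconnect to, no partial failure between you and your data, and no CAP trade-off to blog about.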
And there’s
nothing slow about developing this way either. There’s nothing slow about breaking it into
tiers and cogs later. Nothing slow about lobotomising and splitting server apps when you have a ‘rich
man’s problem’ and need to scale. The only problem is that so many programmers are so sadly
unequipped to do so.
Rich Hickey has a great talk about this where he resurrects the excellent word complecting :)
I guess my lament is that we have so lost sight of mechanical sympathy.
This avoiding cogs mantra ought to be the default operating mode when considering a new
project. Cogs bad. Cogs bad! There, hopefully I’ve started us
a new movement :)
Footnote: Mailinator creator Paul Tyma has only good
posts, including opinions on asynchronous IO :) Go read his blog. Especially relevant are these
slides.
(As seen on Hacker News; more recently on Reddit; have you seen it anywhere else?)