I have always loved programming - it's like Lego without gravity.

I started with BASIC on my ZX81, graduating to assembler and Turbo Pascal during my teens.

I developed phone OS software - as engineer, architect and product manager - but was made irrelevant by the iPhone and redundant by Android.

These days I mostly work with data: big data, and fitting big data onto small boxes.

How I got massively faster DB with in-process replication

In my essay How I got massively faster DB with async batching I described how to make a write-heavy database web-app go fast.  In this essay I will describe a technique I touched on there which makes read-heavy database web-apps go fast!

Most normal corporate database web apps are either blazingly fast or blazingly slow at the five-week mark: either nobody is using them, so they are blazingly fast, or everyone is using them, so they are blazingly slow.

This is because the straightforward way to connect a database to a web app is naive.

A very large majority of databases are read-heavy and small.  Easily small enough to be stored in application memory.  How many rows in your product database?  10K perhaps?  How many users in the corporate directory?  How many pages in the knowledge base?  If the answer is less than a few million, and if the updates are less than a few hundred a second (or are only in big bulk batched update periods, as is common with product databases), then this approach will work for you:

Simply store your database in RAM on your web server!

It's so obvious, you knew I was going to say it.  So why doesn't everybody do it?  (I'll talk about distribution below.)
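To make it concrete, here is a minimal sketch of the startup load - using the stdlib sqlite3 as a stand-in for the real database server, and a hypothetical products table:

```python
import sqlite3

# Stand-in for the authoritative database server; in real life this
# would be a connection to your MySQL/PostgreSQL box.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price REAL)")
db.executemany("INSERT INTO products VALUES (?, ?, ?)",
               [(1, "widget", 9.99), (2, "gadget", 24.50)])

# At web-worker startup, pull the whole table into process memory once...
products = {pk: {"name": name, "price": price}
            for pk, name, price in db.execute("SELECT id, name, price FROM products")}

# ...and every read thereafter is a plain dict lookup, no round-trip.
def get_product(product_id):
    return products.get(product_id)

print(get_product(2))   # {'name': 'gadget', 'price': 24.5}
```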

You obviously want to keep an authoritative copy of the database on some reliable media, and a proper database server is ideal for this.  But you want reads and queries to be against a local copy rather than each being sent to the server to be answered.  This even allows you to maintain views of the data in RAM and invalidate or recalculate them as needed.
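For instance, a derived view can be cached alongside the rows and recalculated lazily; a sketch (the class and fields are hypothetical):

```python
class PriceView:
    """A derived view over the in-RAM products dict, recalculated lazily."""

    def __init__(self, products):
        self.products = products
        self._by_price = None              # the cached view

    def invalidate(self):
        # Call this whenever a product row is written.
        self._by_price = None

    def cheapest(self, n):
        if self._by_price is None:         # recalculate on demand
            self._by_price = sorted(self.products.values(),
                                    key=lambda p: p["price"])
        return self._by_price[:n]
```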

Sadly I am unaware of any good frameworks for this (although, with some pain and lots of gotchas, SQLAlchemy does it somewhat, if not transparently).  It should be enough to say 'store the entire contents of this table in RAM' and have the ORM go away and do that.

I have made custom frameworks that do this.  My own frameworks have been asynchronous (as in you can watch a row and get a callback when it's committed or superseded, and see its previous versions and such) and mixed in with the async batching I described previously.  But it's a generic thing and it could be synchronous.
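The rough shape of the watch-a-row idea (a hypothetical sketch, not my actual framework):

```python
class WatchedTable:
    """An observable in-RAM table: watch a row and get a callback,
    with its previous version, whenever a new version is committed."""

    def __init__(self):
        self.rows = {}        # pk -> current row
        self.watchers = {}    # pk -> list of callbacks

    def watch(self, pk, callback):
        self.watchers.setdefault(pk, []).append(callback)

    def commit(self, pk, new_row):
        old_row = self.rows.get(pk)
        self.rows[pk] = new_row
        for callback in self.watchers.get(pk, []):
            callback(old_row, new_row)   # previous version, new version
```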

(Python is sorely lacking an ordered collection; I use one of the many native tree eggs out there, but this really ought to be part of the language runtime.)
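In the meantime the stdlib bisect module can stand in for small tables - inserts are O(n), but lookups and range scans are fast, which suits read-heavy data:

```python
import bisect

class SortedIndex:
    """A poor man's ordered collection of (key, pk) pairs."""

    def __init__(self):
        self.pairs = []

    def insert(self, key, pk):
        bisect.insort(self.pairs, (key, pk))   # O(n), fine for rare writes

    def range(self, lo, hi):
        # All (key, pk) pairs with lo <= key < hi, half-open like a slice.
        start = bisect.bisect_left(self.pairs, (lo,))
        stop = bisect.bisect_left(self.pairs, (hi,))
        return self.pairs[start:stop]
```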

An issue becomes having several worker processes serving the same database.  The first approach to try is: don't.  The first rule of distribution is don't.  Cogs bad, remember?  Sometimes though (and sooner if, like me, you use a lot of Python) you have to have multiple workers to sate the user demand to get at your data!

In these circumstances, you normally use the database to serialize access - meaning the workers can only scale sideways as far as the database can keep up.

When you have the tables in RAM in each process instead, you have to start trying to keep them in sync.  At this point having a framework do this rather than having to roll your own is a big plus!  But it's possible:

If eventual consistency is acceptable you can use a message queue to do this - PostgreSQL 9 even has a pub-sub system built in (LISTEN/NOTIFY) - or you can use an external message queue.
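A sketch of the PostgreSQL route, assuming psycopg2 and writers that NOTIFY a table_changes channel with the changed primary key as the payload (channel, table and columns are all hypothetical):

```python
import select

import psycopg2
import psycopg2.extensions

conn = psycopg2.connect("dbname=app")          # connection details are hypothetical
conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)
cur = conn.cursor()
cur.execute("LISTEN table_changes;")

products = {}  # the in-RAM copy: pk -> row

def refresh_row(pk):
    # Re-read the authoritative row from the real database.
    cur.execute("SELECT id, name, price FROM products WHERE id = %s", (pk,))
    row = cur.fetchone()
    if row is None:
        products.pop(pk, None)                 # row was deleted
    else:
        products[row[0]] = {"name": row[1], "price": row[2]}

while True:
    # Block until the server has a notification for us (60s timeout).
    if select.select([conn], [], [], 60) == ([], [], []):
        continue
    conn.poll()
    while conn.notifies:
        notify = conn.notifies.pop(0)
        refresh_row(int(notify.payload))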

My mind races ahead and imagines the web workers subscribing to the normal cluster-replication log stream from the real database.  Or some Paxos service.  The key thing is that even though writes stay slow, reads are fast, because reads never go out-of-process.

Just caching things in memcached is much slower than keeping an authoritative copy of them in-process.  Memcached comes into play when your dataset is much larger than the RAM of a single web-worker, and that's just not true in the use-case I am describing here.

Doing complex queries by the intersection of Python sets is pure pleasure and blazingly fast :)  If you are using a faster runtime, blazingly faster ;)
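For example, with secondary indexes kept as plain dicts of sets (hypothetical data), a multi-field query is one expression, entirely in-process:

```python
# Secondary indexes: field value -> set of primary keys.
by_category = {"tools": {1, 3, 7}, "toys": {2, 5}}
by_colour   = {"red": {1, 2, 5}, "blue": {3, 7}}
in_stock    = {1, 2, 3, 5}

# "red toys that are in stock":
matches = by_category["toys"] & by_colour["red"] & in_stock
print(matches)   # {2, 5}
```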
