lsmtree | Rahul Gupta

I came across log-structured merge trees while reading about how databases turn random writes into sequential ones, buffering in memory and flushing sorted files to disk. I wanted to see what it actually takes to build one, so I started implementing it in Go, working through the design decisions one component at a time.

It follows the same core design behind LevelDB and RocksDB: each write is durably logged to a write-ahead log, applied to a sorted in-memory structure, and eventually flushed to sorted files on disk.

The write path

Every write lands in two places before Put returns. It's first appended to the write-ahead log (the WAL: an append-only file, one CRC32-checksummed record per write, that exists so an unflushed memtable can be rebuilt after a crash). Then it goes into the memtable, the in-memory write buffer, which here is a skip list: a linked list with probabilistic express lanes that keep it sorted with O(log n) inserts and none of a balanced tree's rebalancing on the hot path. A single put costs about 1.8μs, almost all of it the write() syscall to the WAL.

That syscall is the bottleneck: one per key doesn't scale. So I added a WriteBatch API that encodes many entries into a single buffer and commits them with one write(). Under sustained write load that reaches around 440K entries/sec, roughly 19x faster than single puts on the same workload.

The other place writes can stall is the flush, when a full memtable has to be written out as a sorted file on disk (an SSTable). That I/O is slow, so none of it runs on the write path. When the memtable crosses its size threshold, the writer, still holding the lock, does only fast in-memory work:

it freezes the active memtable and swaps in a fresh one, so incoming writes never block;
it rotates the WAL, opening a new log for the new memtable while keeping the old log until the flush is durable;
it signals a background goroutine through a channel and returns.

The background goroutine serializes the frozen memtable into an SSTable, opens it for reading, then deletes the now-redundant WAL and drops the frozen memtable. The writer's added cost is a pointer swap and a channel send, which took per-write latency under load from 2.6ms down to 40μs.

The read path

A read walks the levels newest to oldest: the active memtable first, then the immutable memtable if a flush is in flight, then each SSTable from the most recent down. Memtable hits are the fast path at 10.7M/sec with zero heap allocations, since the lookup is just pointer chasing through the skip list.

The interesting part is what keeps the SSTable path from being slow. Each SSTable carries a bloom filter (a compact probabilistic set that answers "definitely absent" or "probably present", never with a false negative), so a lookup for a key that was never written gets rejected in 131ns at 7.6M/sec without touching a single data block. A read that does hit disk finds its block through a sparse index and decodes it, landing at ~347K/sec: the slowest path in the system, but one you only pay when the key is actually there.

Crash recovery

On startup the engine scans the data directory, opens every SSTable it finds, and replays the leftover WAL entries (the ones from a memtable that was never flushed) into a fresh memtable, which it flushes to a new SSTable before clearing the old logs. A crash loses at most the writes that never reached the WAL. Writes go through write() but not fsync by default, so they survive a process crash on the OS page cache; only a power loss or kernel panic before that cache is flushed can drop them, and switching on synchronous writes trades throughput to close the window.

Compaction

Once enough SSTables pile up (four, by default), a background goroutine compacts them. A heap-based merge iterator walks all of them in sorted order, and wherever the same key appears in more than one, the newest version wins, with a later write or a tombstone shadowing everything older. The merged result is written as a single new SSTable. The subtle part is in-flight iterators: a range scan that started before compaction is still reading the old SSTables, so their readers can't be closed the moment the merged file is ready. They move to a stale list and get closed a cycle later, at the next compaction or on shutdown, once nothing can still be pointing at them.

Benchmarks

All numbers from a 2020 MacBook Air (M1, 8 GB RAM), measured with go test -bench=. -benchmem.

Operation	Throughput	ns/op	Allocs
Single put	550K/sec	1,818	1
Batch put (100, sustained)	440K entries/sec	2,277/entry	15/entry
Get (memtable)	10.7M/sec	93	0
Get (SSTable)	347K/sec	2,879	130
Get (bloom miss)	7.6M/sec	131	0
Scan 1000 keys	66K scans/sec	14,995	7

With 1M keys inserted, 100K random lookups take 0.645s total (6,452 ns/op). Negative lookups via bloom filter take 0.021s (213 ns/op).

Usage

As a Go library:

db, _ := lsmtree.Open("./data", lsmtree.DefaultDBOptions())
defer db.Close()
 
db.Put([]byte("key"), []byte("value"))
val, _ := db.Get([]byte("key"))
 
batch := db.NewBatch()
batch.Put([]byte("k1"), []byte("v1"))
batch.Put([]byte("k2"), []byte("v2"))
db.Write(batch)

Or through the CLI:

go install github.com/rahulgpt/lsmtree/cmd/lsmtree@latest
lsmtree ./mydb