
lsmtree

Go · Storage Engines · Systems · Concurrency · Data Structures

A log-structured merge tree key-value storage engine written in Go

I came across log-structured merge trees while studying how databases handle writes efficiently — turning random disk writes into sequential ones by buffering in memory and flushing sorted files. After reading more about the internals from various sources, I wanted to see what it actually takes to build one, so I started implementing one in Go, working through the design decisions one component at a time.

It follows the same core design behind LevelDB and RocksDB: writes go to a sorted in-memory structure first, get durably logged through a write-ahead log, and eventually flush to sorted files on disk.

The write path

Every write is first appended to a write-ahead log for durability, then inserted into a skip list that serves as the memtable. Individual puts cost about 1.8μs each, most of which is the write() syscall to the WAL.

The bottleneck became obvious pretty quickly — one syscall per key doesn't scale. So I added a WriteBatch API that encodes multiple entries into a single buffer and writes them in one shot. That brought sustained throughput to around 440K entries/sec, roughly 19x faster than individual puts.
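The batching idea is simple to sketch: accumulate framed entries in one buffer, then hand the whole buffer to the WAL in a single write. Everything below is illustrative (the `countingWriter` exists only to make the syscall count visible), not the project's actual `WriteBatch`.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// WriteBatch sketch: each entry is framed as [keyLen][valLen][key][val]
// and appended to one buffer, so the batch costs one write, not N.
type WriteBatch struct {
	buf   []byte
	count int
}

func (b *WriteBatch) Put(key, val []byte) {
	var hdr [8]byte
	binary.LittleEndian.PutUint32(hdr[0:4], uint32(len(key)))
	binary.LittleEndian.PutUint32(hdr[4:8], uint32(len(val)))
	b.buf = append(b.buf, hdr[:]...)
	b.buf = append(b.buf, key...)
	b.buf = append(b.buf, val...)
	b.count++
}

// countingWriter stands in for the WAL file, counting Write calls.
type countingWriter struct{ writes int }

func (w *countingWriter) Write(p []byte) (int, error) {
	w.writes++
	return len(p), nil
}

func main() {
	b := &WriteBatch{}
	for i := 0; i < 100; i++ {
		b.Put([]byte(fmt.Sprintf("key-%03d", i)), []byte("value"))
	}
	wal := &countingWriter{}
	wal.Write(b.buf) // 100 entries, one write() instead of 100
	fmt.Println(b.count, wal.writes) // → 100 1
}
```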

Flushing the memtable to disk was the other place where writes could stall. I made the flush non-blocking — when the memtable hits its size threshold, the write goroutine freezes it, rotates the WAL, and signals a background goroutine through a channel. The actual SSTable I/O happens entirely off the write path. That dropped per-write latency from 2.6ms to 40μs.

The read path

Reads check the active memtable first, then the immutable memtable if a flush is in progress, then walk SSTables from newest to oldest. Memtable lookups are fast — 10.7M/sec with zero heap allocations since it's just pointer chasing through the skip list.
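The lookup order reads naturally as code. This is a minimal sketch with maps standing in for the real structures; the one detail that matters is walking SSTables newest-first so fresher versions shadow older ones.

```go
package main

import "fmt"

// table stands in for a memtable or SSTable (illustrative only).
type table map[string]string

type db struct {
	active    table
	immutable table   // non-nil only while a flush is in progress
	sstables  []table // oldest first, in flush order
}

func (d *db) Get(key string) (string, bool) {
	if v, ok := d.active[key]; ok {
		return v, true
	}
	if d.immutable != nil {
		if v, ok := d.immutable[key]; ok {
			return v, true
		}
	}
	// Newest SSTable first, so fresher versions shadow older ones.
	for i := len(d.sstables) - 1; i >= 0; i-- {
		if v, ok := d.sstables[i][key]; ok {
			return v, true
		}
	}
	return "", false
}

func main() {
	d := &db{
		active:   table{"a": "newest"},
		sstables: []table{{"a": "oldest", "b": "old"}, {"b": "newer"}},
	}
	v, _ := d.Get("a")
	w, _ := d.Get("b")
	fmt.Println(v, w) // → newest newer
}
```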

The interesting part was bloom filters. Each SSTable has one, and a negative check costs almost nothing (7.6M/sec, zero allocations) because the data blocks are never touched for keys that aren't there. SSTable reads that actually hit disk are slower at ~347K/sec due to block decoding overhead.
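For reference, a minimal bloom filter looks something like this — a bit array probed at k positions derived from one hash via double hashing. This is a generic sketch, not the project's implementation.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bloom is a minimal Bloom filter: k probe positions derived from a
// single 64-bit FNV hash split into two 32-bit halves.
type bloom struct {
	bits []uint64
	k    uint32
}

func newBloom(nbits int, k uint32) *bloom {
	return &bloom{bits: make([]uint64, (nbits+63)/64), k: k}
}

func (b *bloom) hashes(key []byte) (uint32, uint32) {
	h := fnv.New64a()
	h.Write(key)
	s := h.Sum64()
	return uint32(s), uint32(s >> 32)
}

func (b *bloom) Add(key []byte) {
	h1, h2 := b.hashes(key)
	n := uint32(len(b.bits) * 64)
	for i := uint32(0); i < b.k; i++ {
		bit := (h1 + i*h2) % n
		b.bits[bit/64] |= 1 << (bit % 64)
	}
}

func (b *bloom) MayContain(key []byte) bool {
	h1, h2 := b.hashes(key)
	n := uint32(len(b.bits) * 64)
	for i := uint32(0); i < b.k; i++ {
		bit := (h1 + i*h2) % n
		if b.bits[bit/64]&(1<<(bit%64)) == 0 {
			return false // definitely absent: skip the data blocks
		}
	}
	return true // possibly present: fall through to the block read
}

func main() {
	b := newBloom(1024, 7)
	b.Add([]byte("present"))
	// "present" is guaranteed true; "absent" is almost certainly false.
	fmt.Println(b.MayContain([]byte("present")), b.MayContain([]byte("absent")))
}
```

The key property is the asymmetry: a `false` answer is definitive, while a `true` answer only means "go check the data blocks".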

Crash recovery

On startup, the engine scans the data directory for existing WAL and SSTable files. Any WAL entries get replayed into a temporary memtable, flushed to a new SSTable, and the old WAL files are cleaned up. A crash mid-operation loses at most the writes that hadn't reached the WAL yet — a process crash loses nothing, and only an OS crash or power loss before the kernel flushes the page cache can drop entries that were written but not yet synced.
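The replay loop can be sketched as below. The record framing is the same assumed [keyLen][valLen][key][val] layout used for illustration earlier, not necessarily the project's format; the one behavior worth noting is stopping cleanly at a torn tail left by a crash mid-append.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// replay decodes length-prefixed WAL records into a fresh memtable,
// stopping at the first incomplete (torn) record.
func replay(wal []byte) map[string][]byte {
	mem := map[string][]byte{}
	for off := 0; off < len(wal); {
		if len(wal)-off < 8 {
			break // torn header from a crash mid-append
		}
		klen := int(binary.LittleEndian.Uint32(wal[off : off+4]))
		vlen := int(binary.LittleEndian.Uint32(wal[off+4 : off+8]))
		off += 8
		if len(wal)-off < klen+vlen {
			break // torn record body
		}
		k := string(wal[off : off+klen])
		v := append([]byte(nil), wal[off+klen:off+klen+vlen]...)
		mem[k] = v
		off += klen + vlen
	}
	return mem
}

// record encodes one entry in the same assumed framing.
func record(key, val string) []byte {
	buf := make([]byte, 8, 8+len(key)+len(val))
	binary.LittleEndian.PutUint32(buf[0:4], uint32(len(key)))
	binary.LittleEndian.PutUint32(buf[4:8], uint32(len(val)))
	return append(append(buf, key...), val...)
}

func main() {
	var wal []byte
	wal = append(wal, record("a", "1")...)
	wal = append(wal, record("b", "2")...)
	wal = append(wal, record("c", "3")[:5]...) // simulate a torn tail
	mem := replay(wal)
	fmt.Println(len(mem)) // → 2
}
```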

Compaction

As SSTables accumulate, a background goroutine merges them into a single sorted file using a heap-based merge iterator. One thing I had to handle carefully was in-flight iterators — old SSTable readers can't be closed immediately if a scan is still reading from them. They get moved to a stale list and cleaned up at the start of the next compaction cycle or when the database closes.
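A heap-based merge over sorted runs looks roughly like this, using `container/heap`. The types are illustrative; the detail that matters for compaction is the tie-break on equal keys — the newer run wins, and shadowed older versions are dropped from the output.

```go
package main

import (
	"container/heap"
	"fmt"
)

type kv struct{ k, v string }

// cursor tracks a position in one sorted run; gen orders runs by age.
type cursor struct {
	entries []kv
	pos     int
	gen     int // higher = newer SSTable
}

type mergeHeap []*cursor

func (h mergeHeap) Len() int { return len(h) }
func (h mergeHeap) Less(i, j int) bool {
	ki, kj := h[i].entries[h[i].pos].k, h[j].entries[h[j].pos].k
	if ki != kj {
		return ki < kj
	}
	return h[i].gen > h[j].gen // newer version of the same key first
}
func (h mergeHeap) Swap(i, j int) { h[i], h[j] = h[j], h[i] }
func (h *mergeHeap) Push(x any)   { *h = append(*h, x.(*cursor)) }
func (h *mergeHeap) Pop() any {
	old := *h
	c := old[len(old)-1]
	*h = old[:len(old)-1]
	return c
}

// merge produces one sorted run; older duplicates are discarded.
func merge(runs ...[]kv) []kv {
	h := &mergeHeap{}
	for gen, r := range runs {
		if len(r) > 0 {
			heap.Push(h, &cursor{entries: r, gen: gen})
		}
	}
	var out []kv
	last := ""
	for h.Len() > 0 {
		c := heap.Pop(h).(*cursor)
		e := c.entries[c.pos]
		if len(out) == 0 || e.k != last { // drop shadowed versions
			out = append(out, e)
			last = e.k
		}
		c.pos++
		if c.pos < len(c.entries) {
			heap.Push(h, c)
		}
	}
	return out
}

func main() {
	old := []kv{{"a", "1"}, {"c", "old"}}
	newer := []kv{{"b", "2"}, {"c", "new"}}
	fmt.Println(merge(old, newer)) // → [{a 1} {b 2} {c new}]
}
```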

Benchmarks

All numbers from an Apple M1, measured with go test -bench=. -benchmem.

| Operation                  | Throughput       | ns/op       | Allocs   |
|----------------------------|------------------|-------------|----------|
| Single put                 | 550K/sec         | 1,818       | 1        |
| Batch put (100, sustained) | 440K entries/sec | 2,277/entry | 15/entry |
| Get (memtable)             | 10.7M/sec        | 93          | 0        |
| Get (SSTable)              | 347K/sec         | 2,879       | 130      |
| Get (bloom miss)           | 7.6M/sec         | 131         | 0        |
| Scan 1000 keys             | 66K scans/sec    | 14,995      | 7        |

With 1M keys inserted, 100K random lookups take 0.645s total (6,452 ns/op). Negative lookups via bloom filter take 0.021s (213 ns/op).

Usage

As a Go library:

db, _ := lsmtree.Open("./data", lsmtree.DefaultDBOptions())
defer db.Close()
 
db.Put([]byte("key"), []byte("value"))
val, _ := db.Get([]byte("key"))
 
batch := db.NewBatch()
batch.Put([]byte("k1"), []byte("v1"))
batch.Put([]byte("k2"), []byte("v2"))
db.Write(batch)

Or through the CLI:

go install github.com/rahulgpt/lsmtree/cmd/lsmtree@latest
lsmtree ./mydb