Part 8 — Benchmarks: what an agent-memory workload actually looks like¶
Series: Long-Term Memory in EvolutionDB — previous: Part 7, Multi-tenant memory.
The launch blog quoted six numbers in a small table. This article unpacks all six: where the workload comes from, how the harness runs, why an agent-memory workload looks nothing like the TPC-C shape we'd reflexively build for a database benchmark, and what we chose not to claim.
What we're trying to measure¶
A traditional database benchmark answers how fast can the engine do X under sustained load? — TPC-C, YCSB, sysbench. An agent-memory benchmark has a different question: how does the engine behave under the access pattern an actual agent produces?
That distinction matters because the access pattern is unusual. A typical agent step:
- One `MEMORY GET` with a known key (cache-hot, sub-millisecond).
- One `MEMORY SEARCH` with a fresh embedding (vector index hit, a few milliseconds).
- Zero or one `MEMORY PUT` of the conclusion (DML, MVCC, WAL).
- One `CHECKPOINT PUT` of the agent's reasoning state at the end of the step (DML, often the largest payload).
- A `NOTIFY` or two for downstream listeners.
Steps run serially on one logical thread per agent (because LLM calls are the bottleneck, not the database), but many agents run concurrently. The shape is bursty point-lookups + a few large writes per step, not the sustained scan-and-update of an OLTP benchmark.
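The per-step shape can be sketched against a stub client. The method names here (`memory_get`, `memory_search`, `memory_put`, `checkpoint_put`, `notify`) are illustrative stand-ins, not the real SDK surface:

```python
# Sketch of one agent step's access pattern, using an in-memory stub
# in place of a real EvolutionDB connection. All method names are
# illustrative assumptions, not the actual client API.
import json

class StubClient:
    """In-memory stand-in for an EvolutionDB connection."""
    def __init__(self):
        self.store, self.checkpoints, self.notifications = {}, {}, []

    def memory_get(self, key):
        return self.store.get(key)

    def memory_search(self, embedding, top_k=10):
        # Real engine: vector search; stub: just return up to top_k keys.
        return list(self.store)[-top_k:]

    def memory_put(self, key, value):
        self.store[key] = value

    def checkpoint_put(self, agent_id, state):
        self.checkpoints[agent_id] = json.dumps(state)  # often the largest payload

    def notify(self, channel, payload):
        self.notifications.append((channel, payload))

def agent_step(conn, agent_id, known_key, step_embedding, conclusion, state):
    """One logical step: bursty point lookups plus a few writes."""
    prior = conn.memory_get(known_key)                # cache-hot point lookup
    context = conn.memory_search(step_embedding)      # one vector search
    if conclusion is not None:                        # zero or one memory write
        conn.memory_put(f"{agent_id}:conclusion", conclusion)
    conn.checkpoint_put(agent_id, state)              # reasoning state, every step
    conn.notify("agent_events", {"agent": agent_id})  # downstream listeners
    return prior, context
```

The point of the sketch is the shape, not the API: one or two reads, at most two writes, one notify, then the step blocks on the LLM again.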
The harness lives at bench/run_all.py, with sub-suites in:
- `bench/latency/` — per-operation latency for the six primitive verbs
- `bench/reactive/` — push-vs-poll comparison
- `bench/temporal/` — `AS OF` queries against the same store at different ages
- `bench/longmemeval/` — open-domain accuracy on a public agent-memory benchmark dataset
- `bench/vendors/` — placeholder runners for the cross-vendor sweep (Zep, Mem0, langgraph-store-mongodb, Pinecone) deferred to v3.2
The latency table, decoded¶
The six numbers from the launch blog, with how they were measured:
| op | p99 | what it means |
|---|---|---|
| `MEMORY PUT` | ~8 ms | One `MEMORY PUT INTO ...` round trip, single Python client, single-process server, fsync at commit |
| `MEMORY GET` | ~2 ms | One PK lookup, B+ tree → tuple decode, one round trip |
| `CHECKPOINT PUT` | ~5 ms | One `CHECKPOINT PUT`, JSON payload ~1 KB, fsync at commit |
| `MEMORY SEARCH` top-10 | ~4 ms | 10k-row store, brute-force scan (the HNSW graph from Part 3 is still pending), top-10 by cosine |
| `NOTIFY` push delivery | ~0.4 ms | publish-to-receive on the same EVO connection |
| polling at 1 s interval | ~990 ms | publish at random offset within the 1-second polling window |
A few honest caveats:
It's the bundled Python client. The C SDK, the Go binding, and the Rust binding all show numbers within ±10% of these — the SDK overhead is small relative to engine time — but the table standardises on Python because that's the client most readers will reproduce with.
It's single-process. No replication, no TLS, no encryption. Each of those changes the picture: synchronous-commit replication adds network latency to every PUT (typically +1-3 ms across a LAN); TLS adds about 100 µs per round trip on first TLS use, less on session resumption; TDE adds about 2-3 µs per page (which is rounding error at this scale). The launch table represents the "barebones happy path" so the relative magnitudes are interpretable.
The numbers are p99, not average. Average is ~30% lower across the board, but quoting averages would be misleading: an agent that hits a 50 ms outlier on every fifteenth call feels slow, even if the average is fast. p99 is the right "what does this feel like" metric.
Hardware was a 2024-class developer laptop. Apple M3 Pro, NVMe, no other processes contending. Server-class hardware is faster but varies more wildly; we report what's reproducible on commodity gear.
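The p99-versus-average caveat is easy to demonstrate with synthetic latencies: one 50 ms outlier per fifteen calls barely moves the mean but completely owns the tail. A minimal sketch:

```python
# Why p99 is the "what does this feel like" metric: a heavy tail
# barely moves the average but dominates the 99th percentile.
import statistics

def percentile(samples, p):
    """Nearest-rank percentile over a list of samples."""
    ordered = sorted(samples)
    idx = min(len(ordered) - 1, int(round(p / 100 * len(ordered))) - 1)
    return ordered[max(idx, 0)]

# Fourteen fast calls (~2 ms) then one 50 ms outlier, repeated.
latencies_ms = ([2.0] * 14 + [50.0]) * 100

mean_ms = statistics.mean(latencies_ms)  # ~5.2 ms: looks harmless
p99_ms = percentile(latencies_ms, 99)    # 50 ms: what the agent feels
```

Quoting `mean_ms` would suggest a responsive store; `p99_ms` is the number the agent actually hits once per fifteen calls.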
The reactive comparison¶
The 2,900× number from the launch blog is the ratio between the last
two rows: roughly 990 ms versus 0.4 ms. (The rounded table values alone
give about 2,475×; the headline figure comes from the unrounded
measurements.) The benchmark for both lives in `bench/reactive/`. It's
the simplest of the suites:
```python
# Push case
sub = conn.subscribe("test_channel", on_event)
publisher.notify("test_channel", "hello")
# measure: publisher_send_time → on_event entry time

# Poll case (separate run, otherwise the push subscriber would also notice)
publisher.notify("test_channel", "hello")  # at a random offset in [0, 1.0 s]
while True:
    rows = conn.query("SELECT * FROM events WHERE seq > %s" % last_seq)
    if rows:
        break
    time.sleep(1.0)
# measure: publisher_send_time → loop_break_time
```
The publisher and the consumer are different processes. The clock is the wall clock on a single host (so there's no clock-skew correction to argue about). Each case runs 10,000 iterations; we report p50, p95, p99 for both.
The interesting thing isn't the headline ratio — it's that the push distribution is tight (p50 ≈ 200 µs, p99 ≈ 400 µs, both small enough to be noise from other processes) and the poll distribution is wide (p50 ≈ 500 ms by construction, p99 ≈ 990 ms). For an agent ticking through a long task, the wide distribution is the part that hurts: an unlucky run experiences worst-case poll latency on every event, and the cumulative drag is much worse than the average ratio suggests.
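The poll distribution's shape follows directly from the setup: the event lands at a uniform random offset inside the 1-second window, so observed delivery latency is uniform on [0, 1000) ms. A small simulation reproduces the quoted percentiles:

```python
# Minimal simulation of the poll case: publish at a uniform random
# offset inside the polling interval, measure time until the next poll
# tick. The distribution is uniform, so p50 lands near 500 ms and p99
# near 990 ms — the wide distribution described above.
import random

def simulate_poll_latencies(iterations=10_000, interval_ms=1000.0, seed=42):
    rng = random.Random(seed)
    # Latency = time remaining until the next poll tick.
    return sorted(interval_ms - rng.uniform(0, interval_ms)
                  for _ in range(iterations))

lat = simulate_poll_latencies()
p50 = lat[len(lat) // 2]
p99 = lat[int(len(lat) * 0.99)]
```

Nothing in the simulation depends on the engine; the wide distribution is a property of polling itself, which is why no amount of server tuning fixes it.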
The temporal benchmark¶
bench/temporal/ measures FOR SYSTEM_TIME AS OF against a store
that's been written to at known transaction IDs. The shape is:
- Populate the store with 10k rows under XID 100.
- Update half the rows under XID 200.
- Update the remaining half under XID 300.
- Run point queries `SELECT mem_value FROM ... FOR SYSTEM_TIME AS OF TRANSACTION N WHERE pk = '...'` for `N` = 100, 200, 300.
The pass condition is three-fold: every query at N=100 returns the
original value, every query at N=300 returns the latest value, and
the median latency for the historical query is within 50% of the
latency for the live query. The 50% margin is generous; in practice
the difference is closer to 5%, because the visibility predicate
runs the same number of comparisons in both cases.
This benchmark is the empirical evidence behind the claim from Part 4 that temporal queries cost nothing extra at runtime. It's also the test that would fail interestingly if we accidentally stopped using the CSN cache: the live query would stay fast, the historical query would slow down, and the ratio test would catch it.
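The visibility rule the benchmark exercises can be modelled in a few lines. This is a toy version-chain model, not the engine's actual tuple format: each row keeps versions tagged with the creating XID, and `AS OF TRANSACTION N` returns the newest version whose XID is at or below N. The live query is just `AS OF` "now", so both paths run the same comparisons:

```python
# Toy MVCC version-chain model of the AS OF visibility rule.
# (Illustrative only; the real engine's tuple layout differs.)

def as_of(versions, n):
    """versions: list of (xid, value) in insertion order.
    Returns the newest value created at or before transaction n."""
    visible = [value for xid, value in versions if xid <= n]
    return visible[-1] if visible else None

# Mirror the benchmark shape: insert at XID 100, update at XID 200.
chain = [(100, "original"), (200, "updated")]
```

Both the historical and the live query walk the same chain and evaluate the same `xid <= n` predicate, which is the intuition behind the near-1 latency ratio.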
LongMemEval¶
bench/longmemeval/ runs the public LongMemEval dataset — five
question categories, each with a set of ground-truth Q&A pairs over
a multi-session conversational history. The harness:
- Ingests the conversation history into a `MEMORY STORE` (with a placeholder embedding pipeline; v3.0 ships with a lexical-fallback scorer because we didn't want to bake in a specific embedding provider).
- Runs each question through `MEMORY SEARCH`, takes the top-3 hits.
- Asks an LLM to answer the question conditioned only on those hits.
- Compares the answer to the ground truth.
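For readers wondering what a lexical-fallback scorer can look like, here is a minimal sketch: rank stored memories by token overlap with the question and keep the top-k. (This is an illustration of the idea, not the actual v3.0 scorer.)

```python
# Minimal lexical-fallback retrieval: score each memory by bag-of-words
# overlap with the question, return the top-k non-zero matches.
# Illustrative sketch only; the shipped scorer may differ.
import re
from collections import Counter

def tokens(text):
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def lexical_top_k(question, memories, k=3):
    q = tokens(question)
    # (q & tokens(m)) keeps the element-wise minimum of the two counters,
    # so the score is the number of overlapping token occurrences.
    scored = [(sum((q & tokens(m)).values()), m) for m in memories]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [m for score, m in scored[:k] if score > 0]
```

A lexical scorer like this is what makes the recall-1.0 caveat below possible: against a tractable corpus with no approximate index, retrieval quality is bounded by scoring, not by search.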
The current scoreboard:
| category | recall@10 | answer accuracy |
|---|---|---|
| single-session-user | 1.0 | 0.78 |
| single-session-assistant | 1.0 | 0.74 |
| multi-session | 1.0 | 0.65 |
| temporal | 1.0 | 0.61 |
| knowledge-update | 1.0 | 0.69 |
Two notes:
The recall is 1.0 everywhere because the v3.0 retrieval is
brute-force against a tractable corpus; once the HNSW graph
implementation lands (Part 3), the recall numbers become real and
will sit somewhere below 1.0 by design.
The cross-vendor comparison (us vs Zep, Mem0, langgraph-store-mongodb, Pinecone) was deferred to v3.2 because each vendor needs its own Docker image plus a stable embedding pipeline, and we didn't want to ship questionable comparison numbers in the launch. The harness is ready; the comparison run is bookkeeping.
What we deliberately don't claim¶
A few benchmarks we could run but won't, because they'd mislead:
Throughput at maximum concurrency. We can drive the server hard
enough to hit g_parse_lock contention and produce a flattering
"100k QPS sustained" number. But at that load every individual query
is slower than the table above suggests, and an agent stepping
serially never sees the throughput case anyway. We benchmark p99
latency at "natural" concurrency (a couple hundred threads, each
ticking through agent steps) instead.
Single-vector-search QPS. "Vector DB" benchmarks tend to lead
with this number. We could match it, but the only way agents access
vector search is bundled with the surrounding MEMORY GET / PUT /
NOTIFY pattern. A lone vector search QPS is a measurement of an
operation that doesn't happen in production.
Synthetic-skew workloads. Skewed key distributions are the classic way to stress test a B+ tree. Our workload is naturally skewed (recent memories dominate access frequency), and we benchmark on that distribution rather than synthesising a worse one. The worst-case is interesting; we'd rather measure the realistic case and improve the worst-case opportunistically.
Reproducing it¶
The whole harness is two commands:

```bash
docker compose up -d
python3 bench/run_all.py --suite all
```
The suite runs in about 8 minutes on the laptop class above; it
prints the per-op p50/p95/p99 table, the reactive comparison, the
temporal latency ratio, and the LongMemEval scoreboard. The output is
stored in bench/results/<timestamp>.json, which is the source for
the launch blog table.
There's also a GitHub Actions workflow that runs the full sweep
weekly on ubuntu-latest runners; the trend graphs (per-op p99 over
time) are checked into bench/trends/ and reviewed monthly. Drift
detection is the boring half of benchmarking; without it, every
optimisation eventually slips back over a long enough horizon.
Closing thoughts on the series¶
We started this series wanting to know how a long-term memory
backend was actually built. Eight articles later, the answer is mostly
"it isn't." Almost every component on the way down was already in
the database for non-memory reasons — the storage engine, MVCC, WAL,
RLS, replication, TDE, the planner, the buffer pool, the catalog. The
new code is concentrated in the SQL surface (Memory.c,
Checkpoint.c, the vector type and HNSW), the C client library, and
the framework adapters. Everything underneath was already a database.
That's the part of the story that's easiest to lose in marketing copy: the "agent-memory backend" framing is a positioning choice; the underlying claim is that an honest SQL database with the right surface on top is the same thing as the agent-memory backend you'd otherwise have to glue together. The benchmarks are how we know that's not just a claim.
Where to go next¶
- The product article — the same story for a non-engineering audience.
- The repository — github.com/alptekin/evolutiondb. Issues and PRs welcome; the agent-memory tag has a few good first-issues if you want to add an adapter for a framework we haven't covered.
- ADR-002 — the decision document behind the agent-memory roadmap, including the alternatives we considered and rejected.
If you've read all eight articles and want to argue with one of them,
that's also welcome. The comment thread on the product piece is a
fine venue; opening an issue with [series] in the title works too.