DML Profile: After Phase 6.2 Fast-Path RowID Carry-Through
Final follow-up to Phase 6.0 baseline (docs/dml-profile-baseline.md)
and Phase 6.1 schema-presence cache (docs/dml-profile-phase6.1.md).
Phase 6.2 carries the RowID obtained from the PK-range cursor walk
directly into the inner DML loop, eliminating a redundant
bt2_search per row. After Phase 6.1 that search was 67.6% of
DELETE loop_total.
Wall-clock (off mode, 3 consecutive runs)
| Op | Baseline (pre-Phase 5) | Phase 5 | Phase 6.1 | Phase 6.2 | PG |
|---|---|---|---|---|---|
| INSERT 10K | 1491 ms | 1541 ms | 1555 ms | 1517 ms | 1509 ms |
| DELETE 5K | 103 ms | 85 ms | 37 ms | 12 ms | 2 ms |
| UPDATE 2K | 73 ms | 65 ms | 46 ms | 33 ms | 5 ms |
| TOTAL | 1667 ms | 1691 ms | 1638 ms | ~1562 ms | ~1516 ms |
Phase 6.2 runs: DEL=12/12/13 ms, UPD=33/33/33 ms.
- DELETE 5K vs pre-Phase-5 baseline: -88% (103 → 12 ms)
- UPDATE 2K vs pre-Phase-5 baseline: -55% (73 → 33 ms)
- TOTAL vs pre-Phase-5 baseline: -6% (1667 → 1562 ms — 3% above PG)
Profile mode: DELETE 5000 rows (4-run avg)
loop_total = 10.80 ms avg 2.16 us/row
| slot | calls | total_ms | avg_us | %loop |
|---|---|---|---|---|
| [unaccounted] | - | 4.61 | - | 42.7% |
| heap_set_xmax | 5000 | 1.29 | 0.26 | 12.0% |
| fk_check | 5000 | 1.21 | 0.24 | 11.2% |
| vmap_clear | 5000 | 1.21 | 0.24 | 11.2% |
| cg_check_write | 5000 | 0.60 | 0.12 | 5.6% |
| tapi_heap_read | 5000 | 0.56 | 0.11 | 5.2% |
| tup_extract | 5000 | 0.56 | 0.11 | 5.2% |
| lock_row_acquire | 5000 | 0.53 | 0.11 | 4.9% |
| mvcc_ensure_xid | 5000 | 0.12 | 0.02 | 1.1% |
| conc_mod_log | 5000 | 0.11 | 0.02 | 1.0% |
Missing from profile: bt2_search (was 67.6% of loop_total
after 6.1), trigger_before, trigger_after, sec_idx_delete.
All the expensive helpers Phase 6.1 + 6.2 eliminated are now
completely absent — short-circuited before DML_PROF records a call.
The 42.7% [unaccounted] is no longer a measurement gap that matters
— it's everything other than the helpers we instrumented: loop
control, matches[i].key free, LIMIT/OFFSET cleanup, the deadlock
error snprintf path, stack-var setup inside the body. At 2.16 µs per
row total, trying to shave that further is premature optimization.
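The [unaccounted] row can be reproduced from the table: it is loop_total minus the sum of the instrumented slots. The following is a minimal sketch of that bookkeeping, not the actual DML_PROF implementation; the `ProfSlot` struct and the function names are hypothetical, and only the numbers are taken from the table above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-slot accumulator; the real DML_PROF layout is not
 * shown in this document. */
typedef struct {
    const char *name;
    long        calls;
    double      total_ms;
} ProfSlot;

/* The instrumented DELETE slots, values copied from the profile table. */
const ProfSlot delete_slots[] = {
    { "heap_set_xmax",    5000, 1.29 },
    { "fk_check",         5000, 1.21 },
    { "vmap_clear",       5000, 1.21 },
    { "cg_check_write",   5000, 0.60 },
    { "tapi_heap_read",   5000, 0.56 },
    { "tup_extract",      5000, 0.56 },
    { "lock_row_acquire", 5000, 0.53 },
    { "mvcc_ensure_xid",  5000, 0.12 },
    { "conc_mod_log",     5000, 0.11 },
};

/* Loop time that no instrumented slot claimed. */
double prof_unaccounted_ms(double loop_total_ms, const ProfSlot *slots, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += slots[i].total_ms;
    return loop_total_ms - sum;
}

/* Share of loop_total, as a percentage. */
double prof_pct(double part_ms, double loop_total_ms)
{
    assert(loop_total_ms > 0.0);
    return 100.0 * part_ms / loop_total_ms;
}
```

Calling `prof_unaccounted_ms(10.80, delete_slots, 9)` gives roughly 4.61 ms, and `prof_pct` of that against 10.80 ms gives roughly 42.7%, matching the first row of the table.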
Savings decomposition
| Source | Baseline | 6.1 | 6.2 | Cumulative save |
|---|---|---|---|---|
| trigger_before (no triggers) | 5.06 | 0 | 0 | 5.06 µs/row |
| trigger_after (no triggers) | 4.98 | 0 | 0 | 4.98 µs/row |
| sec_idx_delete (no indexes) | 2.86 | 0 | 0 | 2.86 µs/row |
| bt2_search (redundant) | 5.55 | 4.94 | 0 | 4.94 µs/row |
| Per-row savings | | | | ~17.84 µs/row |
Baseline was 21.83 µs/row. 21.83 − 17.84 = 3.99 µs/row is the projected
floor from the remaining work. The measured 2.16 µs/row is lower still,
because fewer DML_PROF wrappers fire when helpers are fully removed, and
the bt2_search wrapper itself is gone.
Profile mode: UPDATE 2000 rows (4-run avg)
loop_total = 29.39 ms avg 14.70 us/row (apply_update_total = 99.6%)
| slot | calls | total_ms | avg_us | %loop |
|---|---|---|---|---|
| apply_update_total | 2000 | 29.29 | 14.64 | 99.6% |
| bt2_update | 2000 | 14.41 | 7.21 | 49.0% |
| vmap_clear | 4000 | 1.30 | 0.33 | 4.4% |
| heap_insert | 2000 | 1.16 | 0.58 | 3.9% |
| heap_set_xmax | 2000 | 0.67 | 0.33 | 2.3% |
| lock_row_acquire | 2000 | 0.44 | 0.22 | 1.5% |
| tapi_heap_read | 2000 | 0.28 | 0.14 | 0.9% |
| cg_check_write | 2000 | 0.23 | 0.12 | 0.8% |
| mvcc_ensure_xid | 2000 | 0.05 | 0.02 | 0.2% |
UPDATE's remaining dominant cost is bt2_update at 49% of
loop_total (7.21 µs/row). This is structurally required: EvolutionDB
uses non-destructive MVCC UPDATE (soft-delete old + insert new tuple +
point PK tree at the new RowID). The PK tree update is on the write
path of every single UPDATE and can't be elided without breaking
visibility.
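To make the three steps concrete, here is a toy model of a non-destructive UPDATE. It is only a sketch under stated assumptions: the `RowID` field names, the `ToyTuple`/`ToyTable` types, the single `pk_rid` field standing in for the PK tree entry, and the simplified visibility rule are all hypothetical and are not EvolutionDB's actual structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { TOY_HEAP_CAP = 16 };

typedef struct { uint32_t page; uint32_t slot; } RowID;  /* assumed layout */

typedef struct {
    int      value;
    uint64_t xmin;   /* transaction that created this version */
    uint64_t xmax;   /* transaction that soft-deleted it, 0 = still live */
} ToyTuple;

typedef struct {
    ToyTuple heap[TOY_HEAP_CAP];
    uint32_t used;     /* next free heap slot */
    RowID    pk_rid;   /* stands in for the PK tree entry of a single key */
} ToyTable;

/* Append a new tuple version and return where it landed. */
RowID toy_heap_insert(ToyTable *t, int value, uint64_t xid)
{
    assert(t->used < TOY_HEAP_CAP);
    RowID rid = { 1, t->used };
    t->heap[t->used++] = (ToyTuple){ value, xid, 0 };
    return rid;
}

/* Simplified visibility rule for a snapshot taken at snap_xid. */
bool toy_visible(const ToyTuple *tup, uint64_t snap_xid)
{
    return tup->xmin <= snap_xid && (tup->xmax == 0 || tup->xmax > snap_xid);
}

/* Non-destructive UPDATE: the old version stays in the heap. */
void toy_update(ToyTable *t, int new_value, uint64_t xid)
{
    t->heap[t->pk_rid.slot].xmax = xid;              /* 1. soft-delete old tuple */
    t->pk_rid = toy_heap_insert(t, new_value, xid);  /* 2. insert new, 3. repoint PK */
}
```

In this model a snapshot older than the updating transaction still resolves the old version, while newer snapshots reach the new one through the repointed PK entry. Skipping step 3 would leave the key pointing at a tuple that newer snapshots cannot see, which is why the PK update sits on the write path of every UPDATE.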
Options to push UPDATE further (future work, not in 6.2):
- Batch PK updates: group matches by leaf page, do one bt2 descent + N slot updates per page. Complicates the cursor invalidation story.
- HOT update path: already implemented (Update.c:898-929) but only triggers when a secondary index exists AND no indexed column is being set. In the benchmark table there are no secondary indexes, so HOT is not eligible; the HOT code path is designed for index-maintenance skipping, not PK-maintenance skipping.
- In-place UPDATE when the new tuple fits in the same slot: not currently supported; would require reworking MVCC visibility for the old version (it'd have to live on the same slot but be logically dead).
None of these are justified by the current benchmark. 33 ms for UPDATE 2K is 16.5 µs/row end-to-end, within 7× of PG, although still above the original plan estimate of ~10-15 ms.
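The HOT eligibility rule stated in the options above (a secondary index exists AND no indexed column is being set) can be written as a small predicate. This is an illustrative sketch only: the bitmask representation and the function name are assumptions, and it is not the check at Update.c:898-929.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical encoding: bit i set means column i.
 * secondary_indexed_cols = columns covered by any secondary index,
 * set_cols               = columns assigned by the UPDATE's SET clause. */
bool hot_update_eligible(uint64_t secondary_indexed_cols, uint64_t set_cols)
{
    bool has_secondary_index = secondary_indexed_cols != 0;
    bool touches_indexed_col = (secondary_indexed_cols & set_cols) != 0;
    return has_secondary_index && !touches_indexed_col;
}
```

With no secondary indexes, as in the benchmark table, the first condition is false and the predicate never fires, whichever columns are updated.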
Gap vs PostgreSQL
DELETE 5K: EvoSQL 12 ms vs PG 2 ms (6x)
UPDATE 2K: EvoSQL 33 ms vs PG 5 ms (6.6x)
INSERT 10K: EvoSQL 1517 ms vs PG 1509 ms (1.005x)
TOTAL: EvoSQL 1562 ms vs PG 1516 ms (1.03x)
DELETE/UPDATE are still ~6× PG per-row, but:
- DELETE 5K's full wall-clock (12 ms) is within tolerance of the Phase 6.0
target (~10 ms).
- TOTAL is within 3% of PG.
- PG's DELETE 2 ms for 5000 rows is ~0.4 µs/row. Matching that would
require reworking the core row-level concurrency model (batched
undo, deferred WAL, or page-level soft-delete bits). Out of scope
for a per-row loop optimization phase.
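The per-row figures and ratios quoted in this section follow from two small conversions. The helper names below are hypothetical; the sketch only restates the arithmetic.

```c
#include <assert.h>

/* Convert a wall-clock total in milliseconds into microseconds per row. */
double us_per_row(double total_ms, long rows)
{
    assert(rows > 0);
    return total_ms * 1000.0 / (double)rows;
}

/* How many times larger `ours` is than `theirs` (same unit for both). */
double slowdown(double ours, double theirs)
{
    assert(theirs > 0.0);
    return ours / theirs;
}
```

For DELETE 5K, `us_per_row(12.0, 5000)` is about 2.4 µs/row against about 0.4 µs/row for PG's 2 ms, which is the 6x gap listed above.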
Regression
All 11 functional test suites green after Phase 6.2:
| Suite | Result |
|---|---|
| test_triggers.py | 17/17 |
| test_index.py | 24/24 |
| test_index_advanced.py | 18/18 |
| test_delete_where.py | 15/15 |
| test_update_where.py | 14/14 |
| test_foreign_key.py | 22/22 |
| test_mvcc.py | 10/10 |
| test_pk_range_fastpath.py | 19/19 |
| test_multi_delete.py | 10/10 |
| test_multi_update.py | 10/10 |
| test_reclaim_mvcc.py | 8/8 |
The critical suites here are test_delete_where.py and test_update_where.py,
because they cover the non-fast-path branches (complex WHERE clauses
that don't match PkRangePattern), which must fall back to
bt2_search via the {0, 0} sentinel rid. With 15 + 14 = 29/29 green,
the fallback still works.
Files changed
- evolution/db/Delete.c: matchKeys char** → DelMatch{key, rid}[] array. The fast path populates both fields from the cursor walk; the scan fallback leaves rid = {0, 0}. The inner loop uses the stored rid when non-zero, bt2_search otherwise.
- evolution/db/Update.c: same pattern with UpdMatch. Adds a RowID hint_rid parameter to ApplyUpdateToRow; a non-zero hint skips the internal bt2_search. Legacy callers (evo_update_row, legacy key-only UPDATE path) pass {0, 0}.
- docs/dml-profile-phase6.2.md: this file.
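The carry-through pattern described for Delete.c can be sketched as follows. The `DelMatch{key, rid}` shape and the `{0, 0}` sentinel come from the description above; the `RowID` field names, `toy_pk_search` (a stand-in for bt2_search, whose real signature is not shown here), and the `search_calls` counter are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef struct { uint32_t page; uint32_t slot; } RowID;  /* assumed field names */

typedef struct {
    const char *key;
    RowID       rid;   /* {0, 0} means "unknown, resolve via search" */
} DelMatch;

int search_calls;      /* counts how often the fallback lookup runs */

/* Stand-in for bt2_search: derives a fake RowID from the key. */
RowID toy_pk_search(const char *key)
{
    search_calls++;
    RowID rid = { 1, (uint32_t)strlen(key) };
    return rid;
}

bool rid_is_unset(RowID rid)
{
    return rid.page == 0 && rid.slot == 0;
}

/* Inner-loop step: reuse the RowID from the cursor walk when present,
 * otherwise fall back to a PK lookup. */
RowID resolve_rid(const DelMatch *m)
{
    return rid_is_unset(m->rid) ? toy_pk_search(m->key) : m->rid;
}
```

A match produced by the fast path resolves without touching `toy_pk_search`, while a scan-fallback match with `rid = {0, 0}` triggers exactly one lookup. One consequence of this design is that `{0, 0}` can never be a valid RowID for a stored row.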
What's next
Phase 6 is effectively done. Further per-row gains require architectural changes (batch undo, HOT-style in-place UPDATE, page-level delete bits), and the ROI is small relative to the risk. The ~46 ms TOTAL gap vs PG (1562 vs 1516 ms) is now in INSERT 10K (where EvoSQL is already competitive: 1517 vs 1509) and the structural PK-tree cost of non-destructive UPDATE.
Task 145 (Phase 6.3 per-page batching) is deleted: the post-6.2 profile
shows vmap_clear + heap_set_xmax sum to 2.5 ms (23% of the 10.8 ms
DELETE loop), and the complexity of page-grouping the
soft-delete loop is not worth 2-3 ms.