PeerDAS - 30% acceleration for 4x less memory usage

mratsim · May 27, 2026, 3:47pm

Following a grant from the Ethereum Foundation I have recently added PeerDAS support to Constantine.

With careful engineering combined with state-of-the-art research, I managed to get a 30% acceleration over c-kzg-4844 while using 4x less memory for precomputation tables (and more memory → more acceleration), this should significantly help resource-constrained devices

Benchmark	Precompute	c-kzg-4844 (serial)	constantine (serial)	Δ%
blob_to_kzg_commitment	—	29.857 ms	19.556 ms	-34.5%
compute_kzg_proof	—	31.482 ms	20.235 ms	-35.7%
compute_blob_kzg_proof	—	31.691 ms	19.858 ms	-37.3%
verify_kzg_proof	—	0.802 ms	0.568 ms	-29.1%
verify_blob_kzg_proof	—	1.196 ms	0.955 ms	-20.2%
verify_blob_kzg_proof_batch 1	—	1.203 ms	1.044 ms	-13.2%
verify_blob_kzg_proof_batch 2	—	2.017 ms	1.608 ms	-20.3%
verify_blob_kzg_proof_batch 4	—	3.600 ms	2.760 ms	-23.3%
verify_blob_kzg_proof_batch 8	—	6.637 ms	4.967 ms	-25.2%
verify_blob_kzg_proof_batch 16	—	13.056 ms	9.205 ms	-29.5%
verify_blob_kzg_proof_batch 32	—	25.704 ms	17.765 ms	-30.9%
verify_blob_kzg_proof_batch 64	—	51.174 ms	34.736 ms	-32.1%
precompute_load (L0)	—	1163.746 ms	—	—
PeerDAS (EIP-7594)
compute_cells	—	1.932 ms	1.020 ms	-47.2%
compute_cells_and_kzg_proofs	no precomp	183.384 ms	141.489 ms	-22.8%
compute_cells_and_kzg_proofs	ckzg precomp=1, 768 KiB	608.452 ms	—	—
compute_cells_and_kzg_proofs	ckzg precomp=2, 1536 KiB	315.771 ms	—	—
compute_cells_and_kzg_proofs	ckzg precomp=3, 3 MiB	226.319 ms	—	—
compute_cells_and_kzg_proofs	ckzg precomp=4, 6 MiB	182.381 ms	—	—
compute_cells_and_kzg_proofs	ckzg precomp=5, 12 MiB	156.495 ms	—	—
compute_cells_and_kzg_proofs	ckzg precomp=6, 24 MiB	138.634 ms	—	—
compute_cells_and_kzg_proofs	ckzg precomp=7, 48 MiB	129.472 ms	—	—
compute_cells_and_kzg_proofs	ckzg precomp=8, 96 MiB	120.608 ms	—	—
compute_cells_and_kzg_proofs	ctt t=64, b=6, 32.2 MiB	—	115.359 ms	—
compute_cells_and_kzg_proofs	ctt t=64, b=8, 96.0 MiB	—	105.269 ms	—
compute_cells_and_kzg_proofs	ctt t=64, b=10, 312.0 MiB	—	95.802 ms	—
compute_cells_and_kzg_proofs	ctt t=64, b=12, 1056.0 MiB	—	87.591 ms	—
compute_cells_and_kzg_proofs	ctt t=128, b=6, 16.5 MiB	—	114.424 ms	—
compute_cells_and_kzg_proofs	ctt t=128, b=8, 48.0 MiB	—	101.874 ms	—
compute_cells_and_kzg_proofs	ctt t=128, b=10, 156.0 MiB	—	95.423 ms	—
compute_cells_and_kzg_proofs	ctt t=128, b=12, 528.0 MiB	—	88.957 ms	—
compute_cells_and_kzg_proofs	ctt t=256, b=6, 8.2 MiB	—	117.055 ms	—
compute_cells_and_kzg_proofs	ctt t=256, b=8, 24.0 MiB	—	98.698 ms	—
compute_cells_and_kzg_proofs	ctt t=256, b=10, 84.0 MiB	—	97.307 ms	—
compute_cells_and_kzg_proofs	ctt t=256, b=12, 288.0 MiB	—	93.098 ms	—
recover_cells_and_kzg_proofs¹	see ¹	137.245 ms	97.272 ms	-29.1%
verify_cell_kzg_proof_batch²	—	439.979 ms	381.058 ms	-13.4%

Notes:

¹ Recovery: c-kzg-4844 uses precompute=8 (96 MiB); constantine uses t=256, b=8 (24 MiB)
² c-kzg-4844 verifies 8192 cells (64 blobs); constantine matches this config
Δ% shows constantine relative to c-kzg-4844 (negative = faster)
c-kzg-4844 precompute levels and constantine (t, b) configs are not directly comparable
Precompute=8 (c-kzg-4844) trades 96 MiB memory for ~34% speedup in FK20 operations

Some highlights:

For FK23 ( Fast amortized KZG proofs - Fast amortized KZG proofs), rewriting the Toeplitz matrix multiplication with an accumulator API, delaying and batching as much as possible (field inversions, inverse FFTs, scalar muls) at the end, limiting allocations and strided iteration in hot path: constantine/constantine/math/matrix/toeplitz.nim at e6bee85e8c7a89af279460e4ca03283d817d1ce9 · mratsim/constantine · GitHub (I had a perf bug that cost me 5x perf by not delaying inverse FFT and another 2x for not batching scalar multiplications [PeerDAS] Perf fix - Proof computation 9.39x acceleration by mratsim · Pull Request #616 · mratsim/constantine · GitHub)
And introducing precomputed MSMs like c-kzg-4844 but instead of the approach from BLST: blst/src/ec_mult.h at e7f90de551e8df682f3cc99067d204d8b90d27ad · supranational/blst · GitHub / c-kzg-4844/README.md at 9f4bcc83cbb17b3dbc3432de7320790968143ab9 · ethereum/c-kzg-4844 · GitHub
```
static void ptype##_precompute_w##SZ(ptype row[], const ptype *point) \
{ \
    size_t i, j; \
                                      /* row[-1] is implicit infinity */\
    vec_copy(&row[0], point, sizeof(ptype));        /* row[0]=p*1     */\
    ptype##_double(&row[1],  point);                /* row[1]=p*(1+1) */\
    for (i = 2, j = 1; i < 1<<(SZ-1); i += 2, j++) \
        ptype##_add(&row[i], &row[j], &row[j-1]),   /* row[2]=p*(2+1) */\
        ptype##_double(&row[i+1], &row[j]);         /* row[3]=p*(2+2) */\
}                                                   /* row[4] ...     */\
```
The precomputation table follows the techniques described by Gottfried Herold (Notes on MSMs with Precomputation - HackMD) and Ignacio Hagopian ( Verkle Trees - Another iteration of VKTs MSM - HackMD ) → constantine/constantine/math/elliptic/ec_multi_scalar_mul_precomp.nim at e6bee85e8c7a89af279460e4ca03283d817d1ce9 · mratsim/constantine · GitHub to get an incredible 4x reduction in memory while keeping and even improving over the 22.% advantage Constantine had without precompute tables

On multithreading

Contrary to EIP-4844, I haven’t parallelized PeerDAS yet but Constantine parallel backend is highly tuned (see vs go-kzg-4844 here Releasing Constantine v0.2.0 (Jan 2025), a modular cryptography stack for Ethereum 22% speedup) and can do nested parallelism with low overhead, in fact my MSM has 3 level of parallelism and the bottleneck of PeerDAS is an embarassingly parallel for loop over 128 precomputed MSMs: constantine/constantine/math/matrix/toeplitz.nim at e6bee85e8c7a89af279460e4ca03283d817d1ce9 · mratsim/constantine · GitHub

proc finish*[EC, ECaff, F; N: static int](
  ctx: var ToeplitzAccumulator[EC, ECaff, F],
  output: var openArray[EC],
  polyphaseSpectrumBank: openArray[PrecomputedMSM[EC, N]]
): ToeplitzStatus {.raises: [], meter.} =
  ## Finalize using precomputed MSM tables (one per output position).
  ## For each output position `i`, extracts the `L` scalars from `coeffs`
  ## and computes `output[i]` using `polyphaseSpectrumBank[i].msm_vartime`.
  ## After all MSMs, an in-place EC IFFT is applied to `output`.
  let n = ctx.size
  if n == 0 or output.len != n or ctx.offset != ctx.L or polyphaseSpectrumBank.len != n or N != ctx.L:
    return Toeplitz_MismatchedSizes
  let scalars = cast[ptr UncheckedArray[F.getBigInt()]](ctx.scratchScalars)
  for i in 0 ..< n:
    for offset in 0 ..< ctx.L:
      scalars[offset].fromField(ctx.coeffs[i * ctx.L + offset])
    polyphaseSpectrumBank[i].msm_vartime(output[i], scalars.toOpenArray(ctx.L))
  checkReturn ec_ifft_nn(ctx.ecFftDesc, output, output)
  return Toeplitz_Success

Hence assuming a mini-PC with 8 cores Zen 4 (Ryzen 7840HS is quite popular in laptops and mini-PCs), I’m quite confidence we can reach below 15 ms for proving 64 blobs.

Related PRs:

Initial structure: https://github.com/mratsim/constantine/pull/613
C, Go, Rust API: https://github.com/mratsim/constantine/pull/617
Final perf tuning with benches: https://github.com/mratsim/constantine/pull/619