PeerDAS - 30% acceleration for 4x less memory usage

Following a grant from the Ethereum Foundation I have recently added PeerDAS support to Constantine.

With careful engineering combined with state-of-the-art research, I managed to get a 30% acceleration over c-kzg-4844 while using 4x less memory for precomputation tables (and more memory → more acceleration), this should significantly help resource-constrained devices

Benchmark Precompute c-kzg-4844 (serial) constantine (serial) Δ%
blob_to_kzg_commitment 29.857 ms 19.556 ms -34.5%
compute_kzg_proof 31.482 ms 20.235 ms -35.7%
compute_blob_kzg_proof 31.691 ms 19.858 ms -37.3%
verify_kzg_proof 0.802 ms 0.568 ms -29.1%
verify_blob_kzg_proof 1.196 ms 0.955 ms -20.2%
verify_blob_kzg_proof_batch 1 1.203 ms 1.044 ms -13.2%
verify_blob_kzg_proof_batch 2 2.017 ms 1.608 ms -20.3%
verify_blob_kzg_proof_batch 4 3.600 ms 2.760 ms -23.3%
verify_blob_kzg_proof_batch 8 6.637 ms 4.967 ms -25.2%
verify_blob_kzg_proof_batch 16 13.056 ms 9.205 ms -29.5%
verify_blob_kzg_proof_batch 32 25.704 ms 17.765 ms -30.9%
verify_blob_kzg_proof_batch 64 51.174 ms 34.736 ms -32.1%
precompute_load (L0) 1163.746 ms
PeerDAS (EIP-7594)
compute_cells 1.932 ms 1.020 ms -47.2%
compute_cells_and_kzg_proofs no precomp 183.384 ms 141.489 ms -22.8%
compute_cells_and_kzg_proofs ckzg precomp=1, 768 KiB 608.452 ms
compute_cells_and_kzg_proofs ckzg precomp=2, 1536 KiB 315.771 ms
compute_cells_and_kzg_proofs ckzg precomp=3, 3 MiB 226.319 ms
compute_cells_and_kzg_proofs ckzg precomp=4, 6 MiB 182.381 ms
compute_cells_and_kzg_proofs ckzg precomp=5, 12 MiB 156.495 ms
compute_cells_and_kzg_proofs ckzg precomp=6, 24 MiB 138.634 ms
compute_cells_and_kzg_proofs ckzg precomp=7, 48 MiB 129.472 ms
compute_cells_and_kzg_proofs ckzg precomp=8, 96 MiB 120.608 ms
compute_cells_and_kzg_proofs ctt t=64, b=6, 32.2 MiB 115.359 ms
compute_cells_and_kzg_proofs ctt t=64, b=8, 96.0 MiB 105.269 ms
compute_cells_and_kzg_proofs ctt t=64, b=10, 312.0 MiB 95.802 ms
compute_cells_and_kzg_proofs ctt t=64, b=12, 1056.0 MiB 87.591 ms
compute_cells_and_kzg_proofs ctt t=128, b=6, 16.5 MiB 114.424 ms
compute_cells_and_kzg_proofs ctt t=128, b=8, 48.0 MiB 101.874 ms
compute_cells_and_kzg_proofs ctt t=128, b=10, 156.0 MiB 95.423 ms
compute_cells_and_kzg_proofs ctt t=128, b=12, 528.0 MiB 88.957 ms
compute_cells_and_kzg_proofs ctt t=256, b=6, 8.2 MiB 117.055 ms
compute_cells_and_kzg_proofs ctt t=256, b=8, 24.0 MiB 98.698 ms
compute_cells_and_kzg_proofs ctt t=256, b=10, 84.0 MiB 97.307 ms
compute_cells_and_kzg_proofs ctt t=256, b=12, 288.0 MiB 93.098 ms
recover_cells_and_kzg_proofs¹ see ¹ 137.245 ms 97.272 ms -29.1%
verify_cell_kzg_proof_batch² 439.979 ms 381.058 ms -13.4%

Notes:

  • ¹ Recovery: c-kzg-4844 uses precompute=8 (96 MiB); constantine uses t=256, b=8 (24 MiB)
  • ² c-kzg-4844 verifies 8192 cells (64 blobs); constantine matches this config
  • Δ% shows constantine relative to c-kzg-4844 (negative = faster)
  • c-kzg-4844 precompute levels and constantine (t, b) configs are not directly comparable
  • Precompute=8 (c-kzg-4844) trades 96 MiB memory for ~34% speedup in FK20 operations

Some highlights:

On multithreading

Contrary to EIP-4844, I haven’t parallelized PeerDAS yet but Constantine parallel backend is highly tuned (see vs go-kzg-4844 here Releasing Constantine v0.2.0 (Jan 2025), a modular cryptography stack for Ethereum 22% speedup) and can do nested parallelism with low overhead, in fact my MSM has 3 level of parallelism and the bottleneck of PeerDAS is an embarassingly parallel for loop over 128 precomputed MSMs: constantine/constantine/math/matrix/toeplitz.nim at e6bee85e8c7a89af279460e4ca03283d817d1ce9 · mratsim/constantine · GitHub

proc finish*[EC, ECaff, F; N: static int](
  ctx: var ToeplitzAccumulator[EC, ECaff, F],
  output: var openArray[EC],
  polyphaseSpectrumBank: openArray[PrecomputedMSM[EC, N]]
): ToeplitzStatus {.raises: [], meter.} =
  ## Finalize using precomputed MSM tables (one per output position).
  ## For each output position `i`, extracts the `L` scalars from `coeffs`
  ## and computes `output[i]` using `polyphaseSpectrumBank[i].msm_vartime`.
  ## After all MSMs, an in-place EC IFFT is applied to `output`.
  let n = ctx.size
  if n == 0 or output.len != n or ctx.offset != ctx.L or polyphaseSpectrumBank.len != n or N != ctx.L:
    return Toeplitz_MismatchedSizes
  let scalars = cast[ptr UncheckedArray[F.getBigInt()]](ctx.scratchScalars)
  for i in 0 ..< n:
    for offset in 0 ..< ctx.L:
      scalars[offset].fromField(ctx.coeffs[i * ctx.L + offset])
    polyphaseSpectrumBank[i].msm_vartime(output[i], scalars.toOpenArray(ctx.L))
  checkReturn ec_ifft_nn(ctx.ecFftDesc, output, output)
  return Toeplitz_Success

Hence assuming a mini-PC with 8 cores Zen 4 (Ryzen 7840HS is quite popular in laptops and mini-PCs), I’m quite confidence we can reach below 15 ms for proving 64 blobs.


Related PRs:

1 Like