Gas/Security-Bit for PQ Signatures on EVM: Dataset + Methodology

Gas per Secure Bit: a normalized benchmark for PQ signatures on EVM

Happy holidays everyone.

Following up on the AA / ERC-4337 / PQ signatures discussion in this thread:

I ended up isolating one missing piece that keeps coming up implicitly:

We don’t have a normalized unit to compare different signature schemes at different security levels on EVM.

Most comparisons use “gas per verify”, but that silently mixes:

  • different security targets (e.g., ~128-bit ECDSA vs Cat3/Cat5 PQ schemes),

  • different verification surfaces (EOA vs ERC-1271 / AA),

  • and sometimes different benchmark scopes (pure verify vs full handleOps pipelines).

That makes it hard to answer basic engineering questions like:
“Is ML-DSA-65 viable on EVM relative to Falcon, under explicit assumptions?”


What I built

A small benchmark lab + dataset with explicit provenance and explicit security denominators:

Repo: https://github.com/pipavlo82/gas-per-secure-bit (gas per secure bit benchmarking for PQ signatures and VRF)

Core idea:

gas_per_secure_bit = gas_verify / security_bits

I intentionally report two denominators, because both viewpoints are useful:

Metric A — Baseline normalization (128-bit baseline)

This answers: “What is the cost per 128-bit baseline unit?”

gas_per_128b = gas_verify / 128

This is not claiming every scheme is 128-bit secure; it’s just a budgeting/normalization tool.
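
For concreteness, a minimal sketch of Metric A (the gas figures are the single-run snapshots quoted later in this post):

```python
# Metric A: normalize raw verification gas against a fixed 128-bit baseline.
BASELINE_BITS = 128

def gas_per_128b(gas_verify: int) -> float:
    # A budgeting/normalization tool, not a claim that the scheme is 128-bit secure.
    return gas_verify / BASELINE_BITS

# Single-run snapshots quoted later in this post:
print(gas_per_128b(21_126))     # ECDSA ecrecover          -> ~165
print(gas_per_128b(1_499_354))  # ML-DSA-65 PreA hot path  -> ~11,714
```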

Metric B — Security-equivalent bits (declared convention)

This answers: “How costly is each ‘security bit’ under a declared normalization convention?”

gas_per_sec_equiv_bit = gas_verify / security_equiv_bits

For signatures I currently use the following explicit convention:

| Scheme | NIST category (where applicable) | security_equiv_bits |
|---|---|---|
| ECDSA (secp256k1) | n/a | 128 |
| ML-DSA-65 (FIPS-204) | 3 | 192 |
| Falcon-1024 | 5 | 256 |

I use a simple mapping Cat{1,3,5} → {128,192,256} as a declared normalization convention (open to better community conventions).

Note: security_equiv_bits is a declared normalization convention for comparability. It is not a security proof and not a NIST-provided “bits” value.
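
A minimal sketch of Metric B under that declared convention (the Cat→bits mapping below is exactly the normalization convention above, nothing more):

```python
# Declared normalization convention: NIST category -> security_equiv_bits.
# A comparability convention, not a security proof.
CATEGORY_TO_BITS = {1: 128, 3: 192, 5: 256}

def gas_per_sec_equiv_bit(gas_verify: int, nist_category: int) -> float:
    return gas_verify / CATEGORY_TO_BITS[nist_category]

# Single-run snapshots quoted later in this post:
print(gas_per_sec_equiv_bit(1_499_354, 3))   # ML-DSA-65 PreA     -> ~7,809
print(gas_per_sec_equiv_bit(10_336_055, 5))  # Falcon full verify -> ~40,375
```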

Category sources:


Provenance & reproducibility

All numbers are currently single-run gas snapshots (no averaging) with full provenance:
repo, commit, bench_name, chain_profile, and a notes field.

No hidden averaging, no “best-of-N” selection — just reproducible snapshots others can rerun.
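
To make the provenance fields concrete, here is an illustrative sketch of what one snapshot record carries conceptually; the field names are my assumptions, and the authoritative schema is whatever data/results.csv in the repo actually contains.

```python
from dataclasses import dataclass

@dataclass
class GasSnapshot:
    # Illustrative record shape only; the real column names live in data/results.csv.
    repo: str           # repo of the benchmarked implementation
    commit: str         # exact commit the gas number was measured at
    bench_name: str     # which verification surface was measured
    chain_profile: str  # EVM/chain configuration assumed by the run
    gas: int            # single-run gas snapshot (no averaging, no best-of-N)
    notes: str          # free-form caveats (e.g. "isolated hot path")

row = GasSnapshot(
    repo="pipavlo82/gas-per-secure-bit",
    commit="<commit-hash>",           # placeholder
    bench_name="falcon_full_verify",  # hypothetical bench name
    chain_profile="<chain-profile>",  # placeholder
    gas=10_336_055,
    notes="PQ full verify",
)
```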


A few key rows (baseline normalization — divide by 128)

| Scheme / bench | Gas | gas_per_128b | Notes |
|---|---|---|---|
| ECDSA ecrecover | 21,126 | 165 | classical baseline; not PQ-secure (Shor) |
| Falcon getUserOpHash | 218,333 | 1,705 | small AA primitive |
| ML-DSA-65 PreA (isolated hot path) | 1,499,354 | 11,714 | optimized compute core |
| Falcon full verify | 10,336,055 | 80,751 | PQ full verify |
| ML-DSA-65 verify POC | 68,901,612 | 538,294 | end-to-end POC |

Security-equivalent normalization (divide by security_equiv_bits)

| Scheme / bench | Gas | security_equiv_bits | gas_per_sec_equiv_bit |
|---|---|---|---|
| Falcon getUserOpHash | 218,333 | 256 | 853 |
| ML-DSA-65 PreA | 1,499,354 | 192 | 7,809 |
| Falcon full verify | 10,336,055 | 256 | 40,375 |
| ML-DSA-65 verify POC | 68,901,612 | 192 | 358,863 |
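
One thing worth noticing before the takeaways: the same raw gas number reads very differently under the two denominators, which is exactly why both are reported. A tiny check using the Falcon full-verify row:

```python
falcon_full_verify_gas = 10_336_055

print(falcon_full_verify_gas / 128)  # Metric A (128-bit baseline)  -> ~80,751
print(falcon_full_verify_gas / 256)  # Metric B (Cat 5 convention)  -> ~40,375
```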

What stood out to me:

  • ML-DSA-65 PreA lands at ~7,809 gas / security-equivalent bit (Cat3-equivalent)

  • Falcon-1024 full verify lands at ~40,375 gas / security-equivalent bit (Cat5-equivalent)

That’s roughly a 5.2× difference for those specific benches.

This is not “ML-DSA beats Falcon overall”; it’s a narrower claim:
some ML-DSA verification surfaces can be made much more EVM-friendly if you avoid recomputing heavy public structure on-chain.


What “PreA” means (why it changes the picture)

In standard ML-DSA verification, a large portion of the cost is effectively:
ExpandA + converting the public matrix into the NTT domain.

The “PreA” path isolates the hot arithmetic core (A·z − c·t₁ in the NTT domain) by accepting A_ntt precomputed, and binding it with CommitA to prevent matrix substitution.

In my harness, A_ntt is derived off-chain from the public-key seed (rho), and CommitA then binds the verifier to exactly that matrix (a conceptual sketch of the binding follows after the breakdown below).

This is an explicit engineering design point (especially in AA contexts): move large public structure off-chain, but keep it cryptographically bound.

Rough breakdown (current harness):

  • Full compute_w with on-chain ExpandA+NTT(A): ~64.8M gas

  • Isolated matrix multiply core (PreA): ~1.5M gas
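
To make the binding step concrete, here is a minimal conceptual sketch (not the harness code; the hash and encoding are illustrative stand-ins, and an on-chain verifier would use keccak256 over an agreed encoding of A_ntt):

```python
import hashlib

def commit_a(a_ntt_bytes: bytes, rho: bytes) -> bytes:
    # Illustrative binding: commit to the precomputed NTT-domain matrix together with
    # the public-key seed rho, so a verifier that receives A_ntt off-chain can check
    # that it matches the key it claims to belong to.
    # (stdlib sha3_256 is a stand-in here; the EVM would use keccak256.)
    return hashlib.sha3_256(rho + a_ntt_bytes).digest()

def verify_with_pre_a(a_ntt_bytes: bytes, rho: bytes, expected_commit: bytes) -> bool:
    # Step 1: cheap binding check, which is what prevents matrix substitution.
    if commit_a(a_ntt_bytes, rho) != expected_commit:
        return False
    # Step 2: run only the hot arithmetic core (A·z − c·t1 in the NTT domain)
    # against the now-authenticated A_ntt (omitted here).
    return True
```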

Implementation:


Why this matters for AA / ERC-7913

In AA, the unit you care about is rarely “verify one signature in isolation”.
You care about stable ABI surfaces and comparability across candidates.

ERC-7913 provides a generic verification interface.

My working assumption: if we want PQ adoption to be engineered (not guessed), we need:

  • a shared benchmark schema,

  • explicit security denominators,

  • and comparable surfaces (pure verify vs AA pipeline).


Open questions / feedback welcome

1) Hash/XOF wiring on EVM
For EVM implementations: do we want (a) strict FIPS SHAKE wiring, (b) Keccak-based non-conformant variants, or (c) dual-mode implementations with explicit labeling in the dataset?

2) Is the dual-metric approach reasonable?
Baseline normalization is useful for budgeting; security-equivalent bits are useful for honest efficiency per security unit. Any objections to reporting both?

3) PreA standardization options
What’s the least-bad approach in AA context?

  • calldata (large, but stateless),

  • storage per key,

  • precompile,

  • hybrid with CommitA binding?


Reproducibility quick start

git clone https://github.com/pipavlo82/gas-per-secure-bit
cd gas-per-secure-bit

RESET_DATA=0 MLDSA_REF="feature/mldsa-ntt-opt-phase12-erc7913-packedA" \
  ./scripts/run_vendor_mldsa.sh

RESET_DATA=0 ./scripts/run_ecdsa.sh

QA_REF=main RESET_DATA=0 ./scripts/run_vendor_quantumaccount.sh

tail -n 20 data/results.csv
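
After the runs, something like the sketch below can turn the snapshots into both metrics; the column names (bench_name, gas, nist_category) are assumptions about data/results.csv, so adjust them to the actual header.

```python
import csv

CATEGORY_TO_BITS = {1: 128, 3: 192, 5: 256}

with open("data/results.csv", newline="") as f:
    for row in csv.DictReader(f):
        # Column names here are assumptions; check the real CSV header.
        gas = int(row["gas"])
        cat = int(row["nist_category"]) if row.get("nist_category") else None
        gas_per_128b = gas / 128
        gas_per_bit = gas / CATEGORY_TO_BITS[cat] if cat else None
        print(row["bench_name"], gas_per_128b, gas_per_bit)
```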


Thanks for reading — I’m very open to corrections on conventions, better threat-model framing, and suggestions on which schemes/surfaces to add next.

