TL;DR
- We evaluate three BAL designs—full BAL, batched I/O BAL, and parallel I/O BAL—with different trade-offs between execution throughput and BAL size.
- We examine how closely the lowest-overhead design, parallel I/O BAL, can approach the throughput of full BAL.
- Parallel I/O BAL achieves ~10.8 GGas/s versus ~13.9 GGas/s for full BAL, providing 78% of the throughput with only 33% of the full BAL size.
Block-level access lists (BAL) enable parallelism, including parallel I/O and parallel execution, by explicitly encoding in the block all accounts and storage slots accessed during execution, along with their post-execution values. In our previous article, we studied the parallel execution performance of full BAL, which includes post-transaction state diffs and pre-block read keys and values. On a 16-core commodity machine, we achieved approximately 15 GGas/s of pure parallel execution throughput in a mega-block setting.
However, that study omitted two dominant constraints: I/O and BAL size overhead. In a non-prewarming scenario, I/O accounts for roughly 70% of total block processing time. Although BAL enables parallel disk reads, the effectiveness of BAL-enabled parallel I/O may depend on how much read information is embedded in the BAL itself. More detailed read hints may increase I/O parallelism, but they also inflate BAL size, directly impacting network bandwidth and storage costs. As a result, BAL admits several design variants, each representing a different trade-off between achievable parallelism and BAL size. Based on the precision of their read hints, the main designs are: full BAL, batched I/O BAL, and parallel I/O BAL.
| Name | Details | Parallel Execution | Parallel I/O | BAL Size (RLP-encoded)* |
|---|---|---|---|---|
| Full BAL | Post-transaction state diffs & pre-block read keys and values | Per-transaction | Per-hint (for verification only) | 213 KB |
| Batched I/O BAL | Post-transaction state diffs & pre-block read keys | Per-transaction | Per-hint | 110 KB |
| Parallel I/O BAL | Post-transaction state diffs | Per-transaction | Per-transaction | 71 KB (lowest; 33% of Full BAL, 64% of Batched I/O BAL) |
*Sampled from blocks #23,770,000–23,771,999
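To make these shapes concrete, the sketch below lays the three variants out as plain data structures; the type and field names are ours for illustration and do not correspond to any client’s actual encoding.

```rust
// Illustrative layouts for the three BAL variants; field names are
// hypothetical, and real clients use their own RLP-encoded formats.

type Address = [u8; 20];
type StorageKey = [u8; 32];
type StorageValue = [u8; 32];

/// Post-transaction state diff: present in all three designs, since it is
/// what makes transactions independently executable.
struct AccountDiff {
    address: Address,
    balance: Option<[u8; 32]>,
    nonce: Option<u64>,
    storage_writes: Vec<(StorageKey, StorageValue)>,
}

/// Pre-block read hint: key only (batched I/O BAL) or key plus value (full BAL).
struct ReadHint {
    address: Address,
    storage_key: StorageKey,
    value: Option<StorageValue>, // Some(..) only in full BAL, used for verification
}

/// Parallel I/O BAL: diffs only; reads happen on demand during execution.
struct ParallelIoBal {
    tx_diffs: Vec<Vec<AccountDiff>>, // one diff list per transaction
}

/// Batched I/O BAL adds read keys; full BAL additionally carries read values.
struct BatchedIoBal {
    tx_diffs: Vec<Vec<AccountDiff>>,
    read_hints: Vec<ReadHint>,
}
```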
Ideally, we would like to maximize throughput while minimizing BAL size. While full BAL delivers the highest performance, it also incurs the largest overhead. This raises a key question: to what extent can the lowest-overhead design—parallel I/O BAL—approach the throughput of full BAL? Addressing this question is the central goal of this work.
To answer it, we constructed an execution environment that explicitly includes state loading via I/O reads, with the following setup:
- A flat database for accounts, storage, and contract code, as used in Reth
- Pre-recovered transaction senders, leveraging sender recovery parallelism already implemented in most clients
- Omission of state root computation and state trie commits, whose costs can be amortized for large blocks and are not the focus of this study
Using this setup, we benchmarked per-transaction parallel execution (including parallel I/O) with different BAL designs. The results show that parallel I/O BAL still achieves ~10.8 GGas/s on a 16-core commodity machine in a mega-block setting, compared to ~13.9 GGas/s with full BAL. In other words, parallel I/O BAL delivers 78% of full BAL’s throughput with only 33% of its size, offering a practical trade-off between throughput and BAL size overhead.
The I/O Bottleneck in Ethereum Execution
Ethereum is continuing to scale L1. The Fusaka upgrade increased the gas limit from 45M to 60M, and Glamsterdam is expected to raise it further. Our previous research showed that BAL can improve execution throughput by an order of magnitude, providing a solid foundation for higher gas limits.
Despite these gains, I/O remains a major bottleneck in today’s block processing pipeline. In a non-prewarming setup, I/O accounts for roughly 70% of total execution time. Taking Reth as an example:
- Single-threaded execution with I/O (using MDBX) achieves only ~350 MGas/s
- With prewarming, I/O overhead drops to ~20%, and throughput improves to ~700 MGas/s
Although prewarming helps, substantial headroom remains. The fundamental limitation lies in sequential I/O access patterns: although modern NVMe SSDs support deep I/O queues (typically up to 64), most Ethereum clients still perform state reads sequentially and fail to fully exploit the available I/O parallelism.
BAL addresses this limitation by enabling parallel I/O, but it does so at a cost. Post-transaction state diffs are essential for parallel execution—our prior work showed they enable a 10× speedup over sequential execution. However, read values and read hints can together be comparable in size to state diffs, while the performance benefit they provide relative to this additional network and storage overhead is less clear.
This raises an important design question: if near-optimal performance can be achieved without including read values—or even read hints—BAL size could be reduced significantly, lowering network and storage costs without sacrificing throughput. To test this hypothesis, we focus on parallel I/O BAL, which includes only post-transaction state diffs and performs state reads on demand during execution.
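As a rough sketch of the access-pattern difference between the two ends of this spectrum (all names below are placeholders, not client code), hinted reads can be issued up front, whereas parallel I/O BAL relies on concurrently executing transactions to keep reads in flight:

```rust
use rayon::prelude::*;

// Placeholder types and functions; only the access pattern matters here.
struct StorageKey;
struct Tx;

fn read_slot(_key: &StorageKey) { /* one random read against the SSD */ }
fn execute_tx(_tx: &Tx) { /* lazily calls read_slot for each slot it touches */ }

// Batched I/O BAL: read hints are known up front, so every read can be
// issued before execution starts, keeping the NVMe queues full.
fn prefetch_from_hints(hints: &[StorageKey]) {
    hints.par_iter().for_each(read_slot);
}

// Parallel I/O BAL: no hints. Each transaction reads on demand, but because
// transactions execute concurrently (the state diffs remove conflicts),
// many reads are still in flight at once.
fn execute_on_demand(txs: &[Tx]) {
    txs.par_iter().for_each(execute_tx);
}
```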
Experimental Methodology
To evaluate the performance limits enabled by parallel I/O BAL, we constructed a simplified execution environment that strips out the unrelated components listed above. This allows us to measure the practical upper bound of BAL-powered parallelism.
Leveraging Reth’s high-performance execution engine and RocksDB’s multi-threaded read capabilities, we modified the Reth client to dump execution dependencies (blocks, BALs, and the last 256 block hashes), used REVM as the EVM execution engine, and introduced a RocksDB-based state provider for account, code, and storage access.
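Concretely, the state provider is little more than point lookups against RocksDB column families. The sketch below shows the general shape, with hypothetical column-family names and key encodings; the real provider sits behind REVM’s database interface and follows Reth’s table layouts.

```rust
use std::sync::Arc;
use rocksdb::{Options, DB};

/// Hypothetical flat-state provider over RocksDB column families.
/// Key encodings and column-family names here are placeholders.
pub struct RocksStateProvider {
    db: Arc<DB>, // can be shared across worker threads for parallel reads
}

impl RocksStateProvider {
    pub fn open(path: &str) -> Result<Self, rocksdb::Error> {
        let mut opts = Options::default();
        opts.create_if_missing(true);
        opts.create_missing_column_families(true);
        let db = DB::open_cf(&opts, path, ["accounts", "storage", "code"])?;
        Ok(Self { db: Arc::new(db) })
    }

    /// Raw account record for a 20-byte address (decoding elided).
    pub fn account(&self, address: &[u8; 20]) -> Result<Option<Vec<u8>>, rocksdb::Error> {
        let cf = self.db.cf_handle("accounts").expect("cf exists");
        self.db.get_cf(cf, address)
    }

    /// Raw storage value for (address, slot); the key is a simple concatenation here.
    pub fn storage(&self, address: &[u8; 20], slot: &[u8; 32]) -> Result<Option<Vec<u8>>, rocksdb::Error> {
        let cf = self.db.cf_handle("storage").expect("cf exists");
        let mut key = Vec::with_capacity(52);
        key.extend_from_slice(address);
        key.extend_from_slice(slot);
        self.db.get_cf(cf, key)
    }

    /// Contract bytecode by code hash.
    pub fn code(&self, code_hash: &[u8; 32]) -> Result<Option<Vec<u8>>, rocksdb::Error> {
        let cf = self.db.cf_handle("code").expect("cf exists");
        self.db.get_cf(cf, code_hash)
    }
}
```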
Simplifications for I/O Execution Emulation
- All transactions come with the sender already recovered (sender recovery can be fully parallelized ahead of time).
- No state root computation or trie commits are performed after execution (only flat state commits), as these costs are orthogonal to the focus of this study.
Engineering Work & Setup
- Modified the Reth client to dump full execution dependencies, including blocks, BALs, and the last 256 block hashes
- Added a RocksDB state provider for REVM to load account, code, and storage state
- Initially tested Reth’s MDBX binding, but it showed degraded performance under multi-threading; RocksDB was adopted instead, with a migration tool to convert MDBX databases to RocksDB
- For parallel I/O, a shared cache layer is used to avoid redundant reads across transactions (a minimal sketch follows this list)
- Dropped the OS page cache before each experiment
- Parallelism granularity = per-transaction
- Hardware:
- AMD Ryzen 9 5950X (16 physical cores, or 32 with hyper-threading)
- 128 GB RAM
- 7 TB RAID-0 NVMe SSD (~960k random-read IOPS for 4 KB blocks, 3.7 GB/s bandwidth)
- Dataset: 2,000 mainnet blocks (#23,770,000–23,771,999).
- Metric: gas per second = total gas used / execution time (including I/O).
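The shared cache layer mentioned in the list above can be as simple as a read-through map in front of RocksDB. A minimal standard-library sketch is shown below; the actual cache in the benchmark is more specialized, so treat this as illustrative only.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// Minimal read-through cache shared by all execution threads, so a slot
/// touched by several transactions is fetched from disk only once.
/// Keys and values are raw bytes here; the real cache is typed per table.
pub struct SharedReadCache<F>
where
    F: Fn(&[u8]) -> Option<Vec<u8>> + Sync,
{
    entries: RwLock<HashMap<Vec<u8>, Option<Vec<u8>>>>,
    load: F, // falls through to RocksDB on a miss
}

impl<F> SharedReadCache<F>
where
    F: Fn(&[u8]) -> Option<Vec<u8>> + Sync,
{
    pub fn new(load: F) -> Self {
        Self { entries: RwLock::new(HashMap::new()), load }
    }

    pub fn get(&self, key: &[u8]) -> Option<Vec<u8>> {
        // Fast path: another transaction already loaded this key.
        if let Some(v) = self.entries.read().unwrap().get(key) {
            return v.clone();
        }
        // Slow path: hit the database, then publish the result.
        // Two threads may race here and read the same key twice;
        // that is acceptable for a cache.
        let value = (self.load)(key);
        self.entries
            .write()
            .unwrap()
            .insert(key.to_vec(), value.clone());
        value
    }
}
```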
Benchmark suite available here:
GitHub - dajuguan/evm-benchmark
Results
We first evaluated Ethereum mainnet blocks under parallel I/O and parallel execution with parallel I/O BAL, varying the thread count. The results reveal a clear critical path dominated by the longest-running transactions. To mitigate this, we simulated larger block gas limits, which unlock substantially more parallelism when using BAL.
With 16 threads and a 1G-gas block, parallel I/O BAL achieves a throughput of ~10.8 GGas/s, reaching 78% of the ~13.9 GGas/s achieved by full BAL. Crucially, this performance comes with an average BAL size of only ~71 KB, a ~67% reduction compared to full BAL.
Critical Path Analysis in Parallel I/O and Parallel Execution
To evaluate the actual speedup and the effect of Amdahl’s law on transaction-level parallelism, we ran per-transaction parallel execution experiments and quantified how the longest-running transactions limit the achievable speedup.
Detailed results are shown below (where “Longest TXs Latency” is the total execution time, including I/O, of the longest-running transactions in each block):
| Threads | Throughput (MGas/s) | Longest TXs Latency | Total Time |
|---|---|---|---|
| 1 | 740 | 6.85s | 60.62s |
| 2 | 1,447 | 6.75s | 31.00s |
| 4 | 2,167 | 8.11s | 20.70s |
| 8 | 2,994 | 9.02s | 14.98s |
| 16 | 3,220 | 8.92s | 13.93s |
| 32 | 3,253 | 9.57s | 13.79s |
Overall, the results closely follow Amdahl’s law. Although throughput increases with more threads, total execution time is constrained by the longest transaction. Under 16 threads, the longest transactions account for ~75% of total execution time, limiting speedup to ~4× rather than the ideal 16×.
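To make this concrete with the numbers from the table (60.62 s single-threaded, 13.93 s with 16 threads, 8.92 s for the longest transactions):

$$
\text{observed speedup} = \frac{60.62\ \text{s}}{13.93\ \text{s}} \approx 4.4\times,
\qquad
\text{critical-path bound} = \frac{60.62\ \text{s}}{8.92\ \text{s}} \approx 6.8\times
$$

No amount of additional parallelism can push the speedup past the critical-path bound; the remaining gap is likely scheduling and other serial overhead.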
To overcome this limitation, we increased the simulated block gas limit, as described in the next section.
When the thread count exceeds the number of physical cores (e.g., 32 threads on 16 cores), performance no longer improves. While I/O itself can scale beyond the physical core count, throughput is likely limited by RocksDB cache lookups (indexes, bloom filters, data blocks) and CPU-intensive value encoding/decoding.
Mega Blocks Enable Massive Parallelism
To overcome per-block critical-path limits, we experimented with higher-gas “mega blocks,” as in our previous work, to increase parallelism. To simulate this, we executed the transactions of multiple consecutive mainnet blocks (a mega block, or batch) in parallel, and committed the state to the database only after all transactions in the batch had completed. This effectively aggregates multiple blocks into a single large execution unit.
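In sketch form, the batching loop looks roughly like the following; the types and functions (Tx, StateDiff, execute_tx, commit) are placeholders for the real pipeline, not the benchmark code itself.

```rust
use rayon::prelude::*;

// Placeholders standing in for the real execution pipeline.
struct Tx;
struct StateDiff;
fn execute_tx(_tx: &Tx) -> StateDiff { StateDiff }
fn commit(_diffs: &[StateDiff]) { /* single flat-state commit for the whole batch */ }

/// Execute the transactions of `batch_size` consecutive blocks as one
/// "mega block": run everything in parallel, commit once at the end.
fn execute_mega_blocks(blocks: &[Vec<Tx>], batch_size: usize) {
    for batch in blocks.chunks(batch_size) {
        // All transactions in the batch form one large execution unit;
        // BAL state diffs make them independently executable.
        let diffs: Vec<StateDiff> = batch
            .par_iter()
            .flat_map(|block_txs| block_txs.par_iter())
            .map(execute_tx)
            .collect();
        // State is committed to the database only after the whole batch finishes.
        commit(&diffs);
    }
}
```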
We evaluated a batch of 50 blocks, simulating an average block gas usage of 1,121 M, across different thread counts. Full results are shown below:
| Threads | Throughput (MGas/s) | Longest TXs Latency | Total Time |
|---|---|---|---|
| 1 | 943 | 0.53s | 47.55s |
| 2 | 1,857 | 0.53s | 24.16s |
| 4 | 3,505 | 0.56s | 12.80s |
| 8 | 6,524 | 0.57s | 6.88s |
| 16 | 10,842 | 0.61s | 4.13s |
| 32 | 10,794 | 1.07s | 4.14s |
With mega blocks, the longest-running transactions no longer dominate the critical path; they contribute less than 15% of total execution time under 16 threads. Throughput scales almost linearly with thread count, reaching ~10.8 GGas/s (78% of full BAL performance) while reducing BAL size by 67% relative to full BAL.
| BAL Design | RLP-Encoded BAL Size | Throughput with 16 threads |
|---|---|---|
| Full BAL | 213 KB | 13,881 MGas/s |
| Parallel I/O BAL | 71 KB (33% of 213 KB) | 10,842 Mgas/s |
Conclusion
This study demonstrates that parallel I/O BAL approaches the performance of full BAL while substantially reducing BAL size. In mega-block settings, parallel I/O BAL sustains approximately 10.8 GGas/s (~78% of full BAL throughput), while reducing BAL size overhead to about 33% of that of full BAL. This makes parallel I/O BAL a practical and efficient design choice, balancing throughput against network and storage overhead.
Overall, these results establish a practical upper bound for parallel I/O BAL-powered parallel execution and provide actionable insights for Ethereum client optimizations and future L1 scaling efforts.
Other works
In addition to execution benchmarks, we compared RocksDB and MDBX under synthetic random-read workloads and EVM execution, and examined the trade-offs between parallel I/O BAL and batched I/O BAL across different block gas limits.
MDBX vs. RocksDB Random Read Benchmark
We first benchmarked raw random-read performance for MDBX and RocksDB on the same hardware used in prior experiments, varying the number of reader threads to assess scalability. The database configuration was as follows:
| Item | Value |
|---|---|
| Key size | 16 bytes |
| Value size | 32 bytes |
| Entries | 1.6 billion |
| RocksDB size | 85 GB |
| MDBX size | 125 GB |
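The reader side of such a benchmark is essentially a tight point-lookup loop. The sketch below shows the RocksDB variant using the rocksdb crate, with key generation and timing simplified; the linked ioarena fork is the authoritative implementation.

```rust
use std::sync::Arc;
use std::time::Instant;
use rocksdb::DB;

/// Spawn `n_threads` readers issuing `reads_per_thread` point lookups each
/// and report aggregate IOPS. Keys are picked pseudo-randomly from the
/// 16-byte key space used in the benchmark.
fn random_read_bench(db: Arc<DB>, n_threads: usize, reads_per_thread: usize) -> f64 {
    let start = Instant::now();
    let handles: Vec<_> = (0..n_threads)
        .map(|t| {
            let db = Arc::clone(&db);
            std::thread::spawn(move || {
                // Cheap xorshift state, seeded per thread.
                let mut x = (t as u64).wrapping_mul(0x9E37_79B9_7F4A_7C15) | 1;
                for _ in 0..reads_per_thread {
                    x ^= x << 13;
                    x ^= x >> 7;
                    x ^= x << 17;
                    let mut key = [0u8; 16];
                    key[..8].copy_from_slice(&x.to_be_bytes());
                    let _ = db.get(key); // point lookup; result ignored
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    let total = (n_threads * reads_per_thread) as f64;
    total / start.elapsed().as_secs_f64() // aggregate IOPS
}
```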
Detailed results:
| Threads | Database | IOPS | Avg Latency (µs) | CPU Usage (%) |
|---|---|---|---|---|
| 2 | RocksDB | 12K | 160 | 1.1 |
| 2 | MDBX | 21K | 85 | 0.8 |
| 4 | RocksDB | 30K | 130 | 2.2 |
| 4 | MDBX | 48K | 84 | 1.3 |
| 8 | RocksDB | 85K | 92 | 4.5 |
| 8 | MDBX | 97K | 83 | 2.5 |
| 16 | RocksDB | 180K | 90 | 8 |
| 16 | MDBX | 180K | 86 | 6 |
| 32 | RocksDB | 320K | 110 | 24 |
| 32 | MDBX | 360K | 90 | 13 |
Both RocksDB and MDBX scale throughput nearly linearly with thread count, even beyond the 16 physical cores. Once thread counts exceed 8, the differences in IOPS and latency between the two databases become minimal.
Benchmark suite available at: GitHub - dajuguan/ioarena: Embedded storage benchmarking tool for libmdbx, rocksdb, lmdb, etc.
MDBX vs. RocksDB EVM Execution Benchmark with Parallel I/O Setup
We then evaluated EVM execution throughput with parallel I/O using MDBX and compared it against RocksDB, under a block gas usage of 1,121 M. Detailed results:
| Threads | Database | Throughput (MGas/s) |
|---|---|---|
| 8 | MDBX | 2,369 |
| 8 | RocksDB | 6,524 |
| 16 | MDBX | 3,705 |
| 16 | RocksDB | 10,842 |
| 32 | MDBX | 5,748 |
| 48 | MDBX | 6,662 |
| 64 | MDBX | 6,525 |
Despite similar raw I/O performance, execution throughput with MDBX is significantly lower. This discrepancy is likely due to the current usage of Reth’s MDBX binding, which does not fully exploit the underlying I/O parallelism. In particular, proper management of shared readers across threads could improve performance, but we have not yet found an effective approach.
Parallel I/O vs. Batched I/O across Gas Limits
The previous analysis primarily focused on parallel I/O, where state is fetched on demand during execution. However, batched I/O may offer advantages in scenarios where some transactions are highly I/O-intensive and can better exploit I/O parallelism beyond the number of physical CPU cores.
To evaluate this trade-off, we compared parallel I/O BAL and batched I/O BAL across different I/O load patterns and measured how execution throughput scales under the two BAL designs.
Average I/O Load Analysis with Mainnet data
We begin with the average-case analysis, where storage reads account for only a fraction of the instructions executed within each transaction—a setting that closely reflects typical mainnet workloads. The following table summarizes the throughput results under different BAL designs, thread counts, and block gas usage.
| I/O Type | Threads | Block Batch Size | Avg. Block Gas (M) | Throughput (MGas/s) |
|---|---|---|---|---|
| Batched | 16 | 1 | 22 | 3,587 |
| Batched | 32 | 1 | 22 | 3,333 |
| Parallel | 16 | 1 | 22 | 2,893 |
| Batched | 16 | 10 | 224 | 7,221 |
| Batched | 32 | 10 | 224 | 6,725 |
| Parallel | 16 | 10 | 224 | 6,842 |
| Batched | 16 | 50 | 1,121 | 10,159 |
| Batched | 32 | 50 | 1,121 | 10,259 |
| Parallel | 16 | 50 | 1,121 | 10,842 |
| Batched | 16 | 100 | 2,243 | 11,129 |
| Batched | 32 | 100 | 2,243 | 11,266 |
| Parallel | 16 | 100 | 2,243 | 11,292 |
As block gas usage increases, throughput continues to increase for both designs. However, the relative advantage of batched I/O BAL decreases steadily, from roughly 20% at small block sizes to nearly zero at large block sizes.
In addition, increasing the thread count from 16 to 32 for batched I/O BAL provides little performance benefit, indicating that the workload becomes CPU-bound rather than I/O-bound. This behavior is likely due to RocksDB cache lookups and CPU-intensive value encoding/decoding, which limit further I/O scaling.
| BAL Design | RLP-Encoded BAL Size |
|---|---|
| Batched I/O BAL (with reads) | 110 KB |
| Parallel I/O BAL (without reads) | 71 KB (35% smaller) |
Crucially, the average RLP-encoded BAL size for parallel I/O is approximately 35% smaller than that of batched I/O. Given that large blocks expose execution bottlenecks beyond I/O reads alone, the additional network and storage overhead of read hints makes parallel I/O the more attractive BAL design choice overall. Detailed BAL size measurements are also available in the above benchmark suite.
Worst-Case I/O Load Analysis with Simulated Data
To complement the average-case results, we now consider the worst-case I/O load scenario, where disk reads dominate transaction execution.
To simulate this setting, we construct synthetic transactions that maximize storage access pressure. Specifically, we generate transactions whose opcode stream is filled with calls to a contract performing repeated SLOAD(x) operations, where x is the hash of a random value. Without BAL-provided read locations, such transactions must execute the SLOAD opcodes sequentially to fetch storage state, representing a worst-case I/O-bound workload.
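Our reading of that construction, expressed as a hypothetical generator for the contract body (the opcode sequence follows the standard EVM encoding; the exact generator used in the benchmark suite may differ):

```rust
/// Hypothetical generator for the worst-case payload: hash a fresh 32-byte
/// word in the EVM, then SLOAD the resulting (essentially random) slot.
/// Opcode bytes: PUSH32=0x7f, PUSH1=0x60, MSTORE=0x52, KECCAK256=0x20,
/// SLOAD=0x54, POP=0x50, STOP=0x00.
fn worst_case_bytecode(n_reads: usize) -> Vec<u8> {
    let mut code = Vec::new();
    for i in 0..n_reads {
        // PUSH32 <pseudo-random word>
        code.push(0x7f);
        code.extend((0..32).map(|j| ((i * 31 + j) % 251) as u8));
        // PUSH1 0x00; MSTORE              -- put the word at memory offset 0
        code.extend([0x60, 0x00, 0x52]);
        // PUSH1 0x20; PUSH1 0x00; KECCAK256   -- x = keccak(word)
        code.extend([0x60, 0x20, 0x60, 0x00, 0x20]);
        // SLOAD; POP                      -- read slot x, discard the value
        code.extend([0x54, 0x50]);
    }
    code.push(0x00); // STOP
    code
}
```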
Given the current per-transaction gas limit of 16 million gas, and a per-slot state read cost of approximately 2,000 gas for SLOAD plus ~39 gas of keccak hashing overhead, a single transaction can perform at most

$$
\frac{16{,}000{,}000}{2{,}039} \approx 7{,}845
$$

distinct storage reads. Using this configuration, we simulate worst-case I/O-load transactions with the mainnet database.
The resulting performance comparison between batched and parallel I/O designs is shown below:
| I/O Type | Threads | Total Execution Time (ms) | Avg. Block Gas (M) | Throughput (MGas/s) |
|---|---|---|---|---|
| Batched | 16 | 14.4 | 64 | 4,571 |
| Batched | 32 | 11.2 | 64 | 5,818 |
| Batched | 48 | 10.7 | 64 | 6,400 |
| Batched | 64 | 10.7 | 64 | 5,333 |
| Parallel | 4 | 82.5 | 64 | 780 |
| Batched | 16 | 42.6 | 640 | 11,034 |
| Batched | 32 | 58.2 | 640 | 12,307 |
| Batched | 32 | 60.3 | 640 | 10,158 |
| Parallel | 16 | 82.2 | 640 | 7,804 |
Under a lower block gas usage (64M gas), batched I/O BAL achieves its best throughput at 48 threads, reaching nearly 8× the throughput of parallel I/O BAL. This confirms that explicit I/O batching is highly effective when storage reads dominate execution.
However, it is important to interpret these results in an end-to-end execution context. Even in the worst-case I/O load scenario, the total execution time for parallel I/O BAL remains well below the current attestation deadline (~3 seconds). Moreover, as there are no state changes in this case, parallel execution excludes merklization and state commit costs, which together account for nearly 50% of total execution time in realistic parallel execution pipelines.
In the 10× gas-usage mega-block setting (640M gas), the performance gap narrows further: batched I/O BAL outperforms parallel I/O BAL by only ~1.6×, while both remain comfortably within validation time constraints.
| I/O Type | Avg. Block Gas (M) | Optimal Throughput (MGas/s) | RLP-Encoded BAL Size |
|---|---|---|---|
| Batched | 64 | 6,400 | 251 KB |
| Parallel | 64 | 780 | 0 KB |
| Batched | 640 | 15,238 | 2,511 KB |
| Parallel | 640 | 6,153 | 0 KB |
Taken together, under worst-case I/O-heavy workloads, we observe the following:
- Under current mainnet gas limits:
- Batched-I/O BAL achieves up to an 8× throughput improvement over parallel-I/O BAL. However, when considering end-to-end block processing time, I/O reads are not the dominant bottleneck in this regime.
- Under 10× gas limits:
- The performance advantage of batched-I/O BAL narrows significantly, delivering ~1.6× the throughput of parallel-I/O BAL while incurring an additional ~2.5 MB of BAL size overhead, which is non-negligible.
These results reinforce a key insight: although batched-I/O BAL delivers the best performance under pathological, I/O-saturated workloads, parallel-I/O BAL remains sufficiently robust even in worst-case scenarios—without incurring the additional BAL size overhead introduced by batching.