TL;DR
- We evaluate three BAL designs—full BAL, batched I/O BAL, and parallel I/O BAL—with different trade-offs between execution throughput and BAL size.
- We examine how closely the lowest-overhead design, parallel I/O BAL, can approach the throughput of full BAL.
- Parallel I/O BAL achieves ~10.8 GGas/s versus ~13.9 GGas/s for full BAL, providing 78% of the throughput with only 33% of the full BAL size.
Block-level access lists (BAL) enable parallelism, including parallel I/O and parallel execution, by explicitly encoding in the block all accounts and storage slots accessed during execution, along with their post-execution values. In our previous article, we studied the parallel execution performance of full BAL, which includes post-transaction state diffs and pre-block read keys and values. On a 16-core commodity machine, we achieved approximately 15 GGas/s of pure parallel execution throughput in a mega-block setting.
However, that study omitted two dominant constraints: I/O and BAL size overhead. In a non-prewarming scenario, I/O accounts for roughly 70% of total block processing time. Although BAL enables parallel disk reads, the effectiveness of BAL-enabled parallel I/O may depend on how much read information is embedded in the BAL itself. More detailed read hints may increase I/O parallelism, but they also inflate BAL size, directly impacting network bandwidth and storage costs. As a result, BAL admits several design variants, each representing a different trade-off between achievable parallelism and BAL size. Based on the precision of their read hints, the main designs are: full BAL, batched I/O BAL, and parallel I/O BAL.
| Name | Details | Parallel Execution | Parallel I/O | BAL Size (RLP-encoded)* |
|---|---|---|---|---|
| Full BAL | Post-transaction state diffs & pre-block read keys and values | Per-transaction | Per-hint (for verification only) | 213 KB |
| Batched I/O BAL | Post-transaction state diffs & pre-block read keys | Per-transaction | Per-hint | 110 KB |
| Parallel I/O BAL | Post-transaction state diffs | Per-transaction | Per-transaction | 71 KB (lowest; 33% of Full BAL, 64% of Batched I/O BAL) |
*Sampled from blocks #23,770,000–23,771,999
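To make these shapes concrete, the sketch below lays the three variants out as plain data structures; the type and field names are ours for illustration and do not correspond to any client’s actual encoding.

```rust
// Illustrative layouts for the three BAL variants; field names are
// hypothetical, and real clients use their own RLP-encoded formats.

type Address = [u8; 20];
type StorageKey = [u8; 32];
type StorageValue = [u8; 32];

/// Post-transaction state diff: present in all three designs, since it is
/// what makes transactions independently executable.
struct AccountDiff {
    address: Address,
    balance: Option<[u8; 32]>,
    nonce: Option<u64>,
    storage_writes: Vec<(StorageKey, StorageValue)>,
}

/// Pre-block read hint: key only (batched I/O BAL) or key plus value (full BAL).
struct ReadHint {
    address: Address,
    storage_key: StorageKey,
    value: Option<StorageValue>, // Some(..) only in full BAL, used for verification
}

/// Parallel I/O BAL: diffs only; reads happen on demand during execution.
struct ParallelIoBal {
    tx_diffs: Vec<Vec<AccountDiff>>, // one diff list per transaction
}

/// Batched I/O BAL adds read keys; full BAL additionally carries read values.
struct BatchedIoBal {
    tx_diffs: Vec<Vec<AccountDiff>>,
    read_hints: Vec<ReadHint>,
}
```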
Ideally, we would like to maximize throughput while minimizing BAL size. While full BAL delivers the highest performance, it also incurs the largest overhead. This raises a key question: to what extent can the lowest-overhead design—parallel I/O BAL—approach the throughput of full BAL? Addressing this question is the central goal of this work.
To answer it, we constructed an execution environment that explicitly includes state loading via I/O reads, with the following setup:
- A flat database for accounts, storage, and contract code, as used in Reth
- Pre-recovered transaction senders, leveraging sender recovery parallelism already implemented in most clients
- Omission of state root computation and state trie commits, whose costs can be amortized for large blocks and are not the focus of this study
Using this setup, we benchmarked per-transaction parallel execution (including parallel I/O) with different BAL designs. The results show that parallel I/O BAL still achieves ~10.8 GGas/s on a 16-core commodity machine in a mega-block setting, compared to ~13.9 GGas/s with full BAL. In other words, parallel I/O BAL delivers 78% of full BAL’s throughput with only 33% of its size, offering a practical trade-off between throughput and BAL size overhead.
The I/O Bottleneck in Ethereum Execution
Ethereum is continuing to scale L1. The Fusaka upgrade increased the gas limit from 45M to 60M, and Glamsterdam is expected to raise it further. Our previous research showed that BAL can improve execution throughput by an order of magnitude, providing a solid foundation for higher gas limits.
Despite these gains, I/O remains a major bottleneck in today’s block processing pipeline. In a non-prewarming setup, I/O accounts for roughly 70% of total execution time. Taking Reth as an example:
- Single-threaded execution with I/O (using MDBX) achieves only ~350 MGas/s
- With prewarming, I/O overhead drops to ~20%, and throughput improves to ~700 MGas/s
Although prewarming helps, substantial headroom remains. The fundamental limitation lies in sequential I/O access patterns: although modern NVMe SSDs support deep I/O queues (typically up to 64), most Ethereum clients still perform state reads sequentially and fail to fully exploit the available I/O parallelism.
BAL addresses this limitation by enabling parallel I/O, but it does so at a cost. Post-transaction state diffs are essential for parallel execution—our prior work showed they enable a 10× speedup over sequential execution. However, read values and read hints can together be comparable in size to state diffs, while the performance benefit they provide relative to this additional network and storage overhead is less clear.
This raises an important design question: if near-optimal performance can be achieved without including read values—or even read hints—BAL size could be reduced significantly, lowering network and storage costs without sacrificing throughput. To test this hypothesis, we focus on parallel I/O BAL, which includes only post-transaction state diffs and performs state reads on demand during execution.
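As a rough sketch of the access-pattern difference between the two ends of this spectrum (all names below are placeholders, not client code), hinted reads can be issued up front, whereas parallel I/O BAL relies on concurrently executing transactions to keep reads in flight:

```rust
use rayon::prelude::*;

// Placeholder types and functions; only the access pattern matters here.
struct StorageKey;
struct Tx;

fn read_slot(_key: &StorageKey) { /* one random read against the SSD */ }
fn execute_tx(_tx: &Tx) { /* lazily calls read_slot for each slot it touches */ }

// Batched I/O BAL: read hints are known up front, so every read can be
// issued before execution starts, keeping the NVMe queues full.
fn prefetch_from_hints(hints: &[StorageKey]) {
    hints.par_iter().for_each(read_slot);
}

// Parallel I/O BAL: no hints. Each transaction reads on demand, but because
// transactions execute concurrently (the state diffs remove conflicts),
// many reads are still in flight at once.
fn execute_on_demand(txs: &[Tx]) {
    txs.par_iter().for_each(execute_tx);
}
```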
Experimental Methodology
To evaluate the performance limits enabled by parallel I/O BAL, we constructed a simplified execution environment that strips out the unrelated components listed above. This allows us to measure the practical upper bound of BAL-powered parallelism.
Leveraging Reth’s high-performance execution engine and RocksDB’s multi-threaded read capabilities, we modified the Reth client to dump execution dependencies (blocks, BALs, and the last 256 block hashes), used REVM as the EVM execution engine, and introduced a RocksDB-based state provider for account, code, and storage access.
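Concretely, the state provider is little more than point lookups against RocksDB column families. The sketch below shows the general shape, with hypothetical column-family names and key encodings; the real provider sits behind REVM’s database interface and follows Reth’s table layouts.

```rust
use std::sync::Arc;
use rocksdb::{Options, DB};

/// Hypothetical flat-state provider over RocksDB column families.
/// Key encodings and column-family names here are placeholders.
pub struct RocksStateProvider {
    db: Arc<DB>, // can be shared across worker threads for parallel reads
}

impl RocksStateProvider {
    pub fn open(path: &str) -> Result<Self, rocksdb::Error> {
        let mut opts = Options::default();
        opts.create_if_missing(true);
        opts.create_missing_column_families(true);
        let db = DB::open_cf(&opts, path, ["accounts", "storage", "code"])?;
        Ok(Self { db: Arc::new(db) })
    }

    /// Raw account record for a 20-byte address (decoding elided).
    pub fn account(&self, address: &[u8; 20]) -> Result<Option<Vec<u8>>, rocksdb::Error> {
        let cf = self.db.cf_handle("accounts").expect("cf exists");
        self.db.get_cf(cf, address)
    }

    /// Raw storage value for (address, slot); the key is a simple concatenation here.
    pub fn storage(&self, address: &[u8; 20], slot: &[u8; 32]) -> Result<Option<Vec<u8>>, rocksdb::Error> {
        let cf = self.db.cf_handle("storage").expect("cf exists");
        let mut key = Vec::with_capacity(52);
        key.extend_from_slice(address);
        key.extend_from_slice(slot);
        self.db.get_cf(cf, key)
    }

    /// Contract bytecode by code hash.
    pub fn code(&self, code_hash: &[u8; 32]) -> Result<Option<Vec<u8>>, rocksdb::Error> {
        let cf = self.db.cf_handle("code").expect("cf exists");
        self.db.get_cf(cf, code_hash)
    }
}
```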
Simplifications for I/O Execution Emulation
- All transactions come with the sender already recovered (sender recovery can be fully parallelized ahead of time).
- No state root computation or trie commits are performed after execution (only flat state commits), as these costs are orthogonal to the focus of this study.
Engineering Work & Setup
- Modified the Reth client to dump full execution dependencies, including blocks, BALs, and the last 256 block hashes
- Added a RocksDB state provider for REVM to load account, code, and storage state
- Initially tested Reth’s MDBX binding, but it showed degraded performance under multi-threading; RocksDB was adopted instead, with a migration tool to convert MDBX databases to RocksDB
- For parallel I/O, a shared cache layer is used to avoid redundant reads across transactions (a minimal sketch follows this list)
- Dropped the OS page cache before each experiment
- Parallelism granularity = per-transaction
- Hardware:
- AMD Ryzen 9 5950X (16 physical cores, or 32 with hyper-threading)
- 128 GB RAM
- 7 TB RAID-0 NVMe SSD (~960k random-read IOPS for 4 KB blocks, 3.7 GB/s bandwidth)
- Dataset: 2,000 mainnet blocks (#23,770,000–23,771,999).
- Metric: gas per second = total gas used / execution time (including I/O).
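The shared cache layer mentioned in the list above can be as simple as a read-through map in front of RocksDB. A minimal standard-library sketch is shown below; the actual cache in the benchmark is more specialized, so treat this as illustrative only.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// Minimal read-through cache shared by all execution threads, so a slot
/// touched by several transactions is fetched from disk only once.
/// Keys and values are raw bytes here; the real cache is typed per table.
pub struct SharedReadCache<F>
where
    F: Fn(&[u8]) -> Option<Vec<u8>> + Sync,
{
    entries: RwLock<HashMap<Vec<u8>, Option<Vec<u8>>>>,
    load: F, // falls through to RocksDB on a miss
}

impl<F> SharedReadCache<F>
where
    F: Fn(&[u8]) -> Option<Vec<u8>> + Sync,
{
    pub fn new(load: F) -> Self {
        Self { entries: RwLock::new(HashMap::new()), load }
    }

    pub fn get(&self, key: &[u8]) -> Option<Vec<u8>> {
        // Fast path: another transaction already loaded this key.
        if let Some(v) = self.entries.read().unwrap().get(key) {
            return v.clone();
        }
        // Slow path: hit the database, then publish the result.
        // Two threads may race here and read the same key twice;
        // that is acceptable for a cache.
        let value = (self.load)(key);
        self.entries
            .write()
            .unwrap()
            .insert(key.to_vec(), value.clone());
        value
    }
}
```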
Benchmark suite available here:
GitHub - dajuguan/evm-benchmark
Results
We first evaluated Ethereum mainnet blocks under parallel I/O and parallel execution with parallel I/O BAL, varying the thread count. The results reveal a clear critical path dominated by the longest-running transactions. To mitigate this, we simulated larger block gas limits, which unlock substantially more parallelism when using BAL.
With 16 threads and a 1G-gas block, parallel I/O BAL achieves a throughput of ~10.8 GGas/s, reaching 78% of the ~13.9 GGas/s achieved by full BAL. Crucially, this performance comes with an average BAL size of only ~71 KB, a ~67% reduction compared to full BAL.
Critical Path Analysis in Parallel I/O and Parallel Execution
To evaluate the actual speedup and the effect of Amdahl’s law on transaction-level parallelism, we ran per-transaction parallel execution experiments and quantified how the longest-running transactions limit the achievable speedup.
Detailed results are shown below (where “Longest TXs Latency” is the total execution time, including I/O, of the longest-running transactions in each block):
| Threads | Throughput (MGas/s) | Longest TXs Latency | Total Time |
|---|---|---|---|
| 1 | 740 | 6.85s | 60.62s |
| 2 | 1,447 | 6.75s | 31.00s |
| 4 | 2,167 | 8.11s | 20.70s |
| 8 | 2,994 | 9.02s | 14.98s |
| 16 | 3,220 | 8.92s | 13.93s |
| 32 | 3,253 | 9.57s | 13.79s |
Overall, the results closely follow Amdahl’s law. Although throughput increases with more threads, total execution time is constrained by the longest transaction. Under 16 threads, the longest transactions account for ~75% of total execution time, limiting speedup to ~4× rather than the ideal 16×.
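To make this concrete with the numbers from the table (60.62 s single-threaded, 13.93 s with 16 threads, 8.92 s for the longest transactions):

$$
\text{observed speedup} = \frac{60.62\ \text{s}}{13.93\ \text{s}} \approx 4.4\times,
\qquad
\text{critical-path bound} = \frac{60.62\ \text{s}}{8.92\ \text{s}} \approx 6.8\times
$$

No amount of additional parallelism can push the speedup past the critical-path bound; the remaining gap is likely scheduling and other serial overhead.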
To overcome this limitation, we increased the simulated block gas limit, as described in the next section.
When the thread count exceeds the number of physical cores (e.g., 32 threads on 16 cores), performance no longer improves. While I/O itself can scale beyond the physical core count, throughput is likely limited by RocksDB cache lookups (indexes, bloom filters, data blocks) and CPU-intensive value encoding/decoding.
Mega Blocks Enable Massive Parallelism
To overcome per-block critical-path limits, we experimented with higher-gas “mega blocks,” as in our previous work, to increase parallelism. To simulate this, we executed the transactions of multiple consecutive mainnet blocks (a mega block, or batch) in parallel, and committed the state to the database only after all transactions in the batch had completed. This effectively aggregates multiple blocks into a single large execution unit.
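In sketch form, the batching loop looks roughly like the following; the types and functions (Tx, StateDiff, execute_tx, commit) are placeholders for the real pipeline, not the benchmark code itself.

```rust
use rayon::prelude::*;

// Placeholders standing in for the real execution pipeline.
struct Tx;
struct StateDiff;
fn execute_tx(_tx: &Tx) -> StateDiff { StateDiff }
fn commit(_diffs: &[StateDiff]) { /* single flat-state commit for the whole batch */ }

/// Execute the transactions of `batch_size` consecutive blocks as one
/// "mega block": run everything in parallel, commit once at the end.
fn execute_mega_blocks(blocks: &[Vec<Tx>], batch_size: usize) {
    for batch in blocks.chunks(batch_size) {
        // All transactions in the batch form one large execution unit;
        // BAL state diffs make them independently executable.
        let diffs: Vec<StateDiff> = batch
            .par_iter()
            .flat_map(|block_txs| block_txs.par_iter())
            .map(execute_tx)
            .collect();
        // State is committed to the database only after the whole batch finishes.
        commit(&diffs);
    }
}
```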
We evaluated a batch of 50 blocks, simulating an average block gas usage of 1,121 M, across different thread counts. Full results are shown below:
| Threads | Throughput (MGas/s) | Longest TXs Latency | Total Time |
|---|---|---|---|
| 1 | 943 | 0.53s | 47.55s |
| 2 | 1,857 | 0.53s | 24.16s |
| 4 | 3,505 | 0.56s | 12.80s |
| 8 | 6,524 | 0.57s | 6.88s |
| 16 | 10,842 | 0.61s | 4.13s |
| 32 | 10,794 | 1.07s | 4.14s |
With mega blocks, the longest-running transactions no longer dominate the critical path; they contribute less than 15% of total execution time under 16 threads. Throughput scales almost linearly with thread count, reaching ~10.8 GGas/s (78% of full BAL performance) while reducing BAL size by 67% relative to full BAL.
| BAL Design | RLP-Encoded BAL Size | Throughput with 16 threads |
|---|---|---|
| Full BAL | 213 KB | 13,881 MGas/s |
| Parallel I/O BAL | 71 KB (33% of 213 KB) | 10,842 Mgas/s |
Conclusion
This study demonstrates that parallel I/O BAL approaches the performance of full BAL while substantially reducing BAL size. In mega-block settings, parallel I/O BAL sustains approximately 10.8 GGas/s (~78% of full BAL throughput), while reducing BAL size overhead to about 33% of that of full BAL. This makes parallel I/O BAL a practical and efficient design choice, balancing throughput against network and storage overhead.
Overall, these results establish a practical upper bound for parallel I/O BAL-powered parallel execution and provide actionable insights for Ethereum client optimizations and future L1 scaling efforts.
Other works
In addition to execution benchmarks, we compared RocksDB and MDBX under synthetic random-read workloads and EVM execution, and examined the trade-offs between parallel I/O BAL and batched I/O BAL across different block gas limits.
MDBX vs. RocksDB Random Read Benchmark
We first benchmarked raw random-read performance for MDBX and RocksDB on the same hardware used in prior experiments, varying the number of reader threads to assess scalability. The database configuration was as follows:
| Item | Value |
|---|---|
| Key size | 16 bytes |
| Value size | 32 bytes |
| Entries | 1.6 billion |
| RocksDB size | 85 GB |
| MDBX size | 125 GB |
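The reader side of such a benchmark is essentially a tight point-lookup loop. The sketch below shows the RocksDB variant using the rocksdb crate, with key generation and timing simplified; the linked ioarena fork is the authoritative implementation.

```rust
use std::sync::Arc;
use std::time::Instant;
use rocksdb::DB;

/// Spawn `n_threads` readers issuing `reads_per_thread` point lookups each
/// and report aggregate IOPS. Keys are picked pseudo-randomly from the
/// 16-byte key space used in the benchmark.
fn random_read_bench(db: Arc<DB>, n_threads: usize, reads_per_thread: usize) -> f64 {
    let start = Instant::now();
    let handles: Vec<_> = (0..n_threads)
        .map(|t| {
            let db = Arc::clone(&db);
            std::thread::spawn(move || {
                // Cheap xorshift state, seeded per thread.
                let mut x = (t as u64).wrapping_mul(0x9E37_79B9_7F4A_7C15) | 1;
                for _ in 0..reads_per_thread {
                    x ^= x << 13;
                    x ^= x >> 7;
                    x ^= x << 17;
                    let mut key = [0u8; 16];
                    key[..8].copy_from_slice(&x.to_be_bytes());
                    let _ = db.get(key); // point lookup; result ignored
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    let total = (n_threads * reads_per_thread) as f64;
    total / start.elapsed().as_secs_f64() // aggregate IOPS
}
```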
Detailed results:
| Threads | Database | IOPS | Avg Latency (µs) | CPU Usage (%) |
|---|---|---|---|---|
| 2 | RocksDB | 12K | 160 | 1.1 |
| 2 | MDBX | 21K | 85 | 0.8 |
| 4 | RocksDB | 30K | 130 | 2.2 |
| 4 | MDBX | 48K | 84 | 1.3 |
| 8 | RocksDB | 85K | 92 | 4.5 |
| 8 | MDBX | 97K | 83 | 2.5 |
| 16 | RocksDB | 180K | 90 | 8 |
| 16 | MDBX | 180K | 86 | 6 |
| 32 | RocksDB | 320K | 110 | 24 |
| 32 | MDBX | 360K | 90 | 13 |
Both RocksDB and MDBX scale throughput nearly linearly with thread count, even beyond the 16 physical cores. Once thread counts exceed 8, the differences in IOPS and latency between the two databases become minimal.
Benchmark suite available at: GitHub - dajuguan/ioarena: Embedded storage benchmarking tool for libmdbx, rocksdb, lmdb, etc.
MDBX vs. RocksDB EVM Execution Benchmark with Parallel I/O Setup
We then evaluated EVM execution throughput with parallel I/O using MDBX and compared it against RocksDB, under a block gas usage of 1,121 M. Detailed results:
| Threads | Database | Throughput (MGas/s) |
|---|---|---|
| 8 | MDBX | 2,369 |
| 8 | RocksDB | 6,524 |
| 16 | MDBX | 3,705 |
| 16 | RocksDB | 10,842 |
| 32 | MDBX | 5,748 |
| 48 | MDBX | 6,662 |
| 64 | MDBX | 6,525 |
Despite similar raw I/O performance, execution throughput with MDBX is significantly lower. This discrepancy is likely due to the current usage of Reth’s MDBX binding, which does not fully exploit the underlying I/O parallelism. In particular, proper management of shared readers across threads could improve performance, but we have not yet found an effective approach.
Parallel I/O vs. Batched I/O across Gas Limits
The previous analysis primarily focused on parallel I/O, where state is fetched on demand during execution. However, batched I/O may offer advantages in scenarios where some transactions are highly I/O-intensive and can better exploit I/O parallelism beyond the number of physical CPU cores.
To evaluate this trade-off, we compared parallel I/O BAL and batched I/O BAL across different I/O load patterns and measured how execution throughput scales under the two BAL designs.
Average I/O Load Analysis with Mainnet data
We begin with the average-case analysis, where storage reads account for only a fraction of the instructions executed within each transaction—a setting that closely reflects typical mainnet workloads. The following table summarizes the throughput results under different BAL designs, thread counts, and block gas usage.
| I/O Type | Threads | Block Batch Size | Avg. Block Gas (M) | Throughput (MGas/s) |
|---|---|---|---|---|
| Batched | 16 | 1 | 22 | 3,587 |
| Batched | 32 | 1 | 22 | 3,333 |
| Parallel | 16 | 1 | 22 | 2,893 |
| Batched | 16 | 10 | 224 | 7,221 |
| Batched | 32 | 10 | 224 | 6,725 |
| Parallel | 16 | 10 | 224 | 6,842 |
| Batched | 16 | 50 | 1,121 | 10,159 |
| Batched | 32 | 50 | 1,121 | 10,259 |
| Parallel | 16 | 50 | 1,121 | 10,842 |
| Batched | 16 | 100 | 2,243 | 11,129 |
| Batched | 32 | 100 | 2,243 | 11,266 |
| Parallel | 16 | 100 | 2,243 | 11,292 |
As block gas usage increases, throughput continues to increase for both designs. However, the relative advantage of batched I/O BAL decreases steadily, from roughly 20% at small block sizes to nearly zero at large block sizes.
In addition, increasing the thread count from 16 to 32 for batched I/O BAL provides little performance benefit, indicating that the workload becomes CPU-bound rather than I/O-bound. This behavior is likely due to RocksDB cache lookups and CPU-intensive value encoding/decoding, which limit further I/O scaling.
| BAL Design | RLP-Encoded BAL Size |
|---|---|
| Batched I/O BAL (with reads) | 110 KB |
| Parallel I/O BAL (without reads) | 71 KB (35% smaller) |
Crucially, the average RLP-encoded BAL size for parallel I/O is approximately 35% smaller than that of batched I/O. Given that large blocks expose execution bottlenecks beyond I/O reads alone, the additional network and storage overhead of read hints makes parallel I/O the more attractive BAL design choice overall. Detailed BAL size measurements are also available in the above benchmark suite.
Worst-Case I/O Load Analysis with Simulated Data
To complement the average-case results, we now consider the worst-case I/O load scenario, where disk reads dominate transaction execution.
To simulate this setting, we construct synthetic transactions that maximize storage access pressure. Specifically, we generate transactions whose opcode stream is filled with calls to a contract performing repeated SLOAD(x) operations, where x is the hash of a random value. Without BAL-provided read locations, such transactions must execute the SLOAD opcodes sequentially to fetch storage state, representing a worst-case I/O-bound workload.
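Our reading of that construction, expressed as a hypothetical generator for the contract body (the opcode sequence follows the standard EVM encoding; the exact generator used in the benchmark suite may differ):

```rust
/// Hypothetical generator for the worst-case payload: hash a fresh 32-byte
/// word in the EVM, then SLOAD the resulting (essentially random) slot.
/// Opcode bytes: PUSH32=0x7f, PUSH1=0x60, MSTORE=0x52, KECCAK256=0x20,
/// SLOAD=0x54, POP=0x50, STOP=0x00.
fn worst_case_bytecode(n_reads: usize) -> Vec<u8> {
    let mut code = Vec::new();
    for i in 0..n_reads {
        // PUSH32 <pseudo-random word>
        code.push(0x7f);
        code.extend((0..32).map(|j| ((i * 31 + j) % 251) as u8));
        // PUSH1 0x00; MSTORE              -- put the word at memory offset 0
        code.extend([0x60, 0x00, 0x52]);
        // PUSH1 0x20; PUSH1 0x00; KECCAK256   -- x = keccak(word)
        code.extend([0x60, 0x20, 0x60, 0x00, 0x20]);
        // SLOAD; POP                      -- read slot x, discard the value
        code.extend([0x54, 0x50]);
    }
    code.push(0x00); // STOP
    code
}
```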
Given the current per-transaction gas limit of 16 million gas, and a per-slot state read cost of approximately 2,000 gas for SLOAD plus ~39 gas of keccak hashing overhead, a single transaction can perform at most

$$
\frac{16{,}000{,}000}{2{,}039} \approx 7{,}845
$$

distinct storage reads. Using this configuration, we simulate worst-case I/O-load transactions with the mainnet database.
The resulting performance comparison between batched and parallel I/O designs is shown below:
| I/O Type | Threads | Total Execution Time (ms) | Avg. Block Gas (M) | Throughput (MGas/s) |
|---|---|---|---|---|
| Batched | 16 | 14.4 | 64 | 4,571 |
| Batched | 32 | 11.2 | 64 | 5,818 |
| Batched | 48 | 10.7 | 64 | 6,400 |
| Batched | 64 | 10.7 | 64 | 5,333 |
| Parallel | 4 | 82.5 | 64 | 780 |
| Batched | 16 | 42.6 | 640 | 11,034 |
| Batched | 32 | 58.2 | 640 | 12,307 |
| Batched | 32 | 60.3 | 640 | 10,158 |
| Parallel | 16 | 82.2 | 640 | 7,804 |
Under a lower block gas usage (64M gas), batched I/O BAL achieves its best throughput at 48 threads, reaching nearly 8× the throughput of parallel I/O BAL. This confirms that explicit I/O batching is highly effective when storage reads dominate execution.
However, it is important to interpret these results in an end-to-end execution context. Even in the worst-case I/O load scenario, the total execution time for parallel I/O BAL remains well below the current attestation deadline (~3 seconds). Moreover, as there are no state changes in this case, parallel execution excludes merklization and state commit costs, which together account for nearly 50% of total execution time in realistic parallel execution pipelines.
In the 10× gas-usage mega-block setting (640M gas), the performance gap narrows further: batched I/O BAL outperforms parallel I/O BAL by only ~1.6×, while both remain comfortably within validation time constraints.
| I/O Type | Avg. Block Gas (M) | Optimal Throughput (MGas/s) | RLP-Encoded BAL Size |
|---|---|---|---|
| Batched | 64 | 6,400 | 251 KB |
| Parallel | 64 | 780 | 0 KB |
| Batched | 640 | 15,238 | 2,511 KB |
| Parallel | 640 | 6,153 | 0 KB |
Taken together, under worst-case I/O-heavy workloads, we observe the following:
- Under current mainnet gas limits:
- Batched-I/O BAL achieves up to an 8× throughput improvement over parallel-I/O BAL. However, when considering end-to-end block processing time, I/O reads are not the dominant bottleneck in this regime.
- Under 10× gas limits:
- The performance advantage of batched-I/O BAL narrows significantly, delivering ~1.6× the throughput of parallel-I/O BAL while incurring an additional ~2.5 MB of BAL size overhead, which is non-negligible.
These results reinforce a key insight: although batched-I/O BAL delivers the best performance under pathological, I/O-saturated workloads, parallel-I/O BAL remains sufficiently robust even in worst-case scenarios—without incurring the additional BAL size overhead introduced by batching.