Proper disk i/o gas pricing via LRU cache

One problem I see with this strategy is that it potentially makes gas estimation much more complicated. Naively, you could have the gas estimator just treat all storage reads as uncached, thus ensuring that it always over-estimates at worst. If you try to do any other strategy you run the risk of gas estimates being wrong if the transaction is not mined quickly for any reason, and thus causing wasted gas on failing transactions (making the problem worse).

I think the naive solution here is fine, but it will be an additional increase in client code complexity, which comes with many hidden long term costs.

1 Like

Indeed, it’s an annoyance, however I don’t know in practice how frequent this would be because it would only affect the states that are almost evicted. The odds of a user doing a txn containing a slot very close to eviction seems low (especially if the cache is on the larger size), but it would happen once in a while . E.g. if the cache stores ~1 month of state data, you would need to do a txn today for given state and almost exactly 1 month later and then have that state evicted right before your txn is included, but after you submitted it. If cache was tiny, then yes this problem could be more frequent.

A more complex strategy for wallets aligned with this reality would be to detect if state accessed by txn is close to eviction or not and if it is, estimate higher. Doesn’t seem trivial to do client side however.

If you go with the non-naive gas estimation strategy, then you introduce complicated cache flushing attack vectors, where an attacker my do things to cause other people’s transactions to fail due to flushing the cache. While this isn’t necessarily breaking, it adds yet another layer of complexity that needs to be considered when analyzing attacks against/security of Ethereum.

1 Like

I think there is a middle ground strategy. The node wouldn’t only know if a slot is in the cache or not, but would also know how close to being evicted it is.

Let’s say that we go for a relatively small cache of 5 million slots; the cheapest way for an attacker to evict some slot from the cache would be with an SLOAD. With current gas prices, you could theoretically push approximately 7142 slots into the cache (15,000,000 / 2100) per block. This means that if a slot is among the top 2.5 million (halfway up in the cache), then it would be impossible to evict it in less than an hour, or approximately 350 blocks. In real life, having an hour of blocks that only do SLOADs would be madness, so the real margin is probably around 3 or 6 hours.

So, a good strategy would be to assume that those slots will be in the cache, and slots that are close to being evicted wouldn’t.

1 Like

An attacker could craft transactions that only access new slots that’re not cached yet. But ok, I was assuming you wanted to generally lower gas cost for accessing slots. But if you still keep the gas estimation for accessing slots at a value that assumes the slot is not cached and then refund gas if the slot is cached it would probably work

Following up our discussion on twitter, I still think LFU (least frequently used) instead of LRU would be better fit, especially for the gas estimation concerns. It would also help with surge periods where there’s many new slots are written. With LFU most highly used apps would be inside the cache (Uniswap, L2 inboxes, etc) and they wouldn’t go out of cache due to surges. It would also greatly reduce evictions.

Problem with LFU is that many old and not used anymore storage slots could stay in the cache for a very long time, while new applications would take a while to get in the cache even if popular when they launch. E.g. Uniswap V1 or old opensea contracts would be cached “forever”.

Also, very popular state slot will stay in the LRU just like an LFU (e.g. Uniswap, L2 inboxes, etc.) since these are touched all the time.

I think the gas improvement is greatly exaggerated. You still have to do writes, you still have to calculate roots, which both also take a long time. Maybe 2x is realistic. Looking at Reth in stage sync (not calculating roots, not saving data to disk!) is not best metric.

Some clients are also using mmap databases that basically would act similar to this kind of cache, while scaling better to available memory. I think this is currently in use in Erigon and/or maybe Reth. In Nethermind we are still experimenting with Paprika, that uses mmap directly to achieve that.

Also keep in mind that having additional few GB’s on the heap puts a lot more effort on garbage collection that is used in Geth, Erigon, Besu and Nethermind.

Overall would like to see real numbers before getting to any conclusions.

2 Likes

I don’t have a great sense after reading this post, which seems interesting on the face of it by the way, what the trade-off curve looks like between the preliminary RAM req numbers you put here vs the expected benefit from such a change?

For example, how much of the benefit outlined would we lose if we took those min RAM requirements down 50%? Is it 1:1, less, more, etc?

1 Like

I recently collected some data very relevant to this topic.

This is a summary of state reads + state writes for balances/codes/nonces/slots over ~100 days of recent ethereum mainnet data (~750K blocks)

These tables / charts are derived from parity trace_block stateDiffs for writes and geth debug_traceBlock preState for reads. This means that the data is at the resolution of txs, not opcodes (see overview of tracers here)

High level summary

For each datatype this table shows:

  • the number of reads
  • the number of writes
  • the ratio of reads:writes
  • the number of underlying items
  • the mean number of reads+writes per items
┌───────────────────────────────────┐
│ Summary of State Reads and Writes │
└───────────────────────────────────┘
            │           │            │         │           │   mean  │    mean  
            │           │            │         │           │  reads  │  writes  
            │           │            │    r:w  │           │    per  │     per  
            │  # reads  │  # writes  │  ratio  │  # items  │   item  │    item  
────────────┼───────────┼────────────┼─────────┼───────────┼─────────┼──────────
  balances  │   514.5M  │    296.4M  │    1.7  │    21.0M  │   24.4  │    14.1  
     codes  │   246.6M  │      2.5M  │   99.3  │     2.8M  │   87.2  │     0.9  
    nonces  │   482.6M  │    121.3M  │    4.0  │    13.3M  │   36.2  │     9.1  
     slots  │     1.2B  │    340.3M  │    3.5  │   151.4M  │    7.8  │     2.2  

At what rate does each operation occur?

This table shows the rate of reads/writes for balances/codes/nonces/slots on the timescales of seconds/blocks/days/years:

┌───────────────────────────┐
│ Rates of Reads and Writes │
└───────────────────────────┘
            │          │    mean  │     mean  │   mean  │    mean  
            │          │     per  │      per  │    per  │     per  
            │  action  │  second  │    block  │    day  │    year  
────────────┼──────────┼──────────┼───────────┼─────────┼──────────
  balances  │   reads  │    56.3  │    680.6  │   4.9M  │    1.8B  
            │  writes  │    32.4  │    392.1  │   2.8M  │    1.0B  
     codes  │   reads  │    27.0  │    326.1  │   2.3M  │  850.1M  
            │  writes  │     0.3  │      3.3  │  23.5K  │    8.6M  
    nonces  │   reads  │    52.8  │    638.4  │   4.6M  │    1.7B  
            │  writes  │    13.3  │    160.5  │   1.1M  │  418.3M  
     slots  │   reads  │   129.3  │  1,564.6  │  11.2M  │    4.1B  
            │  writes  │    37.2  │    450.2  │   3.2M  │    1.2B  

How many times does each item get read or written?

Most items are read/written in just a single block:

For an item that is accessed across multiple blocks, what is the typical interval size between reads or writes?

Most interblock intervals are <10 blocks, and much shorter for certain datatypes:

Are reads/writes concentrated to a small number of addresses?

This depends on datatype, for all datatypes, >50% of reads/writes are in < 0.01% of the underlying addresses, sometimes much smaller:




Data Summary

Each of these tables/charts is pretty dense and could be stared at for a long time. But the broad strokes are:

  • most items are accessed a very small number of times
  • most accesses are concentrated to a very small set of items
  • if an item is accessed multiple times, those accesses are likely to be close together in time
  • reads are distributed more top heavy than writes

(This is a first pass. It would be good to further validate the data before using it in production. There are some additional transformations that might be useful, e.g. filtering out the validator balance changes)

Thoughts on the current proposal

  • Tiering storage costs by frequency of access is a natural thing to do in many contexts. I would love to see something along these lines, especially 1) if cache pricing reflects the reality of how nodes are already accessing this stored data because the pricing would become more accurate, and 2) if this could increase throughput of the network without much increased storage IO burden on nodes.
  • There’s a big question of how to track access efficiently and whether the benefits outweigh the overhead. Slot reads consume much less gas than slot writes. I would like to see numbers on the expected impact…if we’re talking about trimming gas usage by less than 5% it’s probably not worth the complexity.
  • Maintaining some well-specified LRU cache logic is one approach. Another approach is just storing the last block height where each slot was accessed and then pricing access according to some recency formula. Another approach is, at the end of a block, give partial gas refunds for any slots that were read by multiple transactions (e.g. slot warmth persists across all txs in block).
  • I like the relative simplicity of the LRU approach and nodes such as Reth already use a LRU cache for many things. But I would worry about forever locking clients into a suboptimal LRU architecture in case more efficient caching architectures are developed in the future.
  • LRU is probably a good first pass, but an even more effective algorithm might try to be stickier for the historically top 0.01% - 1% most used items. Whether it’s worth the extra complexity is TBD
9 Likes

After giving it more thought I think it is recipe for reliability and consensus issues disaster. Let me explain: In order to process blocks you need to have the cache built. If you don’t have the cache your gas metering will be wrong and you have consensus issue.So now there are two situations where you don’t have the cache:

  1. Sync - you just synced the state, you would also have to sync the cache. If the cache would be additive only that would be easy, but it is not, elements are also removed from the cache which makes sync complicated - you need to trach items order too.

  2. Restarts/shutdowns - You can break restarts into:
    a) clean shutdowns - should we persist the cache? Persisting few GB of data would take quite some time, easy to misconfigure when your docker or OS gives you 30s to shutdown by default. Loading them again on startup takes time too.

b) crashes - ok we crashed and we don’t have the cache. What can we do? Only way is to sync them from network - what if a large part of the network just crashed due to some bug? Ugh…Based on this I think this is not a practical solution at all and there are better experiments with what to do with the state, so it can be more in-memory while not changing any consensus rules. And then we could just increase gas limit!But the main problem with increasing gas limit is size of storage. And it is not size of state! The main problem now is size of Blocks and Receipts, which would grow even faster. And for that EIP-4444 is more important.

1 Like

Amazing analyzes! Lots of great insights.

Check DMs please

1 Like

Good concerns.

Could be made much simpler than LRU for the consensus part, like an append only approach with "cache generations. E.g. You have 3 generations of 100mb each. When new slot is touched, you append to current generation. When generation is full (100mb is reached) you start a new generation and delete the oldest of the 3 generation. Anything present in these 3 arrays is cheaper gas wise. This way you have an append only approach, which is O(1) to update and can easily store using hash chain in state tree and for light clients. Pretty straightforward for clients.

As many mentioned, clients can chose their own caching implementation, we just need to include as little as possible as part of the consensus.

As many mentioned as well, I think we can start with a much smaller “consensus cache”, like 100 mb to 200 mb. Clients already have a caching layer they need to rebuild when node restarts, so I don’t think this is much of a concern.

1 Like

The caching clients currently have is not consensus critical so it can be dropped and lazily repopulated as reads happen without any problems. This changes the caching to a consensus datastructure, which needs to be in sync across all nodes in the network. Shrinking the size may make this more feasible, but still not trivial even at hundreds of MB.

2 Likes

A potentially simpler implementation: Just keep things warm longer

  • Any slot accessed in the past 20 blocks is considered “warm” and costs less gas to read
  • Client implementations would need to track the last-block-access for only thousands of slots (potentially just kilobytes)
  • This would not lock clients into a particular caching architecture
  • Based on the above data I think it would cover the majority of reads, but need to validate this explicitly

Could also apply this logic to keeping accounts warm instead of just keeping slots warm

How much gas is spent on cold reads?

Still very relevant is the question of how much eth gas is actually spent on reads. A rough estimate using the above data:

(1564.6 mean cold read slots per block - 450.2 mean written slots per block) = 1114.4 mean cold reads per block
(1114.4 reads) (2100 gas per cold read) = 2.34M gas per block
2.34M / 15M = 15.6% of ethereum gas is spent on cold reads

(someone please double check this math)

1 Like

I do like the simplicity of time based solutions, but clients would still need to build an in memory list of storage slot every restart. The bigger issue with any time based approach rather than data size approach is that if we need to consider how big can the cache become if an attacker was to fill these N blocks with a lot new storage slots access. The cache size is dynamic effectively, which could cause some problems for clients making assumptions about cache size (e.g. RAM saturation).

20 blocks may be small enough, but how much would it allow us to increase the gas limit without running the risk of a “cache explosion attack”?

1 Like

I’m curious. Doesn’t this create a small number of large, very active contracts that will come to dominate the cache thereby creating two tiers? Smart contracts that are always in the cache (and therefore always cheaper) vs. those that are rarely in the cache and therefore more expensive. Doesn’t this establish winners-take-all in the long run? Won’t it become increasingly more difficult for new entrants to “break in”? That seems like a bad idea to me.

1 Like

My hot take: make it a single block. First read/write of a slot for a block is more expensive than other reads/writes.

Ironically, this kind of setup does the inverse to larger caches: (generally) pushes costs onto MEV bots and away from normal users (assuming tx is some form of sandwiching, JIT, frontrunning, liquidations, etc).

Larger caches not only do have a centralizing force at the app layer as @quickBlocks mentions, it also makes it cheaper for higher frequency MEV bots pushing more blockspace to them.

2 Likes

I share some of the reservations about this idea that others have expressed above (gas estimation complexity, creation of two tiers of contracts) but, withholding that, I have some thoughts on potential specification and implementation details:

Currently, in PBS mode, go-ethereum maintains 128 diff-layers in memory. This is a cache of all the state and storage trie nodes that have been mutated in the last 128 blocks.

Only once dirty nodes have settled to the 129th layer are they written to disk. On node shutdown the 128 in-memory layers are packed into a single database “journal” entry to be easily recovered on restart.

Perhaps this existing caching scheme could be used? Any storage operation performed on the state in these diff-layers could be discounted. A 128 block range covers at least 60-85% of the read and write intervals from the data @storm provided above. While it does not have a fixed byte size, this memory usage is already expected by the majority client and is small enough to be of little concern elsewhere. The amount of additional complexity introduced to geth would be absolutely minimized.

The downsides are that

  1. This cache does not include code, it only includes data embedded directly in trie nodes (balances, nonces, code hashes, storage roots, and storage slot value).
  2. More importantly, this cache only considers state which has been mutated. It does not cache state that was recently accessed but not mutated. So it would be a “Least Recently Mutated” cache if you will.

A concern which I haven’t seen expressed yet is how this would affect block builders in particular (whether or not the building process is separated from the proposer).

Builders execute different sets of transactions to find and build the most economically optimal valid block possible. Without requiring comprehensive transaction access lists they can’t use static analysis to factor in ahead of time the access pattern of the transactions and how that relates to the in memory cache. They only discover the access pattern while speculatively executing transactions. Their access pattern will be all over the place and the eventual access pattern (set of txs) that is proposed does not reflect all the I/O work the builder performed to get there.

Furthermore if the goal is to eventually transition to stateless validators using block proposers that submit block witnesses then, in that context, the only ones ever accessing state on disk will be block producers. Validators will be gossiped block witnesses (all state required to execute the block’s set of transactions) along with the block they need to validate. The state they need- no matter when it was last accessed- will be injected directly into their memory and never need to touch the disk.

1 Like