Here is a write-up on a more “middle-of-the-road” approach to mitigating this issue, which I’m now referring to as the “Data Retrieval Problem”.
Here I pose an approach which:
- Attempts to limit the ability to create stateless clients via “abuse/misuse” of the on-demand state retrieval primitives which are aimed at facilitating full node syncing.
- Greatly improves the overall efficiency of the network with respect to syncing the state (unlocking the data prison).
- Facilitates the creation of stateless clients which expose limited user-facing functionality (they would be unable to handle arbitrary `eth_call`-type requests which touch arbitrary state).
First-class stateless clients via witnesses
First, we need to allow stateless clients to exist in a first-class way via witnesses. We can theoretically do this now, without any of the efficiency improvements that binary tries and code merklization provide. Care would need to be taken to ensure that “attack” blocks which produce huge witnesses don’t affect the traditional network.
This would involve coming to a “final” version of the witness spec and ideally including a hash reference to the canonical witness in the block header. We might need to include a “chunking” mechanism in there somewhere to ensure that retrieval of large witnesses can be done in a manner that allows incremental verification, but I believe the current witness spec already provides this.
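As a purely illustrative sketch of what this could look like, here is a hypothetical header extension and chunk shape (none of these names come from the witness spec; they are assumptions for the example):

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HeaderWithWitness:
    # Existing header fields elided; the one addition is a hash
    # reference (commitment) to the canonical witness for this block.
    parent_hash: bytes   # 32 bytes
    state_root: bytes    # 32 bytes
    witness_hash: bytes  # 32 bytes, hash of the canonical block witness


@dataclass
class WitnessChunk:
    # One piece of a large witness. For incremental verification, each
    # chunk would need to carry enough trie structure to be checked
    # against witness_hash / state_root without the remaining chunks.
    index: int
    trie_nodes: List[bytes]
```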
By formalizing witnesses and placing them in the core networking protocol, we allow beam syncing and stateless clients to exist on the network without abusing the state retrieval mechanisms used by nodes which are working to become full nodes.
State sync scalability via swarming behavior
Next, we aim to improve the efficiency of nodes which are working towards being full nodes by syncing the state. Alexey has used the term “data prison” to refer to the observation that fast sync mostly works because it pulls state from multiple peers; attempting to pull the full state from a single peer tends to run up against the limits of how quickly a single hard drive can read the data.
To accomplish this we need to come up with an agreed upon algorithm that nodes syncing the state will use to determine which parts of the state they fetch at a given time. This approach needs to have the following properties.
- Zero active coordination. This needs to be emergent behavior based on publicly available information.
- Swarming. Two nodes which are both in the process of syncing the state will converge on requesting the same data.
There are likely many ways to do this. I pose the following approach as a simple example that has the desired properties, as a starting point from which we can iterate.
First, we treat the key space of the state trie as a continuous range `0x0000...0000 -> 0xFFFF...FFFF` (the range wraps around at the upper end back to the lower end). Nodes which are syncing the state use an out-of-protocol, agreed-upon “epoch” length. For the purposes of this post we will use `EPOCH_LENGTH = 256`. We define epoch boundaries to be `BLOCK_NUMBER % 256 == 0`. At each epoch boundary, a node which is syncing the state will “pivot”. A node that is just coming online will use the last epoch boundary as its starting point.
At each “pivot”, a syncing node looks at the block hash of the boundary block. Supposing the block hash was `0x00001...00000`, the node would begin iterating through the state trie in both directions from that key, fetching missing values as they are encountered.
This approach results in uncoordinated swarming behavior, allowing nodes which are only partially synced to more reliably serve the requests of other syncing nodes.
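Below is a minimal sketch of the pivot computation and traversal order, treating trie keys as 256-bit integers. A real implementation would walk the trie rather than enumerate raw keys, and the boundary block hash would come from the node’s own chain data:

```python
EPOCH_LENGTH = 256
KEY_SPACE = 2**256  # 32-byte keys; the range wraps around


def latest_epoch_boundary(block_number: int) -> int:
    # The most recent block satisfying BLOCK_NUMBER % EPOCH_LENGTH == 0.
    return block_number - (block_number % EPOCH_LENGTH)


def pivot_key(boundary_block_hash: bytes) -> int:
    # Interpret the boundary block's hash as a position in the key space.
    return int.from_bytes(boundary_block_hash, "big")


def traversal_order(pivot: int, count: int):
    # Yield keys outward from the pivot in both directions, wrapping at
    # the ends of the key space. This only illustrates the ordering that
    # two independently syncing nodes would converge on.
    for i in range(1, count + 1):
        yield (pivot + i) % KEY_SPACE
        yield (pivot - i) % KEY_SPACE
```

Since every syncing node derives the same pivot from the same publicly available block hash, the swarming is emergent and requires zero active coordination.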
Limiting the abuse of state sync protocols
This builds from the previous point which adds uncoordinated swarming behavior to nodes that are syncing the state.
A full node may decide not to serve requests for state that is far enough away from the current or most recent epoch boundaries. Supposing the pivot key at the latest epoch boundary was `0x5555...5555`, a node may choose to refuse to serve a request for state under the key `0xaaaa...aaaa`, since that key is very far away from the current place where it would expect nodes to be requesting the state.
A full node could choose to implement more advanced heuristics, such as keeping track of the keys a given peer has recently requested and refusing to serve requests that hit widely disparate key locations.
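As a sketch of the simplest such heuristic, a server can measure the circular distance between a requested key and the current pivot and refuse anything outside some tolerance (the `MAX_DISTANCE` value below is an arbitrary illustrative knob, not a recommendation):

```python
KEY_SPACE = 2**256
MAX_DISTANCE = KEY_SPACE // 64  # arbitrary tolerance, for illustration only


def circular_distance(a: int, b: int) -> int:
    # Distance on the wrapped key range: take the shorter way around.
    d = abs(a - b) % KEY_SPACE
    return min(d, KEY_SPACE - d)


def should_serve(requested_key: int, pivot: int) -> bool:
    # Refuse requests that land far from where swarming peers should
    # currently be reading.
    return circular_distance(requested_key, pivot) <= MAX_DISTANCE
```

With a pivot of `0x5555...5555`, a request under `0xaaaa...aaaa` sits roughly a third of the key space away and would be refused.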
The goal here is to give full nodes heuristics which are reliable enough both to discriminate against nodes which are not following the swarming behavior and to make it more difficult for nodes to abuse the state retrieval mechanisms to implement the types of stateless client functionality that the network is unable to broadly support.
More research is needed to determine whether this approach to discrimination would be effective or whether it would be easily circumvented.
Summary
The proposal above hinges on retiring `GetNodeData` and replacing it with a primitive which allows the serving node to know the key in the state trie that a given request falls under. This is required for full nodes to properly discriminate against clients which are abusing the protocol.
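To make the distinction concrete, here is a hypothetical sketch of the two request shapes (the replacement message name and fields are mine, not a proposed spec):

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GetNodeData:
    # Today: opaque node hashes. The server cannot tell which part of
    # the key space the requester is walking, so it has nothing to
    # apply the swarming heuristics to.
    node_hashes: List[bytes]


@dataclass
class GetStateByKey:
    # Hypothetical replacement: the request names the trie key being
    # fetched around, letting the server judge its distance from the
    # current pivot.
    state_root: bytes  # 32 bytes
    key: bytes         # 32-byte trie key
```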
This plan does not allow for stateless clients which expose full client functionality. In order for the network to support stateless clients that can expose the full JSON-RPC API, we will need to figure out a scalable approach for on-demand retrieval of arbitrary state.
This plan also does not address the desire to allow nodes to forget ancient chain data. The network will still be dependent on nodes with the full chain history to serve it to others who do not yet have it, and the network will still be subject to abuse by nodes which choose to forget ancient chain data and merely retrieve it on demand from the network. Intuition suggests this is less of a problem than on-demand retrieval of arbitrary state.