On-Chain Non-Interactive Data Availability Proofs

musalbas · July 8, 2019, 2:59pm

The idea of including data availability verification as a pre-compiled smart contract is interesting. However, with regard to the time-dependence of data availability, I’m also concerned about this case: what happens if isAvailable returns true for everyone at a certain point, but the data is later lost because the data behind that particular hash isn’t very popular? Future nodes that validate the chain would then reject that chain and thus fork. When data availability proofs are used in the Ethereum 2.0 context, this isn’t as much of an issue, because only the availability of the block is being verified, and the data behind the block is assumed to be sufficiently popular, as the community using the chain has an interest in it (which is also the status quo in Eth1 etc. when it comes to e.g. pruned nodes).

I suppose that to prevent this, you also need the nodes that are checking data availability to store the chunks they sample in the long term, to guarantee their availability in the future. In that case, I think the scheme could have similar properties to LazyLedger, as nodes in the network are collectively helping to guarantee the availability of user-submitted data, though with higher overheads, as you have to sample from multiple erasure-coded Merkle trees of data.
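As a rough illustration of how sampling nodes could collectively retain the data, here is a minimal sketch. It assumes a simplified model that is not from this thread: every node keeps `samples_per_node` distinct chunks chosen uniformly at random and independently of other nodes. The function name and parameters are hypothetical.

```python
def expected_coverage(total_chunks: int, samples_per_node: int, nodes: int) -> float:
    """Expected fraction of chunks held by at least one node.

    Assumed model: each node stores `samples_per_node` distinct chunks chosen
    uniformly at random, independently of every other node.
    """
    if not 0 <= samples_per_node <= total_chunks:
        raise ValueError("samples_per_node must be between 0 and total_chunks")
    # Probability that one particular node does NOT hold a given chunk.
    miss_one_node = 1 - samples_per_node / total_chunks
    # Independent nodes: the chunk is lost only if every node missed it.
    return 1 - miss_one_node ** nodes


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to show the trend as the network grows.
    for n in (100, 1_000, 10_000):
        print(n, round(expected_coverage(4096, 15, n), 4))
```

Under this model, coverage depends on how many nodes sample a given hash, which is why unpopular data is the worrying case. Because the data is erasure coded, not every chunk has to survive for reconstruction to remain possible.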

The ‘data published near the time boundary’ issue seems harder to solve though, as the person who holds the data behind an arbitrary hash could release it long after a block has been generated with isAvailable being false, causing future block validators to reject that chain because isAvailable should, from their point of view, be true. The implicit voting suggestion by @vbuterin seems reasonable though. Even if isAvailable incorrectly returns false because the data was released too late, that only affects liveness, not safety, and you could re-submit the data availability check via the pre-compiled contract.

Nitpicks:

This isn’t non-interactive in the same way as my earlier attempt, as nodes verifying the chain have to interactively sample chunks of the specified hashes, i.e. you wouldn’t be able to verify blocks offline. However, I suppose you could make it non-interactive using the Fiat–Shamir heuristic, but this would mean a lot of people would be sampling the same chunks, which would reduce the security of data availability proofs. Alternatively, you could use client-specific precomputed challenges or hidden services (but that wouldn’t be completely non-interactive).
The hash-addressed data can’t be arbitrary data, but must be erasure-coded data correctly formatted for the type of data availability proof. Also, the pre-compiled contract would have to accept fraud proofs of incorrectly generated erasure codes (or some kind of proof that the erasure code is correct). I guess this means you need to factor in some delay to wait for fraud proofs before isAvailable returns true.
“The essence of sub-linear data availability proofs is client-side random sampling of erasure codes. Erasure codes allow for the reconstruction of some data (of size M encoded as N chunks) using any M of N chunks. While this may seem linear in cost, we can do better by splitting up the data in two dimensions, requiring only √M chunks to be probabilistically sampled.”

The number of chunks that need to be sampled by each client is actually $O(1)$ (and for the entire network collectively it’s $O(M)$). The reason for using 2D coding instead of 1D coding is so that fraud proofs for incorrectly constructed code can be roughly $O(\sqrt{M})$ instead of $O(M)$. The cost of performing a data availability check is $O(\sqrt{\mathsf{blocksize}} + \log(\sqrt{\mathsf{blocksize}}))$, as you need to sample a fixed number of chunks, plus Merkle proofs for those chunks (from row/column roots), which are each $\log(\sqrt{\mathsf{blocksize}})$ sized, plus $2\sqrt{\mathsf{blocksize}}$ row and column Merkle roots. So the $\sqrt{\mathsf{blocksize}}$ term isn’t for downloading chunks, but for the fixed-size Merkle roots of each row/column.
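To make this cost breakdown concrete, here is a rough sketch in Python. The function name and all the default values (15 samples, 256-byte chunks, 32-byte hashes) are assumptions for illustration, not numbers from this thread. `width` stands for the side length of the chunk matrix, i.e. the quantity that scales as $\sqrt{\mathsf{blocksize}}$, and each row or column Merkle tree is assumed to have `width` leaves.

```python
import math


def availability_check_cost(width: int, samples: int = 15,
                            chunk_bytes: int = 256, hash_bytes: int = 32) -> dict:
    """Approximate bytes one client downloads for a single availability check.

    All defaults are illustrative assumptions, not protocol constants.
    """
    if width < 1:
        raise ValueError("width must be at least 1")
    # One Merkle root per row and one per column: 2 * width roots in total.
    roots = 2 * width * hash_bytes
    # Each sampled chunk is accompanied by a Merkle proof against its row
    # (or column) root; the proof has about log2(width) sibling hashes.
    proof_hashes = math.ceil(math.log2(width))
    per_sample = chunk_bytes + proof_hashes * hash_bytes
    sampled = samples * per_sample
    return {"roots": roots, "samples": sampled, "total": roots + sampled}


if __name__ == "__main__":
    for w in (64, 256, 1024):
        print(w, availability_check_cost(w))
```

In this sketch the `roots` term grows linearly in `width` while the `samples` term grows only with `log2(width)`, because the number of samples stays fixed.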