2D data availability with Kate commitments

We can use Kate commitments to create a very clean and simple 2D data availability scheme with collaborative encoding (ie. O(N^2) data can be encoded by a very loose collection of N participants with each participant only doing O(N*log(N)) work).

The scheme works as follows. Suppose that you have n blobs (D_1 ... D_n), each containing M chunks. The proposer of each blob publishes a Kate commitment to that blob. The proposer of the block (eg. the beacon chain block in eth2) runs a data availability check on each commitment and includes the Kate commitments to the blobs (C_1 ... C_n) in the block. However, they also use Lagrange interpolation to compute a low-degree extension of the Kate commitments; that is, they treat [C_1 ... C_n] as the evaluations of a polynomial at coordinates 1 ... n, and compute the evaluations at coordinates n+1 ... 2n.

Call these extended commitments C_{n+1} ... C_{2n}, and call D_{n+1} ... D_{2n} the data blobs generated by low-degree extending the original blobs column by column, ie. [D_{n+1}[k], D_{n+2}[k] ... D_{2n}[k]] is the low-degree extension of the column [D_1[k], D_2[k] ... D_n[k]].
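
To make the column-wise extension concrete, here is a minimal sketch (my own illustration, not code from the post): it extends a single column of chunks over the BLS12-381 scalar field using plain O(n^2) Lagrange interpolation at the coordinates 1 ... 2n used in the text. The helper names (lagrange_eval, extend_column) are hypothetical, and a real implementation would work over roots of unity and use FFTs instead.

```python
# Illustrative sketch: extend one column [D_1[k], ..., D_n[k]] to
# [D_{n+1}[k], ..., D_{2n}[k]] by treating the n known chunks as evaluations
# of a degree < n polynomial at x = 1..n and evaluating at x = n+1..2n.

MODULUS = 52435875175126190479447740508185965837690552500527637822603658699938581184513  # BLS12-381 scalar field order (assumed field choice)

def lagrange_eval(xs, ys, x):
    # Evaluate the unique degree < len(xs) polynomial through (xs[i], ys[i]) at x
    total = 0
    for i in range(len(xs)):
        num, den = 1, 1
        for j in range(len(xs)):
            if i == j:
                continue
            num = num * (x - xs[j]) % MODULUS
            den = den * (xs[i] - xs[j]) % MODULUS
        total = (total + ys[i] * num * pow(den, MODULUS - 2, MODULUS)) % MODULUS
    return total

def extend_column(column):
    # column = [D_1[k], ..., D_n[k]]; returns [D_{n+1}[k], ..., D_{2n}[k]]
    n = len(column)
    xs = list(range(1, n + 1))
    return [lagrange_eval(xs, column, x) for x in range(n + 1, 2 * n + 1)]

# Example: n = 4 blobs, one column of chunks
print(extend_column([5, 7, 11, 13]))
```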

Because converting between evaluations and coefficients and in-place multi-point evaluation are linear operations, you can compute C_{n+1} ... C_{2n} despite the fact that the C_i's are elliptic curve points. Furthermore, because C_i is linear in D_i (ie. comm(D_1 + D_2) = comm(D_1) + comm(D_2) and comm(D * k) = comm(D) * k, where the + and * on the right hand sides are elliptic curve addition and multiplication), we get two nice properties (a sketch checking the first one is given after the list below):

  • C_{n+1} is the correct commitment to D_{n+1} (and so on for each index up to 2n)
  • If W_{i, j} = G * (as_polynomial(D_i) // (X - \omega^j)) is the Kate witness verifying that D_i[j] actually is in that position in D_i, then you can take [W_{1, j} ... W_{n, j}] and also do a low-degree extension to generate [W_{n+1, j} ... W_{2n, j}]
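
Here is a small sketch checking the first property (my own toy demonstration, not code from the post). It assumes the py_ecc library for BLS12-381 group operations, fakes the trusted setup with a publicly known secret s (insecure, purely illustrative), and uses O(n^2) Lagrange coefficients rather than FFTs; commit, lagrange_coeffs and all parameter values are hypothetical choices for the demo.

```python
# Check: Lagrange-extending the commitments [C_1 ... C_n] gives exactly the
# commitments to the Lagrange-extended blobs [D_{n+1} ... D_{2n}], because the
# commitment is linear in the blob data.

from py_ecc.bls12_381 import G1, add, multiply, curve_order

s = 1927409816240961209460912649124   # toy trusted-setup secret (INSECURE: public here)
M = 4                                  # chunks per blob (illustrative)
SETUP = [multiply(G1, pow(s, k, curve_order)) for k in range(M)]

def commit(blob):
    # Kate-style commitment: sum_k blob[k] * s^k * G1, built from the setup points
    out = None  # None is the identity point in py_ecc's non-optimized API
    for k, chunk in enumerate(blob):
        out = add(out, multiply(SETUP[k], chunk % curve_order))
    return out

def lagrange_coeffs(n, x):
    # Coefficients c_i with P(x) = sum_i c_i * P(i) for any degree < n polynomial
    # known at coordinates 1..n (all arithmetic mod curve_order)
    coeffs = []
    for i in range(1, n + 1):
        num, den = 1, 1
        for j in range(1, n + 1):
            if i == j:
                continue
            num = num * (x - j) % curve_order
            den = den * (i - j) % curve_order
        coeffs.append(num * pow(den, curve_order - 2, curve_order) % curve_order)
    return coeffs

n = 4
blobs = [[(7 * i + k) % curve_order for k in range(M)] for i in range(1, n + 1)]  # D_1 .. D_n
commitments = [commit(D) for D in blobs]                                          # C_1 .. C_n

for x in range(n + 1, 2 * n + 1):
    cs = lagrange_coeffs(n, x)
    # Extend the data column by column to get D_x ...
    D_x = [sum(c * blobs[i][k] for i, c in enumerate(cs)) % curve_order for k in range(M)]
    # ... and extend the commitments directly, without touching the data
    C_x = None
    for i, c in enumerate(cs):
        C_x = add(C_x, multiply(commitments[i], c))
    assert C_x == commit(D_x)  # C_x really is the commitment to D_x
print("extended commitments match commitments to extended blobs")
```

The assertion passes precisely because the commitment is a linear function of the blob, so "extend then commit" and "commit then extend" agree.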

To recap:

  • The block producer can generate the commitments C_{n+1} ... C_{2n} to the “redundancy batches” D_{n+1} ... D_{2n} without knowing anything except the commitments
  • If you know D_1[j] ... D_n[j] for some column j, you can reconstruct the values in that position in all of the redundancy batches, and those values are immediately verifiable

This gives us the conditions needed to have a very efficient protocol for computing and publishing the entire extended data. The design would work as follows:

  • When doing the initial round of data availability sampling, each node would use the same indices for each batch. They would as a result learn D_1[j] ... D_n[j] and W_{1,j} ... W_{n,j} for some random j
  • They can then reconstruct D_{n+1}[j] ... D_{2n}[j] and W_{n+1,j} ... W_{2n,j}
  • If there are \ge O(M) nodes (reminder: M is the number of chunks), and the data is available, then with high probability for every row i \in [n+1, 2n] there will be a node that learns D_i[j] for enough positions j that if they republish that data, all of D_i can be reconstructed
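
One way to get a feel for the "with high probability" claim (a back-of-the-envelope proxy of my own, not an analysis from the post): if a set of nodes each sample a few of the M column indices uniformly at random, a union bound over columns bounds the chance that any column goes completely unsampled, and full column coverage means every extended chunk D_i[j] gets computed by somebody. The parameters below are purely illustrative.

```python
M = 256                 # chunks (columns) per blob         -- illustrative
num_nodes = 1000        # nodes doing availability sampling -- illustrative
samples_per_node = 30   # column indices sampled per node   -- illustrative

# Union bound: P(some column is sampled by no node) <= M * (1 - samples_per_node/M)^num_nodes
p_missed = M * (1 - samples_per_node / M) ** num_nodes
print(f"P(some column goes unsampled) <= {p_missed:.3g}")
```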

This gives us an end state similar to the 2D encoding here, except (i) we avoid any fraud proofs, and (ii) we avoid the need for one node to serve as the bottleneck that aggregates all the data to generate the extended Merkle root.


I really like this scheme. It is very elegant! For this scheme it might make a lot of sense to keep the data blocks the same size.

One note: while this is still O(n \log n), it will need an FFT in the group. Quite doable though since this isn’t expected to be big.

For this scheme it might make a lot of sense to keep the data blocks the same size.

Indeed! Though if we want to use it in an EIP 1559 context it could make sense to do what we were already doing for the Merkle data availability scheme and split each block up into multiple rows.

One note: while this is still O(n \log n), it will need an FFT in the group. Quite doable though since this isn’t expected to be big.

Right. Even if we adopt the “split each block into multiple rows” technique, there would still be at most a few hundred rows.
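
For concreteness, here is a minimal sketch of what an "FFT in the group" looks like (an illustration under my own assumptions, not code from this thread): standard radix-2 Cooley-Tukey, with the field multiplications replaced by scalar multiplications of BLS12-381 G1 points and the field additions replaced by point additions, using the py_ecc library.

```python
from py_ecc.bls12_381 import G1, add, multiply, neg, curve_order

def group_fft(points, root_of_unity):
    # points: list of G1 points, length a power of two
    # root_of_unity: scalar whose multiplicative order equals len(points)
    n = len(points)
    if n == 1:
        return list(points)
    w_sq = root_of_unity * root_of_unity % curve_order
    even = group_fft(points[0::2], w_sq)
    odd = group_fft(points[1::2], w_sq)
    out = [None] * n
    w = 1
    for i in range(n // 2):
        twiddle = multiply(odd[i], w)        # scalar multiplication instead of field multiplication
        out[i] = add(even[i], twiddle)       # point addition instead of field addition
        out[i + n // 2] = add(even[i], neg(twiddle))
        w = w * root_of_unity % curve_order
    return out

# Example: transform four toy "commitments" (small multiples of G1).
# 7 is the primitive root commonly used for the BLS12-381 scalar field,
# so this computes an n-th root of unity for the transform size.
n = 4
root = pow(7, (curve_order - 1) // n, curve_order)
commitments = [multiply(G1, i + 1) for i in range(n)]
for point in group_fft(commitments, root):
    print(point)
```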