Binary trie format

vbuterin · July 4, 2020, 8:41pm

The goal of having only one type of node is to reduce the number of DB reads, since IO is a bottleneck in clients, and also to save the extra 32 bytes for the pointer to the extension’s child.

Another possible option is a compromise similar to what I suggested in Optimizing sparse Merkle trees, where from a hashing point of view the structure can be represented as a binary hash tree, but in the database it would be a different structure. That is, we define hash_tree_node(node) as follows:

Let tree_node = (prefix, left, right, value)
If prefix is empty, return hash(0, value) if there’s a value, otherwise hash(left, right)
If prefix is nonempty, return hash(prefix, hash_tree_node((EMPTY_PREFIX, left, right, value)))
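To make the definition above concrete, here is a minimal Python sketch. Several things in it are my assumptions for illustration and not part of the proposal: SHA-256 as the hash, values that are exactly 32 bytes, `left` and `right` held as child nodes and hashed recursively, and the hypothetical `encode_prefix` helper (one length byte followed by the bits, left-aligned in 31 bytes). That encoding makes a nonempty prefix differ from the all-zero empty prefix, but it does not by itself guarantee that a prefix can never collide with a valid hash output.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional

ZERO = bytes(32)  # assumption: 32 zero bytes stand for "0" / the empty prefix


def hash2(a: bytes, b: bytes) -> bytes:
    """Two 32-byte chunks -> one 32-byte chunk (SHA-256 is an assumption)."""
    assert len(a) == 32 and len(b) == 32
    return hashlib.sha256(a + b).digest()


def encode_prefix(bits: str) -> bytes:
    """Hypothetical encoding: 1 length byte, then the bits left-aligned in 31 bytes."""
    if not bits:
        return ZERO
    assert len(bits) <= 248 and set(bits) <= {"0", "1"}
    packed = int(bits.ljust(248, "0"), 2).to_bytes(31, "big")
    return bytes([len(bits)]) + packed


@dataclass
class Node:
    prefix: str = ""                 # bit string such as "101"; "" means empty
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    value: Optional[bytes] = None    # 32-byte value, present only on leaves


def hash_tree_node(node: Node) -> bytes:
    if node.prefix == "":
        if node.value is not None:
            # Leaf with empty prefix: hash(0, value)
            return hash2(ZERO, node.value)
        # Branch with empty prefix: hash(left, right)
        return hash2(hash_tree_node(node.left), hash_tree_node(node.right))
    # Nonempty prefix: wrap the hash of the same node with the prefix stripped
    inner = Node("", node.left, node.right, node.value)
    return hash2(encode_prefix(node.prefix), hash_tree_node(inner))
```

Every call to `hash2` takes two 32-byte chunks and returns one, so the whole structure can be walked as a plain binary hash tree regardless of how the nodes are laid out in the database.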

At the DB level, I imagine you would be compressing many of these nodes together anyway, so further differences between hash structure and storage structure don’t really matter much.

The hash definition relies on two assumptions: (i) 0 represents the empty prefix, and (ii) nonempty prefixes are distinguishable from valid hash outputs.

Hashing becomes slightly less efficient, but the loss is negligible because, as I mentioned, typically only the bottom leaf in the tree is an extension node, whereas the O(log(n)) ~= 28 nodes in the middle are not.
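As a rough back-of-the-envelope check of that claim, the sketch below counts two-chunk hash calls along one root-to-leaf path. The counting model is my assumption: every node on the path costs one hash, and a node with a nonempty prefix costs one extra hash for the wrapping step. The depth of 28 is the figure from the paragraph above, corresponding to roughly 2^28 leaves.

```python
def hashes_per_path(branch_nodes: int, extension_nodes: int) -> int:
    """Hash calls on one root-to-leaf path under the assumed counting model."""
    nodes_on_path = branch_nodes + 1          # the branch nodes plus the leaf
    return nodes_on_path + extension_nodes    # one extra hash per nonempty prefix


depth = 28                                    # ~log2(n), so n is about 2**28
baseline = hashes_per_path(depth, 0)          # no prefixes anywhere on the path
with_leaf_prefix = hashes_per_path(depth, 1)  # only the bottom leaf has a prefix
overhead = (with_leaf_prefix - baseline) / baseline

print(baseline, with_leaf_prefix, f"{overhead:.1%}")
```

Under this model the path costs 30 hashes instead of 29, an overhead of about 3.4%.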

I think my desire to have the entire tree be navigable as a binary hash tree comes from (i) a desire to be compatible with eth2’s SSZ frameworks, and (ii) a desire to be easily compatible with future ZKP frameworks, which work more easily if hashes can be assumed to be in the “two 32-byte chunks → one 32-byte chunk” format. Though if there are better ways to achieve those goals, I’m happy to listen; I’m not at all convinced that the specific thing I wrote above is optimal.