Binary trie format

Probably this is a dumb question, but since binarification requires all this recalculation, how about considering a switch from keccak256 to sha256 for the state trie?

The reasoning of Eth2 for using sha256 can be read here, including a claim about security.

I can think of two potential, but questionable, benefits:

  1. Apparently one of the main benefits of sha256 is interoperability with other chains – could this potentially enable other chains to more easily verify proofs of Eth1 blocks? I don’t think it should make a huge difference anyhow.

  2. Making Eth1 and Eth2 closer in terms of cryptographic primitives. Again, this is very slim, given there are more substantial differences between the two, starting with bls12-381 vs secp256k1.

Would there be any (other) benefit? Probably no significant ones, but maybe it is worth briefly entertaining the idea and discussing it.


Would there be any (other) benefit? Probably no significant ones, but maybe it is worth briefly entertaining the idea and discussing it.

I can see three other potential benefits:

  1. sha256 is available in many more languages and platforms, and has better coverage from security audits
  2. sha256 is faster than Keccak256
  3. keccak256 has a security flaw that sha256 doesn’t

Point 1) is a given, so I don’t think there is a big consensus risk in switching. Still, unless either 2) or 3) is true, I would say it’s better not to fix something that isn’t broken.

I don’t expect 2) to be obvious, and indeed the performance is exactly the same when I tried to use sha256 in M4.

I was thinking: instead of adding an extra byte in order to access the account’s properties, and thus ending with 33-byte and 65-byte keys, how about dropping the last two bits of the hashed address (as in “secure trie” address) and using them as an item selector?

account_balance_addr = hash(account_addr)[0..253] ++ 0b00
account_nonce_addr = hash(account_addr)[0..253] ++ 0b01
account_code_chunk_addr = hash(account_addr)[0..253] ++ 0b10 ++ chunk_addr
storage_slot_addr = hash(account_addr)[0..253] ++ 0b11 ++ hash(storage_key)
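
A minimal Go sketch of this derivation for the two 32-byte cases (balance and nonce). The function name and the choice of overwriting the two least-significant bits of the final hash byte are assumptions of mine; the proposal only says the last two bits of the hashed address are replaced by an item selector.

package sketch

import "golang.org/x/crypto/sha3"

const (
	selBalance byte = 0b00 // account_balance_addr
	selNonce   byte = 0b01 // account_nonce_addr
	selCode    byte = 0b10 // account_code_chunk_addr (followed by chunk_addr)
	selStorage byte = 0b11 // storage_slot_addr (followed by hash(storage_key))
)

// itemKey hashes the account address and replaces the two
// least-significant bits of the last hash byte with the item selector.
func itemKey(accountAddr []byte, selector byte) [32]byte {
	h := sha3.NewLegacyKeccak256()
	h.Write(accountAddr)

	var key [32]byte
	copy(key[:], h.Sum(nil))
	key[31] = (key[31] &^ 0b11) | (selector & 0b11)
	return key
}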

2 bits are lost here, which doesn’t matter for this use case: hashing is intended to make sure that keys are evenly spread out in order to keep the trie balanced, and the further right a bit is, the less it contributes to that balancing. Furthermore, all these values will still be stored together in the DB, so this won’t have any impact on balance.

A quick check as of block ~8M doesn’t show any collision if the last 2 bits are chopped off. It is quite unlikely (a one in 2x10²⁸ chance) that there will be a collision.

It’s nice to be able to verify the state trie from within the EVM, which is currently cheaper to do with keccak256. Although maybe a shift to sha256 can come with the appropriate repricing measures to make it feasible to verify state roots on chain.

I suggest you use the same mapping as RSK uses, which has many benefits.

account_record = hash(account_addr)[0..79] ++ account_addr

This key->value mapping has the huge benefit that it stores the key inside the trie, and yet it randomizes the key position.
Also it’s faster, because you can generate 80 bits of Keccak hash digest faster than 256 bits.
Finally, it requires only 30 bytes for keys, 2 bytes less than the 32 bytes proposed here.
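
A rough Go version of this mapping, assuming Keccak256 as the hash and a 20-byte account address (the function name is mine):

package sketch

import "golang.org/x/crypto/sha3"

// accountRecordKey builds a 30-byte trie key: the first 80 bits (10 bytes)
// of the hashed address, followed by the full 20-byte address itself.
func accountRecordKey(accountAddr [20]byte) [30]byte {
	h := sha3.NewLegacyKeccak256()
	h.Write(accountAddr[:])
	digest := h.Sum(nil)

	var key [30]byte
	copy(key[:10], digest[:10])    // randomizes the key's position in the trie
	copy(key[10:], accountAddr[:]) // keeps the original key available inside the trie
	return key
}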

First, two desired properties for the binary tree structure and merkleization rule. Then, claims that none of M1, M2, M3, or M4 satisfies both of these desired properties.

Property: Access list-witness correspondence. For a given state tree, each access list uniquely determines a minimal witness. In particular, minimal witnesses are independent of block execution.

Property: Efficiently computable. The cost to merkleize a tree is less than 50n hashes in practical (i.e. average tree depth<50) cases, where n is the number of leaves to be merkleized.

Claim. Binary radix trees like M1, M2, and M3 don’t satisfy the access list-witness correspondence property.
Proof sketch: Consider an access list from a binary radix state tree. When a leaf in the access list is deleted from state after block execution, if the deleted leaf’s parent node has an extension (“prefix”, “edge label of length >1”) as its remaining child, then the extension is extended to preserve the radix property. Information about whether this remaining child is an extension is needed in a witness. Any leaves can be removed, and the access list does not specify which leaves are removed, so a witness must include this extra extension info for all hash nodes which may be extensions and may be extended due to leaf deletions. This extra data makes witnesses non-minimal. So M1, M2, and M3 don’t satisfy the access list-witness correspondence property.

Claim. Sparse trees like M4 don’t satisfy the efficiently computable property.
Proof sketch: In a sparse tree, a leaf requires 256 hashes to merkleize. And n leaves require ~256n hashes to merkleize. And ~256n>50n. So sparse trees are not efficiently computable.
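
To make the counting concrete, here is a Go sketch that merkleizes a single leaf in a depth-256 sparse tree: the loop performs exactly 256 hashes, one per level (the empty-subtree hashes are a one-time table). Keccak256, an all-zero empty leaf, and MSB-first key bits are assumptions here, and the leaf encoding is only a stand-in.

package sketch

import "golang.org/x/crypto/sha3"

func keccak(parts ...[]byte) []byte {
	h := sha3.NewLegacyKeccak256()
	for _, p := range parts {
		h.Write(p)
	}
	return h.Sum(nil)
}

// emptyHashes[d] is the hash of an empty subtree of height d.
var emptyHashes = func() [][]byte {
	hashes := make([][]byte, 257)
	hashes[0] = make([]byte, 32) // assumed all-zero empty leaf
	for d := 1; d <= 256; d++ {
		hashes[d] = keccak(hashes[d-1], hashes[d-1])
	}
	return hashes
}()

// singleLeafRoot merkleizes one (key, value) pair: 256 hashes, one per level.
func singleLeafRoot(key [32]byte, value []byte) []byte {
	node := keccak(value) // stand-in leaf encoding
	for i := 255; i >= 0; i-- {
		bit := (key[i/8] >> (7 - uint(i)%8)) & 1
		sibling := emptyHashes[255-i] // empty subtree alongside the path
		if bit == 0 {
			node = keccak(node, sibling)
		} else {
			node = keccak(sibling, node)
		}
	}
	return node
}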

Are there any proposals for a tree structure and merkleization rule which satisfy both of these desired properties?

I definitely agree that M4 does not have the efficiency property. In order to be efficient and secure, there will have to be some kind of cryptographic encoding of the tree structure within the tree itself.

I don’t quite understand why you are saying that M3, say, does not have the property that the access list determines a minimal witness.

In your definition of M3, you say that prefix_length is usually 0x00. This is a bit confusing to me. Are you defining prefix_length of an internal node A to be

  • The length of the prefix shared by all descendants of A which is not shared by all descendants of A’s parent, or
  • The length of the largest sequence of bits which is a prefix of all leaf keys that descend from A?

It seems like your definition of M3 uses the former, but I think the latter definition solves this minimal witness issue.

As a diagram, consider the following

 A
/ \
   \
    \
     B
    /  \
   C    D

In both versions, when leaf D is deleted, its parent B is deleted and its grandparent A replaces B with C as its (right side) direct child. But in the former version, C’s prefix_length must be updated to what was previously the sum of A’s prefix_length and C’s prefix_length, which requires not just the hash of C but the data of C itself.

In the second version, C still has the same set of leaf descendants, and A still has descendants on its left side and the descendants of C on its right, so the “largest prefix of all descendants” is unchanged for both. Since C is unaffected, we can simply slot its hash in where the hash of B used to be in A.


The word prefix is an abuse of notation. My use of prefix in M3 is consistent with the posts above it which use prefix to describe an extension – so your first definition, I think.

Your definition of prefix_length gives nicer properties for node deletion. But your M3-like rule is bad for proofs-of-exclusion, where it breaks the access list-witness correspondence property. And your M3-like rule always breaks the “two 32-byte chunks -> one 32-byte chunk” property.

A proposal by Vitalik may satisfy both properties above, but we would have to define the word “minimal” carefully.

A proposal for satisfying both properties above.

Aumasson’s Too Much Crypto is already making an impact in practical cryptography by motivating BLAKE3 to reduce the number of hashing rounds from ten to seven. Aumasson also concludes that SHA3 could use 10 rounds instead of 24 rounds without losing safety.

I propose that we consider using M4, but with Keccak reduced to 10 rounds. While this does not quite achieve the efficiently computable property, it may be fast enough for our purposes.

Furthermore, I propose that we evaluate reducing the number of rounds even further. Aumasson assumes a block chaining[1] mode of operation, but we only need one 136-byte block. Also, we use a tree mode of operation, with a sparse tree of depth 256 with likely ≥ 10 rounds before we get to the first branch node. So I suspect that we can reduce the number of rounds even further. Conjecture: we can use Keccak with one round in our 256-sparse-tree mode of operation without losing safety.

More speedups can come from (i) choosing a faster hash function and (ii) parallelization.

[1] The term “block chaining” is from cipher block chaining, which I suspect is the origin of the word “blockchain”.

Interesting! It does seem like standard hash functions do not fare very well in the “64 -> 32 bytes” use case, which binary trees depend on so heavily. Certainly could be worthwhile to lead and propose a candidate.

Conjecture: we can use Keccak with one round in our 256-sparse-tree mode of operation without losing safety.

I can see how this preserves preimage resistance, but it would break collision resistance, no? Need to define “safety” carefully here, as Merkle trees do require collision resistance!


It is not clear to me why this property is desirable enough to justify the outright discarding of M3/M5. Could you please clarify?

How is it bad? And how bad is it? :wink:

A proof of exclusion need only contain the hashes of the sibling nodes, which have to be present in the access list anyway in M4.

I implemented it last week and it works just fine.

@holiman suggested two weeks ago to use BLAKE2, I agree it’s a good idea.

Parallelism, however, is going to greatly increase the complexity both in memory usage and in locking. I’m pretty sure this complexity is going to offset all the simplicity gains from M4, and then some.

EDIT: I’m really no fan of reducing the number of rounds: this means maintaining our own version of a crypto library. Even assuming Vitalik’s concerns about collision resistance are addressed, does saving M4 justify this?

BLAKE2s seems to be the only popular hash function which is “64->32 bytes”. I am curious whether parallelized BLAKE2s with seven rounds would be fast enough for M4.

Agreed. Under statelessness, collision resistance breaks because a witness can provide arbitrary witness hashes, so an attacker can craft a witness hash which produces a collision in one round, and this bad hash could result in a bad post-state root, breaking the system. So the conjecture is false.

I guess you are saying that proofs-of-exclusion under the M3-like rule do not satisfy the access-list-witness property, because to prove exclusion one needs to descend the tree to a leaf and see the key there to know the prefixes.

Another way to fix this: you could just use M2, but switch to the “longest prefix of all descendant leaf keys” definition of prefix.

One would have to make clear exactly how the prefix was represented. For example, you could do

internal_node_hash = hash( hash(prefix_followed_by_0s || length_of_prefix) || hash( left_child_hash || right_child_hash ))
leaf_hash = hash( prefix_bits || hash( 0 || leaf_value ) )

Here, to prove exclusion of a key, one only has to descend the tree to the lowest node that would contain that key in its descendants if it were present, and verify that both children of that node do not contain the key.
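
For concreteness, here is a Go sketch of the two hash rules above. The byte-level layout is an assumption: the prefix bits are packed MSB-first into a fixed 32-byte field padded with zeros, the length is a single byte, and Keccak256 is used throughout.

package sketch

import "golang.org/x/crypto/sha3"

func keccak(parts ...[]byte) []byte {
	h := sha3.NewLegacyKeccak256()
	for _, p := range parts {
		h.Write(p)
	}
	return h.Sum(nil)
}

// internal_node_hash = hash( hash(prefix_followed_by_0s || length_of_prefix) || hash(left || right) )
func internalNodeHash(prefixPadded [32]byte, prefixLen byte, left, right []byte) []byte {
	prefixHash := keccak(prefixPadded[:], []byte{prefixLen})
	childrenHash := keccak(left, right)
	return keccak(prefixHash, childrenHash)
}

// leaf_hash = hash( prefix_bits || hash(0 || leaf_value) )
func leafHash(prefixBits []byte, leafValue []byte) []byte {
	inner := keccak([]byte{0}, leafValue)
	return keccak(prefixBits, inner)
}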

I’m not sure that I entirely understand what you mean by the access-list-witness property for proofs of exclusion though.

For reference, I have re-implemented the M4 method that I had lost because of my disk crash. This is the difference in hashing times between M4 and M5 (M5 being the same thing as M3, but without the prefix length in the leaf, in order to follow the “two 32-byte chunks -> one 32-byte chunk” rule).

[chart: bintrie_hash — hashing times, M4 vs M5]

Note that the Keccak implementations don’t use the Read method, which is not available for BLAKE2b, so there is some room for improvement here.

You suggested an M2-like rule with path bits. Path bits increase witness overhead (i.e. not minimal, although more minimal than M3) and add bit-twiddling complexity, which is undesirable. Also, your rule has 3x hashing overhead, which may exceed the efficiently computable property (unless you choose a fast hash function and replace “hashes” with “hashes relative to Keccak256 throughput”).

As you noted, in M3-like rules, a proof-of-exclusion requires a path to a descendant leaf to recover path bits. But such a leaf may not come from the access list and may have to be manually added to the witness. This adds witness bloat (breaking minimalness), and opens questions on how to make this unique (need more rules to get uniqueness, also undesirable).

I spoke with Guillaume directly. He says that M5 is here. M5 is a M3-like rule. M3-like rules have witness bloat and are awkward for canonicalness. M4-like rules are more desirable because they have no bloat (i.e. they are minimal) and they have no awkwardness (i.e. “cleanness” as Vitalik said on Discord). The problem with M4 is that it requires ~10x more hashes than M1/M2/M3/M5-like rules, so we need a way to remove hashing overhead to make M4 practical.

BLAKE2b has 128-byte blocks, but we only hash 64 bytes, so BLAKE2b is not ideal. Based on your numbers, I suspect that hashing is not your bottleneck, since BLAKE2b should be 2.5x faster than KECCAK for 64-byte inputs (6.05 vs 14.53 cycles/byte).

Here is a M4-like rule whose bottleneck is hashing.

M4_hash_bottleneck. Sparse tree structure with the following merkleization rule at each node:

node_empty_subtree = 0x0000000000000000000000000000000000000000000000000000000000000000 (i.e. 32-bytes of zeros)
node_nonempty_subtree = hash(left_child_hash || right_child_hash)
leaf_node = hash(leaf_data_64_bytes)

where node_empty_subtree is a node whose subtree includes no leaves in state, and node_nonempty_subtree is a node whose subtree includes a leaf in state.
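
A sketch of this rule in Go, merkleizing a set of leaves (with unique keys) sorted by key and recursing only into non-empty subtrees. The recursion strategy is my own framing, and BLAKE2s is used here only as a stand-in 64 -> 32-byte hash; the rule itself is hash-agnostic.

package sketch

import (
	"sort"

	"golang.org/x/crypto/blake2s"
)

type leaf struct {
	key  [32]byte // 256-bit path through the sparse tree
	data [64]byte // leaf_data_64_bytes
}

var emptySubtree [32]byte // node_empty_subtree: 32 bytes of zeros

func keyBit(key [32]byte, depth int) byte {
	return (key[depth/8] >> (7 - uint(depth)%8)) & 1
}

// merkleize expects leaves sorted by key and returns the root of the
// subtree rooted at the given depth (call with depth 0 for the state root).
func merkleize(leaves []leaf, depth int) [32]byte {
	if len(leaves) == 0 {
		return emptySubtree
	}
	if depth == 256 {
		return blake2s.Sum256(leaves[0].data[:]) // leaf_node
	}
	// Split the sorted leaves between the 0-branch and the 1-branch at this depth.
	split := sort.Search(len(leaves), func(i int) bool { return keyBit(leaves[i].key, depth) == 1 })
	left := merkleize(leaves[:split], depth+1)
	right := merkleize(leaves[split:], depth+1)
	return blake2s.Sum256(append(left[:], right[:]...)) // node_nonempty_subtree
}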

Rule M4_hash_bottleneck with Keccak is not efficiently computable until we remove hashing overhead.

Consider BLAKE3 (i.e. BLAKE2s with 7 rounds instead of 10). BLAKE3 is 5.2x faster than KECCAK over 64 bytes (2.78 vs 14.53 cycles/byte). M4_hash_bottleneck with BLAKE3 satisfies the efficiently computable property (if we replace “hashes” with “hashes relative to Keccak256 throughput”, since we can do 5.2 BLAKE3 hashes in the time of one Keccak256 hash). We can accelerate M4_hash_bottleneck further by vectorizing multiple paths to get another 4x speedup, and accelerate further with multi-threading, or maybe even GPU if needed. Any feedback on M4_hash_bottleneck with BLAKE3?

EDIT: I had previously posted measurements that were only 2.7x faster. I have since merged the tries used in the M4 and M5 tests, which gives better results for the M4 method, as shown below.

Good catch, there was an extra allocation issue in the code, the speedup is indeed ~3x. The BLAKE2 code is already using the SIMD instructions available on my machine, which is why it’s exceeding the expected 2.5x speedup.

[chart: bintrie_hash — updated hashing times, M4 vs M5]

We are still looking at a 20x slowdown compared to Keccak M5, and a 40x slowdown compared to BLAKE2 M5.


Late to the party here, apologies if I’ve missed something while reading through the conversation above. So as I understand it, the discussion now is M3 versus M4, where the latter does not have extension nodes, but instead branches off at every bit in the key.
@gballet has already talked about the efficiency loss when hashing the trie, which is obviously a big issue.

However, it seems to me that such a design also severely hampers the implementation of trie lookups.

A trie-based lookup of a path leading to a leaf would essentially have to resolve 256 nodes to reach the destination, instead of maybe 4-5 (depending on the level of trie saturation). This is an insane amount of overhead.

And these types of lookups are not uncommon:

  • A classic ‘fast-sync’ is basically a network-backed trie-iteration
  • Geth’s in-memory trie pruning depends on being able to keep trie nodes for the last 128+ blocks in memory, and do reference counting. If we explode the number of trie nodes with an order of magnitude, this becomes a severe problem.
  • Even with a snapshot-backed state-read, we still need to resolve the leaves when performing writes, potentially from disk.
  • Geth has several caches, for ‘clean’ trie-nodes and ‘dirty’ trie-nodes. If we explode the number of nodes, this makes those caches pretty useless, and forces orders of magnitude more disk accesses.

So as I see it, it’s not just that “if we branch at every bit, we need to solve the hashing problem”, the problem appears to me to be a lot bigger than that. Apologies if I’ve misunderstood something in the discussion.