A nearly-trivial-on-zero-inputs 32-bytes-long collision-resistant hash function

The Libra whitepaper released yesterday also suggests a strategy for avoiding the need to compute 256 hashes per Merkle proof in a sparse Merkle tree: subtrees consisting of exactly one leaf are replaced with a single node.

[diagram: a single-leaf subtree collapsed into a single node]

This arguably does increase the complexity of the sparse Merkle tree interface; their implementation is 460 lines of code.

This approach is also discussed in Compact Sparse Merkle Tree.

The proposed approach has two suboptimalities:

  1. It still computes the hash every 16 steps.
  2. The hash technically only uses 30 bytes, not 32.

How about the following construction instead:

For each node we compute a tuple (H, A); at the root (assuming the root always has both children) A is discarded. Below, << means cyclic shift. (A sketch in Python follows the list.)

  1. If l \ne 0 \land r \ne 0 then (H, A) = (sha256(H(l) << A(l), H(r) << A(r)), 1)
  2. If l = r = 0 then (H, A) = (0, 1)
  3. If l = 0 then (H, A) = (H(r), 2A(r))
  4. If r = 0 then (H, A) = (H(l), 2A(l) + 1)
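
A minimal sketch of these four rules, assuming hashes are handled as 256-bit integers and that the cyclic shift by A is taken mod 256 (a cyclic shift of a 256-bit word only matters modulo the word size); the names merge and rotl256 are illustrative, not part of the proposal:

```python
import hashlib

MASK256 = (1 << 256) - 1

def rotl256(x: int, n: int) -> int:
    """Cyclically left-shift a 256-bit integer by n bits (n taken mod 256)."""
    n %= 256
    return ((x << n) | (x >> (256 - n))) & MASK256

def merge(l, r):
    """(H, A) of a node from its children's (H, A) tuples.
    Leaves enter as (leaf_hash, 1); an empty subtree has H == 0."""
    (hl, al), (hr, ar) = l, r
    if hl == 0 and hr == 0:                        # rule 2: both children empty
        return (0, 1)
    if hl == 0:                                    # rule 3: chain continues on the right
        return (hr, 2 * ar)
    if hr == 0:                                    # rule 4: chain continues on the left
        return (hl, 2 * al + 1)
    data = (rotl256(hl, al).to_bytes(32, "big")    # rule 1: the only "real" hash
            + rotl256(hr, ar).to_bytes(32, "big"))
    return (int.from_bytes(hashlib.sha256(data).digest(), "big"), 1)
```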

It only produces shareable hashes (A = 1, and thus A can be discarded) for nodes that have both children, which is generally the case for the root, I assume; and we usually don't share hashes that are not the root of the tree?

A is effectively a bitmask of the left and right turns taken going up a chain of single children. Such a mask naturally cannot exceed 256 bits (we take at most 255 steps as we go up to the root, plus the leading 1), so shifting by A is always possible.

Each hash that we feed into sha256 in rule (1) is the last computed sha256 output, cyclically shifted by such a bitmask. For a collision to occur here, two sha256 outputs would need to be cyclic shifts of each other, which appears unlikely.

That has the issue that your hash is effectively 64 bytes, so it doesn't fit neatly into the format of a 32-byte Merkle tree. That's roughly the direction I was going for before I figured this newer approach out :laughing:

I don't think reducing hash security to 2**120 (or even 2**112 if we want to cut the hash overhead further from 16 to 8) is a particularly big deal. The main thing I dislike about this scheme is just the fact that it requires using a "different hash function", so it doesn't compose cleanly with e.g. other SSZ structures.

Let's look at matrix operations as the basis of a 'hash function', then.

You want to verify that H(x,0) != H(0,x), or… do you want that to be true? Suppose the hash function retains the associative and commutative properties of some underlying field.

Using matrix operations we can construct a function which returns the same result regardless of which side the input is on: let Ax be a matrix-vector product, where A is some uniformly randomly chosen matrix over a field.

Then you have x and y, which represent x=(a,0) and y=(0,a), and you want Ax = Ay… This means that the two inputs may be independent, but the sum of the two inputs equals the output.

Let's simplify this first and provide a clearer example:

a(0) + b(x) = a(x) + b(0)

Which implies a and b are identical, where the sum evaluated at the two points is also equal for either. Where a(0) = b(0) = 0, right, and a(x) = b(x) = x?

So… let's take this back to matrix operations, where we want to directly translate the above statements into a problem which is 'hard to solve', and hopefully cryptographically secure…

Let's say x and y are our two inputs, where x is left and y is right. We need a matrix where the vector-product relation exists such that 0 || x = y || 0 = x || 0 = 0 || y.

So, A\cdot({\vec{x}+\vec{0}}) = A\cdot({\vec{0}+\vec{x}})

This preserves the commutative relationship, albeit over a matrix and a vector, which is kinda what we need… Whether the matrix A is a Cauchy matrix, an MDS matrix, or pretty much any other kind of matrix, we retain the commutative and associative laws when operating over a finite field.

Take, for example, the following Sage code:

```python
from random import randint

q = 1244142437461793964053
n = m = 123  # for example
G = GF(q)
A = MatrixSpace(G, n, m).random_element()  # uniformly random matrix over G
RandomBinaryVector = lambda length: vector(G, [randint(0, 1) for _ in range(length)])
x, y = RandomBinaryVector(n), RandomBinaryVector(n)
zero = vector(G, [0] * n)

# Adding the zero vector on either side leaves the product unchanged:
assert A*(zero + x) == A*x
assert A*(y + zero) == A*y
```

However, the problem is… how do we make this not only cryptographically secure, but efficient…

At the moment it is neither…

With a hash function you would normally want to preserve the order of the inputs, so H(a,b) != H(b,a). But with a function which is cryptographically secure and yet preserves the commutative and/or associative relationships, you're seeking a primitive operation which isn't widely used. Not only is it not widely used: my personal analogy is that you might as well be juggling chainsaws when you're dealing with polynomials and any hope of 'security', especially when all commutative and associative relationships are preserved.

You can truncate a matrix result down to a single field element, which gives you an n\times m \to n \to 1 reduction, where the last step takes the first element of the result vector, and the previous step is a sum of some kind…
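
Continuing the Sage snippet above, a toy sketch of that reduction (toy_compress is a hypothetical name, and taking the first coordinate is one arbitrary choice of truncation, not a vetted construction):

```python
def toy_compress(left, right):
    # n x m matrix times a length-n vector -> length-n vector -> 1 field element
    return (A * (left + right))[0]

h = toy_compress(x, y)
# Order-independence falls straight out of commutativity of vector addition,
# which is exactly what we do NOT want from a Merkle hash:
assert toy_compress(x, y) == toy_compress(y, x)
```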

But… surely the whole 'security argument' about Poseidon and newer polynomial/matrix functions is that they disrupt the associative and commutative relationships, in the name of security.

I definitely want H(a, b) != H(b, a), with it being computationally infeasible to find exceptions.

Another simple method to optimize SMT without security loss.

The hash merge function:

  1. if L == 0, return R
  2. if R == 0, return L
  3. otherwise sha256(L, R)

This function has one problem: merge(L, 0) == merge(0, L), which makes it easy to construct the same Merkle root from different sets of leaves.

To fix this, instead of using hash(value), we compute leaf_hash = hash(key, value) as the leaf's value to merge with its sibling (see the sketch below).

(We can store leaf_hash -> value in a map to index back to the original value.)
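
A minimal sketch of this scheme, assuming 32-byte sha256 digests and an all-zero marker for empty subtrees; the function names follow the description above, but see the linked repository for the real implementation:

```python
import hashlib

ZERO = b"\x00" * 32  # marker for an empty subtree

def merge(left: bytes, right: bytes) -> bytes:
    if left == ZERO:          # step 1: skip the empty side
        return right
    if right == ZERO:         # step 2: skip the empty side
        return left
    return hashlib.sha256(left + right).digest()  # step 3: a real hash

def leaf_hash(key: bytes, value: bytes) -> bytes:
    # Hashing the key in makes every non-empty leaf unique, so
    # merge(L, 0) == merge(0, L) can no longer collide distinct trees.
    return hashlib.sha256(key + value).digest()
```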

Proof:

Since the key is hashed into the leaf_hash, the leaves' hashes are unique no matter what the values are. Since all leaves have unique hashes, a node at each height is either merged from two different hashes or from a hash and a zero (for a non-zero parent, since the child hash is unique, the parent is surely unique at its height). Continuing up to the root: if the tree is empty we get zero; if the tree is not empty, the root is merged from two hashes or from a hash and a zero, so it is still unique.

more details: https://github.com/jjyr/sparse-merkle-tree

This looks like it would greatly increase the size of the pre-computed table of hashes of empty subtrees. Not great for clients, a deal-breaker for on-chain proof verification.

> This looks like it would greatly increase the size of the pre-computed table of hashes of empty subtrees. Not great for clients, a deal-breaker for on-chain proof verification.

Which proposal are you referring to by "this"? AFAIK my own proposal above does not require any pre-computed tables.

Oh, weird, I may not have replied to the right comment. I meant the proposal immediately above my reply.

The probability you go into step 3 is 1/2 + 1/(2**16). Why do you claim that this method only requires about 1 + d/16 "real" hashes?

It is 1/2 + 1/(2**16) because there's a 1/2 probability for r or l to be larger than 2**255, and another 1/(2**16) probability for r or l to be less than 2**240.

> there's a 1/2 probability for r or l to be larger than 2**255

Why would this be the case? Node values are not randomly distributed in the case where you have a single-leaf branch going up, as the node values going up the branch follow the pattern of "start at some value under 2^{241}, add 1 bit per round for 15 rounds, then hash and mod back under 2^{241}".
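
A toy illustration of why that pattern is far from uniform (this is only my reading of the quoted pattern, with hypothetical names, not the exact scheme): each step up the single-leaf branch appends one turn bit, so after 15 rounds the high bits of the value are still determined by the starting value rather than by chance.

```python
import hashlib

CAP = 1 << 241  # per the quoted pattern, hashed values are reduced below 2**241

def next_value(v: int, turn: int) -> int:
    # One step up the single-leaf branch: the value grows by one bit.
    return (v << 1) | turn

v = 12345  # some starting value under 2**241
for turn in [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0]:  # 15 rounds
    v = next_value(v, turn)
# v < 2**256, but its top bits come from the starting value, so
# Pr[v > 2**255] is not 1/2. Only now do we hash and reduce:
v = int.from_bytes(hashlib.sha256(v.to_bytes(32, "big")).digest(), "big") % CAP
```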

Uh! I understand your idea! It's interesting! Thank you!