Cheap hash functions for zkSNARK merkle-tree proofs, which can be calculated on-chain

accumulators

#1

Recently I have been working a problem of with a difficult set of constraints, and I would welcome any input or insight into the problem.

I need to build a Merkle-tree on-chain, which later on I can prove membership of using a zkSNARK proof, this means I would be able to do a zero-knowledge proof of membership and other fun things.

The problem is: using SHA2-256 (or keccak256) is very expensive to prove inside a zkSNARK circuit, especially so when you need to prove 29 of them for a merkle tree path, and even moreso when you’re trying to make it run in reasonable time in WebAssembly (which is about 42x slower).

Anyway, this lead me to polynomial hash functions over GF(p), UOWHFs etc. which can be quite cheap to evaluate in EVM. e.g. the Hash127 function can compute 32 bits per loop iteration, requiring a MODEXP and MULMOD instruction each iteration (or just a MULMOD, is using precomputation).

However, many of the algorithms I’ve found are either ciphers or MACs which require a key, and in some cases, with the key, a collision can be found in linear time. e.g. with output f(?) = x I can find any number of inputs y where f(y) = x. This is fine if the input, say the leaf of the Merkle tree, is secured in another way, e.g. f(H(w)) = x requires you to find a collision where the input to f matches the output of another more secure function H, but chaining f together to authenticate a Merkle tree path is equivalent in security to a single f until you get to the leaf and need to find a collision between both H and f. e.g. the collision resistance is finding a specific input to the function, not a specific output.

Here are some interesting papers which discuss the topic in more detail, but they make no claims about preimage resistance or their suitability as cryptographic hash functions:

This seems to be a widely studied topic, and finding a few candidate functions with strong security guarantees which can also be computed cheaply in EVM would significantly reduce the complexity of zkSNARK circuits - making them provable on mobile devices and the web.

Potential functions:

Any ideas?


#2

This is an interesting metric, and points towards the limits of WASM. A 42x speed boost may be enough to justify a precompile for some basic building blocks in Ethereum 2.0.

Another side note is that in Ethereum 2.0 we want to use a STARK-friendly (and ideally SNARK-friendly) hash function at the protocol layer for things like the hash chain, Merkle trees, proofs of custody, etc.

I expect this question to become increasingly relevant. StarkWare (which received a large grant from the EF) is in the process of preparing a report on STARK-friendly hash functions. I suggest you look at their STARKs paper for the time being which has some discussion and benchmarks. (See the Davies–Meyer hash based on the Rijndael block cipher on page 15, the tables on pages 16/17, and appendices E and F.) Eli Ben Sasson is probably the most relevant person to talk to from StarkWare.

MiMC is one of the candidates that Zcash has looked into quite some detail. See for example this GitHub issue. Daira is leading the MiMC effort with Zcash.

Another suggestion is to contact Dmitry Khovratovich who has looked at this question quite deeply. He is an expert on hash functions generally, and SNARK/STARK-friendly hashes specifically.


#3

What’s the basis for this benchmark? is it a wasm interpreter or a wasm JIT engine?


#4

Using Node 8.11, which presumably uses a JIT under the hood.

It wasn’t by any means an extensive benchmark, but it compares a small handful of libsnark test programs from ethsnarks-emscripten built with Emscripten and default compiler settings for the project.

I’m sure there may be a way to tweak compiler flags etc., but I’m doubtful that it could be reduced to even 10x native, e.g. the Cloudflare bn256 optimisations make 10x-20x improvements versus ‘slow native 64bit code’, and wasm is stuck with 32bit arithmetic and no access to native instructions or inline assembly.


#5

Cool, I glanced at the repo but its not clear what program you are benchmarking. If you add a bit of documentation we’d be interested in incorporating this into ewasm benchmarks (where we’ll compare performance of various wasm engines).

But I’m confused what you mean by “calculate on-chain”. Are you more concerned with efficiently generating the snark proof, or verifying the snark proof? It sounds like you are talking about generating the proof, but I don’t see why you’d want to do that in EVM (i.e. why you’d want to meter the proof generation).


#6

I will add some benchmarks and more documentation, but I was using test_hashpreimage.cpp and test_longsightf.cpp as the benchmarks, they generate a key pair (proof key & verifcation key), then proves the circuit using the proving key, then verifies the proof using the verification key. After building the ethsnarks-emscripten run:

node ethsnarks/bin/test_longsightf.js         # takes about 35s
node ethsnarks/bin/test_hashpreimage.js  # takes much longer

It outputs debug information about how much time each step takes, e.g. verification of a zkSNARK proof takes ~0.75s, and LongsightF takes ~20s to prove (with 322 rounds), the timing can be compared to the native build of ethsnarks by building it separately and running the same programs - this gives me a direct comparison between native and wasm.

I’ve uploaded the .js and .wasm files to: https://github.com/HarryR/ethsnarks/releases

I am building a merkle-tree on-chain that anybody can append items to. The hash algorithm used to build the merkle-tree must be computable on-chain in Solidity / EVM. Then proof of membership of that merkle tree is proved using a zkSNARK and verified on-chain.


#7

So, my research continues - with the Davies–Meyer construct, with an invertible function E it’s easy to find a fixed point where you control both the key and the message, likewise with the Matyas–Meyer–Oseas construct which is essentially the same but with the key and the message switched. But… there are better constructs where the computational hardness of the hash function doesn’t rely upon the computational hardness of the underlying cipher…

When operating over a Galois field any exponentiation becomes a ‘hard problem’, and it’s easy to construct a function where the subset sum problems difficulty is directly translatable, but what if we want to do better than that?

My argument is that 90% of ‘applied cryptography’ tries to make it fast in the GF(2) domain, but with zkSNARKS any optimisation for binary domains will slow things down 100x or even 200x. And even with hash functions we rely on a bit-by-bit sampling of the input as a ‘safe’ mechanism - which if you have a true UOWHF doesn’t matter - but abiding by the rules of binary computers you add far too much complexity when real prime fields are as cheap.

@JustinDrake can you PM or e-mail me the contacts you have for Eli Ben Sasson and Dmitry Khovratovich, I think it’s relevant that I drop them an email with my findings and to generally get in touch etc.


#8

I don’t understand your concern about Davies-Meyer. One could easily find a fixed point for each Merkle node, but is that really a problem? If users are able to put arbitrary data in the leaves, then they could make a leaf’s parent have the same state as the leaf. But normally leaves are hashes of user-chosen data, and as long as the hash is preimage resistant, then finding data which hashes to the fixed point will be computationally hard.

FWIW Katz et al. also use Davies-Meyer + LowMC to do Merkle proofs for ring signatures.