Is the STARK proof generation time given above (1.5-3s) for the same parameters as this? If so, that seems very weird. Why is go <2x faster than python for STARK generation but 4x faster for MIMC computation?

The Forward MiMC timings [1.85ms in C, 6.11ms in Go, 13.40ms in Python] are just the computations of mimc(3, 2**LOGSTEPS, constants) you have in your proof generation [just step (2a) in the diagram] and your verification [step (v1) in the diagram], with the same parameters (same 8192 steps, same 64 constants, etc.) … just in different languages, with their various bigint libraries. More cores doesn’t speed up Forward MiMC or Reverse MiMC, of course.

If you replace step (2a)'s Forward MiMC in Go here with Reverse MiMC then the STARK proof generation will take ~272.69ms (278.8ms-6.11ms) longer [1.42s will become 1.69s]., because (2a) is on the longest path.

If you replace step (2a)'s Forward MiMC in Python here with Reverse MiMC then the STARK proof generation will take ~1278ms (1,291.4ms-13.40ms) longer [that is, 3.78s becomes 5.05s].

Does that make sense?

For MiMC computation, the idiosyncrasies bigInt implementations (with memory setup/…) in each language obviously dominate, because that’s pretty much the only thing going on. For STARK generation, the computations of the longest path use bigInt everywhere in the pipeline, but because you can parallelize almost all the steps in the pipeline, there is no heuristic like “its X times faster when you use L language” – what the level of improvements ends up at will depend on the structure of the pipeline and the amount of resources allocated to it.

Succinct prototyping in Python (because it handles bigInts and vectors so well) and putting things into production in a C library with a Go/Rust/… wrappers is the sane thing to do for VDFs + STARKs.

The goal here is to be slow in the forward direction as that’s the direction the VDF is computed, and fast in the backward direction for proving/verification. So symmetric execution time is not desired.

Worth noting that the overall size of the proof generated by this code can be shrunk significantly by changing the Merkle root to root + n levels of branches and generating the proofs up to the highest branch than down to the root each time.

The MIMC stark code already does some Merkle proof compression via a separate compression algorithm that detects duplicate hashes and replaces them with a symbol. That said, I’m sure it’s suboptimal and there’s better ways to do the same thing, especially since any top levels of the tree that are fully covered by merkle branches can be left out entirely. I feel like optimal Merkle proof batching is a problem that someone somewhere has already come up with a neat clever algorithm for… something like “start with all the nodes for all branches, including both the branch and the sisters, and then repeatedly detect and remove any nodes for which both children are either in the proof or have been already been removed due to being calculable”.

Some sort of “proof tree”, if my mental image of the resulting structure is correct. Interesting idea but not sure how much of a saving it would be over a simple pollard, which wouldn’t require additional pointers to link the common branches of proofs together. Might have a play to see if there are additional space savings to be made.

Some years ago I developed an algorithm for the merkle tree traversal problem which asks the question which nodes to store of a merkle tree between generating authentication paths for leaf(i) & leaf(i+1) and which nodes to recalculate on demand. The goal is to have a good/optimal trade-off between required memory and required recomputation and it should be configurable. This is used for signature generation in merkle signature schemes. One component repeatedly used is the so called TreeHash algorithm which allows to calculate a node from its leaf requiring O(N) hashing and O(log(N)) memory (excluding the leafs (these are calculated in all merkle signature schemes instead of stored)). Where N is the number of leafs. Its a straight forward algorithm and probably used a lot outside of merkle signatures schemes as well. I think that by cleverly ordering the Nodes of your output this algorithm could be adapted to reduce the memory footprint and increase the performance of the verify process by recalculating the root from the proofs, proofing all proofs at once instead of calculating all the missing nodes and then proof each leaf individually. If I find time I will do an implementation of it

First compliments on the three-part Stark explainer, great work.

I finally came around downloading the python code to look at its inner workings and I think I found a weak spot.
At every stage of the recursion you sample the rows that are checked based on the merkle root of the values of the column. Because we sample only 40 rows, we can get away with quite some changes in the column without getting caught. For example, if we change 10% of the numbers in the column and do 40 samples, we have a chance of 1.4% of not getting caught. We can easily try multiple variants of the column to beat the odds.
When we start with a high degree polynomial and have 10 levels of recursion, it is possible to change all points in such a way that they are all on the same low degree polynomial once we reach the final check.

To make this attack harder we could sample the rows of all levels based on the last merkle root. In that case we have a chance in the order of 10^{-19} which is much better. Turning it into an interactive proof via the blockchain is also possible, but might be less practical in some cases.

To avoid this attack altogether and also make the proof smaller, we should follow the sampled points from the base layer all the way to the final polynomial. So instead of sampling just one value on the column per check, we should sample the 3 sibling values as well, to use in the next recursion step. Skipping the second lookup per merkle tree can reduce the total size of the proof significantly.
By following these “point-traces” all the way, we make sure they have not been tampered with. You could still try to tamper with the sibling points, but this is not useful without knowing the column that will be chosen next.

@vbuterin thanks for the code; finally had a chance to look at it and think about how it compares with pollarding.

Given that the chance of removing any given hash is related to the ratio of proofs to data values (plus level in the tree, of course) it’s no surprise the more proofs there are the better the overall reduction. At relatively low levels (1:256, or 64:16384 as was actually tested) there is little difference and depending on the encoding method for the data it can be less efficient than pollarding (as the empty placeholders still take up some space). But stepping this up to higher ratios brings in higher efficiencies, for example at 1:32 (512:16384) it is about 30% more efficient than pollarding (which is itself 60% more efficient than simple proofs). A nice increase for many real-world cases.

An interesting difference is that both generation and verification of proofs require all of the individual proofs to be present; with the former this maximises the gain and with the latter it guarantees all values are in place to enable verification. This makes pollarding potentially better for interactive proofs where the pollard can be sent once and subsequent proofs can refer to it.

The ultimate solution would be to have a progressive interactive version of your system where both sides remember the proofs that have been sent to date and so over time proofs sent can be more and more efficient as they require fewer new hashes to verify. Probably not something that would be a common requirement, however (and it comes with its own trade-offs).

If you don’t mind I’d like to write up the solution (thinking of calling it “spare tree proofs” unless I can come up with something catchier) as a companion piece to the one I wrote on pollarding.

I agree with @daria that Reverse MiMC sounds problematic: As an example, if you specifically want the VDF for collaborative randomness, then an adversary could use the faster reverse computation. You could easily fix this with another hash function anywhere in the process, but if you want VDF proofs to be owned to reward the evaluators then the fast reverse evaluation still breaks that, and you cannot fix that so easily.

Following the values that are sampled at the base level (as I described at the end of my previous post), instead of re-sampling at every step of the recursion, could unlock some extra optimizations to shrink the proof further:

Every merkle tree is sampled only once and we always collect four values. So we could put these four values together in the leaf of the tree. This would lower the depth of the tree by two levels and save some hashes.

Increasing the degree of the polynomial on the rows from 4 to 8 or even 16 does not come at a high cost and could save a level of recursion. The number of values per level will increase, but the merkle part would shrink. Especially for levels with deep trees the trade offs will be favorable.

At the final level of the recursion, we now publish all 256 values of the column to be able to verify the values of the previous level and check whether these values are on the same low degree (max 16) polynomial.
I don’t think we need to publish these 256 values at all. The verifier could just calculate the 40 values that should be in that column based on the values from the previous level and check whether these 40 values are on a polynomial with a max degree of 16.
Because we are sampling 40 values we might be able to stop earlier and check for a polynomial of degree 32.
For this to work we need to make sure that we will end up with enough distinct values at the last level of the recursion, so we have to adjust the sampling process a little.

Increasing the number of values at the base layer and lowering the number of samples, could shrink the size of the proof while keeping the same level of security. This is off course more computational intensive, but could have a positive trade off.
Note that this is not possible for the original version, because the attack described in my previous post would be easier, so it would lower the security of the proof.

@vbuterin - quick question: in your tutorial (STARKs, Part 3: Into the Weeds), to reduce the size of FRI proofs, you merge P(x), B(x), and D(x) polynomials into a single polynomial using random linear combination of P, P^{steps}, B, B^{steps}, and D like this:

LC = k_1 * P + k_2 * P^{steps} + k_3 * B + k_4 * B^{steps} + D

Is there a reason we can’t skip P and B parts and compute the combination like this?:

Ah yes, this is a very subtle point. P and B are degree < n polynomials, and D is a degree < 2n polynomial. Hence for a degree < 2n check to properly check the degree of both polynomials, we need to multiply P and B by x^n. However, if we do just that, then a value like P(x) = \frac{1}{x^n} would also pass, which is not what we want, and so we need to do a linear combination P'(x) = P(x) * k1 + P(x) * x^n * k2, which does successfully ensure that if deg(P) \ge n then deg(P') \ge 2n.

A special piece of hardware can be created that can quite efficiently compute the inverse cube root problem. This will then give time to fulfill the possibilities of many more hacking options.
I will give an example, there is an algorithm RandomX that fights with the ability to create special hardware, the mimc approach where the cube function is used does not fight the problem of creating special hardware and does not oppose it in any way.

Very nice tutorial! I have been playing around the tutorial, and I found it is very easy to understand with minimum but useful application. I am thinking one further application is to use MIMC and STARK to prove the storage of a large dataset, that is dynamic (rather than static in Filecoin setup).

A few questions:

If we want to use MIMC on encoding data in \mathbb{F}_q^k, k > 1 in similar sequential way, how could we extend current k = 1 version?

Do you know any progress on any GPU implementation of MIMC that is optimized for throughput?