You can parallelize the prover very heavily as well; FFTs, Merkle tree generation and FRI are all highly parallelizable.
Hashbased VDFs, MIMC and STARKs
I streamlined the Go STARK implementation with goroutines, channels + big.Int optimizations and on an ordinary laptop (2014 MacBook Pro 2.2 GHz Intel Core i7 with 8 “logical” cores) got to this basic level of temporal performance:
NUM_CORES 
STARK Proof Generation  STARK Verification 

1  3.12s  46.11ms 
2  2.01s  24.70ms 
4  1.49s  15.64ms 
8  1.42s  16.58ms 
16  1.49s  18.30ms 
32  1.48s  20.18ms 
ethereum/research python  3.78s  52.10ms 
Wow! So python actually is not all that suboptimal for this stuff.
Can you also give the numbers for MIMC calculation, and reverse MIMC calculation (replace x := x**3+p
with x:= (xp)**((2*p1)//3)
mod p)?
Also, any idea why adding more cores doesn’t improve things beyond 1.4s? Is there some part of the computation you’re not parallelizing? Merkle tree calculation, FFTs, and FRI computation should all be quite parallelizable.
Yes  will report back on this shortly.
The program flow has to actually use the cores available to it to get improvements, with goroutines/channels/waitgroups/… structured in a pipeline matching the cores nicely – so if you have a machine with 64 cores with only 3 go routines running computation (1), (2) and (3) in parallel, you aren’t going to do much better than with just 3 cores. On the other hand, if you have a machine with 8 cores, having 128 goroutines running through 128 pieces of a for loop will have worse performance than if you just have 8 goroutines. To minimize the time of STARK proof generation, its essential to take the computational dependency graph, break it into stages, and focus on the key bottlenecks of the longest route. This is what the pipelined version of your STARK looks like in my Go reimplementation, with sample times attached:
The longest route is currently (2a+b+c) => (4a+b) =>(7) => (8) => (9) => (11) => (12) (shown in red). Inside the longest route are the time consuming FFTs + inverse FFTs (in step (2b)+(2c)], multi_interp_4 (in step (11)) – I did make a first pass at parallelizing these internal operations, but I’m certain we can do a lot more.
Here is the verification part, much easier to reason about:
Most of the time, breaking up a for loop into N goroutines [as in (v4)] is totally worth it (for some N, but less worth it for 2N), but sometimes its not (because the overhead of the goroutine is just too high) – so it makes for a lot of empirical engineering. With gradient descent in the parameter space, we can have confidence there is another 2540% speedup to be squeezed out of the proof timing. We’ll try out beefy machines with 2432 cores for comparison this month and report back.
Why can’t the FFT be massively parallelized? There’s no reason for parallelization to be between functions and not within one function.
No debate that FFTs should be parallelized. Here are the basic stats on Forward + Reverse MiMC in C, Go, Python, JS (code here):
Implementation  Forward MiMC  Reverse MiMC 

C  mimc.c

1.85ms  122.7ms 
Go  mimc_test.go

6.11ms  278.8ms 
Python  mimc.py

13.40ms  1,291.4ms 
Node.js  mimc.js

108.29ms  13,910.1ms 
[Stats taken from an 2015 iMac 3.2 GHz Intel Core i5]
Is the STARK proof generation time given above (1.53s) for the same parameters as this? If so, that seems very weird. Why is go <2x faster than python for STARK generation but 4x faster for MIMC computation?
The Forward MiMC timings [1.85ms in C, 6.11ms in Go, 13.40ms in Python] are just the computations of mimc(3, 2**LOGSTEPS, constants)
you have in your proof generation [just step (2a) in the diagram] and your verification [step (v1) in the diagram], with the same parameters (same 8192 steps, same 64 constants, etc.) … just in different languages, with their various bigint libraries. More cores doesn’t speed up Forward MiMC or Reverse MiMC, of course.
If you replace step (2a)'s Forward MiMC in Go here with Reverse MiMC then the STARK proof generation will take ~272.69ms (278.8ms6.11ms) longer [1.42s will become 1.69s]., because (2a) is on the longest path.
If you replace step (2a)'s Forward MiMC in Python here with Reverse MiMC then the STARK proof generation will take ~1278ms (1,291.4ms13.40ms) longer [that is, 3.78s becomes 5.05s].
Does that make sense?
For MiMC computation, the idiosyncrasies bigInt implementations (with memory setup/…) in each language obviously dominate, because that’s pretty much the only thing going on. For STARK generation, the computations of the longest path use bigInt everywhere in the pipeline, but because you can parallelize almost all the steps in the pipeline, there is no heuristic like “its X times faster when you use L language” – what the level of improvements ends up at will depend on the structure of the pipeline and the amount of resources allocated to it.
Succinct prototyping in Python (because it handles bigInts and vectors so well) and putting things into production in a C library with a Go/Rust/… wrappers is the sane thing to do for VDFs + STARKs.
You probably don’t want to use Reverse MiMC; it’s very slow. Instead use MiMC in Feistel mode, so that decryption is as fast as encryption.
The goal here is to be slow in the forward direction as that’s the direction the VDF is computed, and fast in the backward direction for proving/verification. So symmetric execution time is not desired.
Worth noting that the overall size of the proof generated by this code can be shrunk significantly by changing the Merkle root to root + n levels of branches and generating the proofs up to the highest branch than down to the root each time.
I couldn’t find a name for “root + n levels of branches” so called it a Merkle pollard. Details are at https://medium.com/@jgm.orinoco/understandingmerklepollards1547fc7efaa (skip to the section “Merkle pollards” to avoid the part explaining what Merkle trees are).
The MIMC stark code already does some Merkle proof compression via a separate compression algorithm that detects duplicate hashes and replaces them with a symbol. That said, I’m sure it’s suboptimal and there’s better ways to do the same thing, especially since any top levels of the tree that are fully covered by merkle branches can be left out entirely. I feel like optimal Merkle proof batching is a problem that someone somewhere has already come up with a neat clever algorithm for… something like “start with all the nodes for all branches, including both the branch and the sisters, and then repeatedly detect and remove any nodes for which both children are either in the proof or have been already been removed due to being calculable”.
Some sort of “proof tree”, if my mental image of the resulting structure is correct. Interesting idea but not sure how much of a saving it would be over a simple pollard, which wouldn’t require additional pointers to link the common branches of proofs together. Might have a play to see if there are additional space savings to be made.
I made a quick implementation: https://github.com/ethereum/research/blob/7db6b87cf8642a8671dd9890909586912a0929c9/mimc_stark/merkle_tree.py#L37
It seems to have reduced the size of the STARK proofs by nearly 20%, relative to what was already a pretty decent compression mechanism.
Some years ago I developed an algorithm for the merkle tree traversal problem which asks the question which nodes to store of a merkle tree between generating authentication paths for leaf(i) & leaf(i+1) and which nodes to recalculate on demand. The goal is to have a good/optimal tradeoff between required memory and required recomputation and it should be configurable. This is used for signature generation in merkle signature schemes. One component repeatedly used is the so called TreeHash algorithm which allows to calculate a node from its leaf requiring O(N) hashing and O(log(N)) memory (excluding the leafs (these are calculated in all merkle signature schemes instead of stored)). Where N is the number of leafs. Its a straight forward algorithm and probably used a lot outside of merkle signatures schemes as well. I think that by cleverly ordering the Nodes of your output this algorithm could be adapted to reduce the memory footprint and increase the performance of the verify process by recalculating the root from the proofs, proofing all proofs at once instead of calculating all the missing nodes and then proof each leaf individually. If I find time I will do an implementation of it
First compliments on the threepart Stark explainer, great work.
I finally came around downloading the python code to look at its inner workings and I think I found a weak spot.
At every stage of the recursion you sample the rows that are checked based on the merkle root of the values of the column. Because we sample only 40 rows, we can get away with quite some changes in the column without getting caught. For example, if we change 10% of the numbers in the column and do 40 samples, we have a chance of 1.4% of not getting caught. We can easily try multiple variants of the column to beat the odds.
When we start with a high degree polynomial and have 10 levels of recursion, it is possible to change all points in such a way that they are all on the same low degree polynomial once we reach the final check.
To make this attack harder we could sample the rows of all levels based on the last merkle root. In that case we have a chance in the order of 10^{19} which is much better. Turning it into an interactive proof via the blockchain is also possible, but might be less practical in some cases.
To avoid this attack altogether and also make the proof smaller, we should follow the sampled points from the base layer all the way to the final polynomial. So instead of sampling just one value on the column per check, we should sample the 3 sibling values as well, to use in the next recursion step. Skipping the second lookup per merkle tree can reduce the total size of the proof significantly.
By following these “pointtraces” all the way, we make sure they have not been tampered with. You could still try to tamper with the sibling points, but this is not useful without knowing the column that will be chosen next.
@vbuterin thanks for the code; finally had a chance to look at it and think about how it compares with pollarding.
Given that the chance of removing any given hash is related to the ratio of proofs to data values (plus level in the tree, of course) it’s no surprise the more proofs there are the better the overall reduction. At relatively low levels (1:256, or 64:16384 as was actually tested) there is little difference and depending on the encoding method for the data it can be less efficient than pollarding (as the empty placeholders still take up some space). But stepping this up to higher ratios brings in higher efficiencies, for example at 1:32 (512:16384) it is about 30% more efficient than pollarding (which is itself 60% more efficient than simple proofs). A nice increase for many realworld cases.
An interesting difference is that both generation and verification of proofs require all of the individual proofs to be present; with the former this maximises the gain and with the latter it guarantees all values are in place to enable verification. This makes pollarding potentially better for interactive proofs where the pollard can be sent once and subsequent proofs can refer to it.
The ultimate solution would be to have a progressive interactive version of your system where both sides remember the proofs that have been sent to date and so over time proofs sent can be more and more efficient as they require fewer new hashes to verify. Probably not something that would be a common requirement, however (and it comes with its own tradeoffs).
If you don’t mind I’d like to write up the solution (thinking of calling it “spare tree proofs” unless I can come up with something catchier) as a companion piece to the one I wrote on pollarding.
I agree with @daria that Reverse MiMC sounds problematic: As an example, if you specifically want the VDF for collaborative randomness, then an adversary could use the faster reverse computation. You could easily fix this with another hash function anywhere in the process, but if you want VDF proofs to be owned to reward the evaluators then the fast reverse evaluation still breaks that, and you cannot fix that so easily.
Following the values that are sampled at the base level (as I described at the end of my previous post), instead of resampling at every step of the recursion, could unlock some extra optimizations to shrink the proof further:
 Every merkle tree is sampled only once and we always collect four values. So we could put these four values together in the leaf of the tree. This would lower the depth of the tree by two levels and save some hashes.
 Increasing the degree of the polynomial on the rows from 4 to 8 or even 16 does not come at a high cost and could save a level of recursion. The number of values per level will increase, but the merkle part would shrink. Especially for levels with deep trees the trade offs will be favorable.
 At the final level of the recursion, we now publish all 256 values of the column to be able to verify the values of the previous level and check whether these values are on the same low degree (max 16) polynomial.
I don’t think we need to publish these 256 values at all. The verifier could just calculate the 40 values that should be in that column based on the values from the previous level and check whether these 40 values are on a polynomial with a max degree of 16.
Because we are sampling 40 values we might be able to stop earlier and check for a polynomial of degree 32.
For this to work we need to make sure that we will end up with enough distinct values at the last level of the recursion, so we have to adjust the sampling process a little.  Increasing the number of values at the base layer and lowering the number of samples, could shrink the size of the proof while keeping the same level of security. This is off course more computational intensive, but could have a positive trade off.
Note that this is not possible for the original version, because the attack described in my previous post would be easier, so it would lower the security of the proof.
The authentication path optimization seems similar to Octopus in https://eprint.iacr.org/2017/933 . (It may be a reinvention of the same thing.)