Yes - will report back on this shortly.
The program flow has to actually use the cores available to it to get improvements, with goroutines/channels/waitgroups/… structured in a pipeline matching the cores nicely – so if you have a machine with 64 cores with only 3 go routines running computation (1), (2) and (3) in parallel, you aren’t going to do much better than with just 3 cores. On the other hand, if you have a machine with 8 cores, having 128 goroutines running through 128 pieces of a for loop will have worse performance than if you just have 8 goroutines. To minimize the time of STARK proof generation, its essential to take the computational dependency graph, break it into stages, and focus on the key bottlenecks of the longest route. This is what the pipelined version of your STARK looks like in my Go re-implementation, with sample times attached:
The longest route is currently (2a+b+c) => (4a+b) =>(7) => (8) => (9) => (11) => (12) (shown in red). Inside the longest route are the time consuming FFTs + inverse FFTs (in step (2b)+(2c)], multi_interp_4 (in step (11)) – I did make a first pass at parallelizing these internal operations, but I’m certain we can do a lot more.
Here is the verification part, much easier to reason about:
Most of the time, breaking up a for loop into N goroutines [as in (v4)] is totally worth it (for some N, but less worth it for 2N), but sometimes its not (because the overhead of the goroutine is just too high) – so it makes for a lot of empirical engineering. With gradient descent in the parameter space, we can have confidence there is another 25-40% speedup to be squeezed out of the proof timing. We’ll try out beefy machines with 24-32 cores for comparison this month and report back.