Thanks! On my machine (Intel Core i9-9980KH @ 2.4 GHz) I’m seeing 1.3 ns / mul for your code.
I’ve also benchmarked a rudimentary implementation of Montgomery multiplication in a field with modulus 2^{61} + 1025 \cdot 2^{40} + 1, and I’m seeing about 1.3 ns / mul as well. For Montgomery multiplication I’m assuming the values are already in Montgomery form - so there will be an additional overhead of converting to/from Montgomery form, though, in practice, this overhead should be negligible.
However, since the field I’m using also supports delayed reduction, I’ve benchmarked that as well, and I’m getting about 0.9 ns / mul. So, it seems like Montgomery multiplication with delayed reduction outperforms both Crandall method and plain Montgomery method by about 30%.