Supporting decentralized staking through more anti-correlation incentives

(Disclaimer: I am a home staker.)

Thanks again for engaging on this issue; it is appreciated. I think Ethereum’s correlation penalty is one of its best staking-decentralization incentives, and IMO it is under-utilized and could be doing so much more.

I’ve actually pointed this out in an earlier post that uses this mechanism to encourage adding proper protocol-legible metadata about validators. The details are in the post, but the tl;dr is to reduce the correlated-failure penalty if the validators have voluntarily declared themselves as run by the same operator, and/or increase it if they have not.

More generally, I’d love a discussion on how the design space of the correlated-failure mechanism could be expanded if validators had protocol-legible ownership metadata attached to them.
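
To make that concrete, here is a minimal sketch of one point in that design space, in Python. Everything in it is an assumption rather than a spec: `declared_operator` stands in for the hypothetical protocol-legible metadata, and the constants and the quadratic shape are placeholders for whatever curve an actual mechanism would use.

```python
from dataclasses import dataclass
from typing import Optional

BASE_PENALTY = 1.0        # penalty for a single isolated missed attestation
CORRELATION_EXPONENT = 2  # superlinear growth in simultaneous misses
DECLARED_DISCOUNT = 0.5   # declared same-operator misses count for less
UNDECLARED_BOOST = 1.5    # undeclared correlation is penalized harder

@dataclass
class Validator:
    index: int
    declared_operator: Optional[str]  # the hypothetical ownership metadata

def miss_penalty(missed: list[Validator], v: Validator) -> float:
    """Penalty for validator v, given every validator that missed this slot."""
    same_op = sum(
        1 for m in missed
        if v.declared_operator is not None
        and m.declared_operator == v.declared_operator
    )
    other = len(missed) - same_op
    # Misses inside a declared cluster are expected to correlate, so they are
    # discounted; correlation across undeclared validators is treated as
    # evidence of hidden shared infrastructure and weighted up.
    effective_size = DECLARED_DISCOUNT * same_op + UNDECLARED_BOOST * other
    return BASE_PENALTY * effective_size ** CORRELATION_EXPONENT
```

With these toy constants, 100 validators missing together pay (0.5·100)² = 2,500 each if they declared a common operator, versus (1.5·100)² = 22,500 if they did not, so honest declaration is strictly the cheaper strategy.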


I’d also like to point out something about one of the assumptions from the OP, which I believe to be overly reductive:

While I am a home staker (not a large staking operator), I have experience running online services at scale, and I can attest that the hardware cost of replicating identical physical setups is only a tiny part of the economies of scale that large service operators enjoy. The other large-but-scalable cost stakers face is “devops” or maintenance work. This is work like:

  • Monitoring the node and validator daemons (alerts, dashboards).
  • Responding to incidents quickly and effectively.
  • Having redundancy built into the system.
  • Rotating keys (not necessarily validator keys; can also be OpenSSH keys to the servers themselves).
  • Hardening servers and protecting against intrusion (especially relevant for validators, as they hold hot keys).
  • Handling software upgrades (of the Ethereum software or just the regular software on the server it runs on).
  • Moving accumulated ETH rewards for safekeeping, DeFi, or to new validators and spinning those up.
  • Running sidecar software like mev-boost and ensuring its uptime and redundancy (multiple relays, etc.).
  • Implementing long-term fixes to reliability problems once they occur: disk-almost-full alerting, automated node pruning, automated fallbacks to secondary node software (a minimal sketch of one such check follows this list).
  • Continuous integration infrastructure and tests for all of the above to ensure new replicas of this entire setup can be spun up on demand.
  • Continuous deployment infrastructure to keep all of it in sync with the intended configuration.
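
To give a flavor of what one item on this list looks like in practice, here is a hypothetical disk-almost-full check of the kind mentioned above, runnable from cron. The data directory, threshold, and addresses are made-up placeholders, and a real setup would page someone rather than email localhost.

```python
import shutil
import smtplib
from email.message import EmailMessage

DATA_DIR = "/var/lib/ethereum"  # assumed node data directory
THRESHOLD = 0.90                # alert when the disk is 90% full

def disk_usage_fraction(path: str) -> float:
    usage = shutil.disk_usage(path)
    return usage.used / usage.total

def alert(subject: str, body: str) -> None:
    # Minimal email alert via a local MTA.
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = "staker@localhost"
    msg["To"] = "staker@localhost"
    msg.set_content(body)
    with smtplib.SMTP("localhost") as s:
        s.send_message(msg)

if __name__ == "__main__":
    frac = disk_usage_fraction(DATA_DIR)
    if frac >= THRESHOLD:
        alert("node disk almost full", f"{DATA_DIR} is {frac:.0%} full")
```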

This complexity needs to be solved only once per staking operator, regardless of how many replicas of the physical setup exist. A large operator typically keeps this as a “configuration as code” setup (think Ansible/Kubernetes/Terraform configs) that defines how to create and configure servers so that the node software, and all the monitoring and security infrastructure around it, come up automatically, and pays humans to be available around the clock should a problem happen. Once that is in place, spinning up new physical replicas of the infrastructure becomes trivial: the same configuration-as-code setup is reused for each new replica, which is a large economy of scale.
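
As a back-of-the-envelope sketch of that amortization (every number below is invented; only the shape of the curve matters), the fixed devops investment divides across replicas while the hardware cost does not:

```python
FIXED_DEVOPS_COST = 100_000    # one-time config-as-code + tooling work
ONCALL_COST_PER_YEAR = 50_000  # humans available around the clock
HARDWARE_COST_PER_REPLICA = 2_000

def cost_per_replica(n_replicas: int) -> float:
    """Amortized cost per physical replica for an operator running n of them."""
    fixed = FIXED_DEVOPS_COST + ONCALL_COST_PER_YEAR
    return fixed / n_replicas + HARDWARE_COST_PER_REPLICA

for n in (1, 10, 100, 1000):
    print(n, round(cost_per_replica(n)))
# 1 152000
# 10 17000
# 100 3500
# 1000 2150
```

Making replicas physically independent raises HARDWARE_COST_PER_REPLICA but leaves the fixed term untouched, which is why it narrows the gap to home stakers without closing it.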

Home stakers do not have the time or resources to invest in setups of this level of reliability, so making multiple replicas (e.g. at a friend’s house) is a much more laborious process, and incident response is slower. Anecdotally, I only have email alerting if my validator starts missing attestations, and I usually only find time to look into it the next weekend.

For this reason, I disagree with the assertion that “if [a large staking operator] ends up [making an independent physical setup], it would mean that we have completely eliminated economies of scale”. It helps, but it doesn’t completely eliminate them.

(This, by the way, is one reason why Verkle trees are an exciting development for home stakers: they reduce not just the hardware requirements but, more importantly, the number of possible failure modes Ethereum clients can have, thereby reducing the relative advantage of the reliability work that goes into large-scale staking operations.)


I was also going to ask how MaxEB changes things, but it looks like the mini-FAQ you added covers that. :)
