Rate-limiting entry/exits, not withdrawals

Currently, validators are able to enter and leave the validator set relatively quickly: each time a validator set transition happens, 1/64 of the validator set can switch in or out, and so in the normal case, every validator can switch out within a day. There is a much slower queue for withdrawing, to prevent validators from all withdrawing right after they perform a large-scale attack, before they can be penalized. I argue that the status quo is suboptimal, and we should set the withdrawal delay to a minimal constant (e.g. 1 day) and instead use a much more conservative bound on entry/exit rate to serve the same function.

I claim this is a good idea for a few reasons:

  • CBC Casper compatibility: part of CBC philosophy is that the chain does not need to converge on one specific mechanism to determine canonical "finality"; applications can choose what thresholds they use. The current validator set change mechanism (1/64 every time the chain finalizes) requires the chain to have a canonical finality oracle.
  • Light client friendliness: having a much slower bound on validator set change makes it easier for light clients to skip ahead a relatively long distance at a time.
  • Ability to resume finalization via a surge of friendly deposits: if there are not enough validators that are online, a fixed rate limit on entry/exit, that does not require the chain to finalize to proceed, would allow altruistic ETH holders to swoop in and join the validator set to cause it to resume finalization.
  • Resistance to discouragement attacks: a discouragement attack (link to mini-paper here) involves an attacker (with >=33% stake) causing a medium amount of disruption to consensus, with the purpose of making it unprofitable for others to validate. This drives others to leave, making further attacks cheaper and more profitable. Making it simply not possible for validators to leave quickly (intuition: you joined the army, the fort is under attack, you have to stay around to defend it, being on-call for that sort of thing is the job description!) is the best known strategy to increase the cost of discouragement attacks.

The alternative is simple to implement:

  • Reduce the maximum number of validators that can enter or exit from 1/64 of the total to a much lower fraction, or even a square root or a constant (think: 1-3 months to rotate the entire validator set)
  • Repurpose the withdrawal queue into an entry/exit queue. Repurpose exit_slot as the slot when either (i) a deposit was processed or (ii) an exit was triggered. Post-exit withdrawal is now a fixed length of time (possibly extendable with proofs of custody)
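
To make the two bullets above concrete, here is a minimal sketch of a rate-limited exit queue. It is not the spec: the names `churn_limit`, `ExitQueue`, `rotation_days` and `withdrawal_delay_days` are hypothetical, the 90-day default is just one point in the "1-3 months" range, and time is counted in whole days rather than slots.

```python
from collections import deque


def churn_limit(validator_count, rotation_days):
    """Max validators that may exit per day so a full rotation takes ~rotation_days."""
    return max(1, validator_count // rotation_days)


class ExitQueue:
    """Toy FIFO exit queue with a per-day churn limit (illustrative only)."""

    def __init__(self, validator_count, rotation_days=90, withdrawal_delay_days=1):
        self.validator_count = validator_count
        self.rotation_days = rotation_days
        # Assumption: withdrawal becomes possible a fixed time after the exit.
        self.withdrawal_delay_days = withdrawal_delay_days
        self.pending = deque()  # validator ids waiting to exit, oldest first

    def request_exit(self, validator_id):
        self.pending.append(validator_id)

    def process_day(self, day):
        """Exit up to churn_limit validators; return (id, withdrawable_day) pairs."""
        limit = churn_limit(self.validator_count, self.rotation_days)
        exited = []
        for _ in range(min(limit, len(self.pending))):
            validator_id = self.pending.popleft()
            exited.append((validator_id, day + self.withdrawal_delay_days))
        self.validator_count -= len(exited)
        return exited
```

With 900 validators and `rotation_days=90`, at most 10 validators leave on the first day; anyone further back in the queue keeps validating until their turn comes, and only then does the short fixed withdrawal delay start.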

Very cool. The main tradeoff that jumps out here is the increased cost validators bear in the case they want to stop validating - and it's not even increased capital lockup costs, but rather the fact that they have to stay online and keep generating messages.

Are there any other disadvantages this scheme has over the previous one that Iā€™m missing?

Ability to resume finalization via a surge of friendly deposits:

It seems like the answer is obviously no, but is there any case where "lots of deposits can enter without finality" actually makes things worse? I'm wondering if it might make some weird discouragement attack more powerful (or something), as this change expands the attacker's set of strategies when finality is not being reached...

Very cool. The main tradeoff that jumps out here is the increased cost validators bear in the case they want to stop validating - and it's not even increased capital lockup costs, but rather the fact that they have to stay online and keep generating messages.

Capital lockup costs are unchanged; it's the stay-online requirement that's stronger. Though it's worth keeping in mind that the current system also has a stay-online requirement, as you need to respond to proof of custody challenges.


Sure. This strengthens the stay-online requirement though, as validators must be online per-epoch as compared to per-response-deadline.

Agree! Though OTOOH validators can earn revenue throughout this extra time period, whereas in the current design they're just sitting in limbo.


I think this is interesting and it would seem to me that the pros outweigh the cons.

On the economics side, we'd essentially be taking staking from a money market type product to a smaller term product. Although, it sounds like in your proposal that'd be only a month or so? I'd say from a validator standpoint and attractiveness of investment, they'd only demand slightly more interest. The one caveat is that since the staking rate is floating, a validator technically does not know what his interest rate will be over the period of lockup. Probably a smaller concern but one I'd say is worth considering.

What we're seeing in open finance so far is that instant lending is not offering a very big return, because the borrower set is much smaller than the lending set. I see some fixed term products coming online that are hoping to increase that rate but I'd expect almost all validators to think of staking as a longer term investment anyways. As long as this is known up front I don't really see much concern.


I think the staking community would tolerate a reasonable delay in withdrawal time. Choosing to stake is not like having funds in a savings or checking account, it's closer to a deposit into CDs & T-Bills. It's understandable that a withdrawal would not be instant.

Off the cuff, a 1 to 2 day withdrawal period seems fine. 3-4 days starts to feel like hardship. 5 days or longer would be quite the ask. Depending on the length, you might expect staking pool services to crop up that provide immediate withdrawal liquidity to customers as a service or for a fee (an unintended centralization vector perhaps? You could imagine stakers moving to a Coinbase or RocketPool for their "instant withdrawal times").

Of course, all of these dynamics (reward rate, slashing risk, withdrawal time) will play into the decision to stake & weighed against alternative uses of the capital, or more specifically alternative uses of the Eth in the open finance world.


What would happen here is that 1 day would be the "happy case" withdrawal time, and something like 3-6 months would be the "worst case" withdrawal time if everyone is trying to withdraw at the same time. I'm expecting the happy case to be the normal case, but still trying to come up with better ways to reason through that.
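
As a back-of-the-envelope check on those two cases, here is a small sketch. It assumes the queue drains at a steady 1/N of the validator set per day and that a fixed minimum delay is added on top; `expected_withdrawal_days` and its parameter names are made up for illustration.

```python
def expected_withdrawal_days(fraction_ahead, rotation_days, min_delay_days=1.0):
    """Rough time to withdrawal for a validator with `fraction_ahead` of the
    whole validator set queued in front of it.

    Assumes 1/rotation_days of the set is processed per day, plus a fixed
    minimum delay after the exit itself.
    """
    if not 0.0 <= fraction_ahead <= 1.0:
        raise ValueError("fraction_ahead must be between 0 and 1")
    return fraction_ahead * rotation_days + min_delay_days
```

With an empty queue this gives the 1-day happy case; with the whole set queued ahead of you and N = 180 it gives 181 days, which is the upper end of the 3-6 month range.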

1 day withdrawal time is completely fine, imo. The 3-6 months is a very long time if we look at it from a market perspective for obvious reasons. Though, I can't imagine many scenarios (besides actual attacks) where everyone would be trying to withdraw at once.

In saying this, I speculate that a lot of the people who will be staking are those that are currently just holding their ETH in cold storage so this long withdrawal period may not be an issue for them.

I also imagine that there will be some sort of derivative product that tracks the underlying 'withdraw in progress' ETH so that people could 'offload' their stake before the withdrawal is finalized.

Also, do validators have to wait until the withdraw is finalized before they can re-stake their ETH?

I did a simulation of actual withdrawal times assuming two simplifications: (i) a Zipf's law (i.e. power law with power=1) distribution of validator deposit sizes, (ii) each validator has a 1/D chance of deciding to start exiting any given day.

Here's the code: https://github.com/ethereum/research/blob/2d3ed6e42087d5b14cdf107c897e8d3e5db3ee7a/exit_queue_tests/exit_queue_tester.py

The results were fairly bimodal: if you set a rule that the entire validator set can withdraw after N days (i.e. 1/N of the set per day), and D > N, then we get validators being able to withdraw almost instantly. Here's N = 180 and D = 360:

Total delays in days
21759:  11.318 (min 12.534)
10879:  6.528 (min 6.267)
5439:  3.643 (min 3.133)
2719:  1.995 (min 1.566)
1359:  1.088 (min 0.783)
679:  1.010 (min 0.391)
339:  0.989 (min 0.195)
169:  0.836 (min 0.097)
84:  0.942 (min 0.048)
42:  0.866 (min 0.024)
21:  0.925 (min 0.012)
10:  0.928 (min 0.006)
5:  0.933 (min 0.003)
2:  0.943 (min 0.001)
1:  0.952 (min 0.001)

(delays are sometimes lower than the minimum because the minimum is calculated based on the target total deposit size, which doesn't perfectly match the actual one)

Now here's N = 180, D = 240 (only the top 5 rows for compactness):

21759:  12.667 (min 12.534)
10879:  8.149 (min 6.267)
5439:  5.059 (min 3.133)
2719:  3.866 (min 1.566)
1359:  3.003 (min 0.783)

Now D = 180, N = 180:

21759:  26.618 (min 12.534)
10879:  25.083 (min 6.267)
5439:  23.793 (min 3.133)
2719:  22.048 (min 1.566)
1359:  22.355 (min 0.783)

And N = 180, D = 120:

21759:  76.153 (min 12.534)
10879:  73.770 (min 6.267)
5439:  74.317 (min 3.133)
2719:  74.556 (min 1.566)
1359:  74.288 (min 0.783)

This makes me think that things will be fine but we would benefit from some explicit policy to discourage validators from exiting too quickly. Perhaps have the exit queue favor "older" validators in some way.
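
For readers who want to poke at the bimodal behaviour without reading the linked script, here is a much-simplified sketch of the same kind of experiment. It is not the linked code: the function name, the 500-validator population, the fixed seed, and the assumption that an exited validator is immediately replaced by an equal-sized newcomer are all mine, so the numbers will not match the tables above.

```python
import random
from collections import deque


def simulate_mean_exit_delay(rotation_days, mean_stay_days,
                             validators=500, days=1500, seed=0):
    """Mean exit delay in days, averaged over validators that finished exiting.

    rotation_days  -- N: the whole stake can exit in N days (1/N per day)
    mean_stay_days -- D: each validator starts exiting with probability 1/D per day
    """
    rng = random.Random(seed)
    # Zipf's law with power 1: the i-th largest deposit is proportional to 1/i.
    sizes = [1.0 / (i + 1) for i in range(validators)]
    daily_budget = sum(sizes) / rotation_days
    queue = deque()               # entries: [validator_index, stake_left, request_day]
    queued = [False] * validators
    delays = []
    for day in range(days):
        # Validators that are not already queued may decide to exit today.
        for i in range(validators):
            if not queued[i] and rng.random() < 1.0 / mean_stay_days:
                queued[i] = True
                queue.append([i, sizes[i], day])
        # Drain stake from the head of the queue until today's budget is spent.
        budget = daily_budget
        while queue and budget > 0:
            entry = queue[0]
            spent = min(budget, entry[1])
            entry[1] -= spent
            budget -= spent
            if entry[1] <= 1e-12:
                queue.popleft()
                # Assumption: an equal-sized newcomer takes the exited slot.
                queued[entry[0]] = False
                delays.append(day - entry[2])
    return sum(delays) / len(delays)
```

Calling `simulate_mean_exit_delay(180, 360)` and `simulate_mean_exit_delay(180, 120)` should show the same qualitative split: short delays when D > N, and a queue that stays long when D < N.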

Took a while to collect my thoughts. Here's my take on this proposal.

I think 1/N of the set per day, with a 2-3 day min delay is good. But, I think there's a case to increase the min delay to 3-5 days. I would argue that a longer minimum delay constant (e.g. 3 days) actually helps self-select validators that are aligned with the "defend the fort" intuition, and thus are more prone to be agreeable to the 1/N limit.

Here's how I would construct the argument:

  • Using a lower min constant (e.g. 1 day) sets up an anchoring bias that will lead people to underestimate how much of a commitment validating can be. This may cause some validators to regret their decision, which can result in poor retention.

  • For example, econoar, ryanseanadams and sassal have commented to the effect that 1 day is painless, this despite the delaying effects of the 1/N limit (i.e. everyone thinks they'll manage to be the first out). So, I argue that if the min constant has no bite, it won't be taken into consideration when deciding to stake or not, and thus might recruit less responsible validators (e.g. screw the fort, I'm just passing through, SO LET ME OUTTA HERE!).

  • If we want to balance the instantaneous security of more validators with the extended-time security of more fort-defenders, then I suggest we choose a goldilocks value for the min constant. This feels to me like 2 days minimum, but my preference would be for 3-5.

  • ADDENDUM: I feel that the choice of N is less sensitive as it will likely be discounted by most non-robotic validators. The anchoring effect of the min constant seems to be more important, at least while validators are mostly human.

I think 1/N of the set per day, with a 2-3 day min delay is good

Sorry, by delay here do you mean exit delay or withdrawal delay? Note that the proof of custody game already forces a 2 day min delay after you've exited.

ADDENDUM: I feel that the choice of N is less sensitive as it will likely be discounted by most non-robotic validators. The anchoring effect of the min constant seems to be more important, at least while validators are mostly human.

I don't think it's quite so simple. Humans are affected not just by numbers, but also by scare stories. And hearing someone say on reddit "OMG my exit took 49 days!!!1!" vs "OMG my exit took 176 days!!!1!" has a significantly different effect.

I assumed a variable exit delay and a fixed withdrawal delay. I wasn't aware that the custody game added 2 days delay as well. Hsiao-Wei Wang's recent slides make it appear as if this is only for Phase 1. Either way, my concern was only with the possibility of 1 day of total delays, from voluntary exit to withdraw-able state. That seems to not be the case. If it totals to a minimum of 2-5 days, I think that's great.

re: long exit delays, if I'm interpreting your code (above) correctly, it looks like month-long delays will be exceedingly rare. To test this, I extended your Python code to run a Monte Carlo simulation and count the number of n-day delays that occur on the network. Looks like on average, only one staker every 2 years should experience more than a 30 day delay in the absence of a black swan (based on your starting assumptions).

Here's the extended code:

And the results (key is the # of days delayed, value is the mean over 1000 trials):

So, if you add to these averages a total of 2-3 days to account for the custody game and minimal withdrawal delay, then I think this is looking good (at least w/ these assumptions). :+1:

More thoughts...

An interesting reason to support rate-limiting exits rather than withdrawals is that it boosts incentives for validating nodes to be physically decentralized as a way of avoiding slashing.

Here's a non-exhaustive list of scenarios that promote decentralization of nodes under a rate-limited exit scheme, by pushing validators to avoid being affected by the same environmental factors as other validators:

  1. Power grid issues (e.g. rolling blackouts or national power outage).

  2. Connectivity issues (e.g. the internet gets shutdown in country X).

  3. Oppressive regulation (e.g. Crypto is made illegal in country X).

  4. Geo-political strife in region X produces one of the above.

  5. A natural disaster and/or global warming produces one of the above.

Counter-argument: Validators shouldn't be penalized for environmental factors outside their control.

Counter-counter argument: The priority of the network should be to keep nodes up. Better to have fewer nodes in unstable regions and more nodes in stable ones. Validators in these regions can also move their operations elsewhere if disaster strikes.

Counter^3 argument: Not all people in these regions will have the freedom of mobility to physically relocate their operation.

Counter^4 argument: Hopefully there will be easy to use migration tools for validators to move their operation into the cloud ahead of some pending environmental danger.

My problem with this is that people may become aware that they will be unable to validate because of a hardware / network outage, etc.

In that case, they would be penalized for not being online / validating, which may be through no fault of their own.

Because of this, my preference would be that someone can stop validating quickly but withdrawals are rate limited.

The counter-argument to that would be that in the normal case, where the outage only hits you and a few other validators, you still will be able to exit quickly; the only case where you can't is if the outage hits everyone at once. But if the outage hits everyone at once, it's your duty to try really hard to get your node back online; that's what you signed up for by being a validator.

Another interesting alternative is a queue where the rate of processing depends on the existing queue length. This seems like it reduces volatility of exit times.

Here are results from simulating a queue that always allows a maximum of 1% per day of exits, assuming the average validator sticks around for k days, so think of 1/k as the rate of churn (if k < 100, then the queue is filling faster than it is clearing, though there is a natural upper bound as in the limit all ETH is stuck in the queue):

k Avg delay
200 0.5
150 1.4
120 3.7
110 5.6
100 13.6
90 21.4

Now, let's make the withdrawal rate proportional to log(len(exit_queue)), with parameters targeted to keep the average withdrawal delay at k=120 unchanged.

k Avg delay
200 0.8
150 1.7
120 3.7
110 5.0
100 8.4
90 11.7

We could go further, and use sqrt(validator_count * len(exit_queue)). Then we get:

k Avg delay
200 2.0
150 2.9
120 3.7
110 4.1
100 4.5
90 5.0

Now we can see some stability. We can clearly tune the tradeoff in whatever way we like.
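
Here is a rough sketch of how the three policies above can be compared. It is a deterministic, fluid-style approximation, not the simulation that produced the tables: validators are all the same size, arrivals are set to their expected value each day, and the `scale` constants are illustrative guesses rather than the tuned parameters, so the absolute numbers will differ.

```python
import math
from collections import deque


def constant_rate(validator_count, queue_len):
    """At most 1% of the validator set may exit per day."""
    return 0.01 * validator_count


def log_rate(validator_count, queue_len, scale=14.0):
    """Rate proportional to log(queue length); `scale` is an arbitrary guess."""
    return scale * math.log(1 + queue_len)


def sqrt_rate(validator_count, queue_len, scale=0.0475):
    """Rate proportional to sqrt(validator_count * queue length); `scale` is a guess."""
    return scale * math.sqrt(validator_count * queue_len)


def average_exit_delay(rate_fn, k, validator_count=10_000, days=2_000):
    """Average days spent in the queue when validators stay ~k days on average.

    Only exits in the second half of the run are counted, so the warm-up
    period does not skew the average.
    """
    queue = deque()   # request day of every queued validator, oldest first
    carry = 0.0       # fractional arrivals carried over to the next day
    delays = []
    for day in range(days):
        # Expected number of active validators that decide to exit today.
        carry += (validator_count - len(queue)) / k
        arrivals = int(carry)
        carry -= arrivals
        queue.extend([day] * arrivals)
        # The policy decides how many queued validators may leave today.
        allowed = min(len(queue), int(rate_fn(validator_count, len(queue))))
        for _ in range(allowed):
            requested = queue.popleft()
            if day >= days // 2:
                delays.append(day - requested)
    return sum(delays) / len(delays) if delays else 0.0
```

Comparing `average_exit_delay(constant_rate, k)` with `average_exit_delay(sqrt_rate, k)` for k = 200 and k = 90 should reproduce the qualitative point: under the constant cap the delay swings widely as churn crosses the cap, while the queue-length-dependent rate keeps it within a much narrower band.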

I think 99% of the time, we'll be in a "happy" case. The "mayhem" case is when a disaster (or massive FUD, black swan event) happens and almost every validator wants out immediately. For example, somebody yells FIRE in a crowded theater with only one exit. The only model I can think of is everyone squeezed up at the door trying to get out.

I think there have to be incentives for people who stick it out when the network really needs it. Some rewards which could even incentivize brand new validators to rush in to help (and stay online).

Btw is this representative of the validators entry and exit steps?

One small detail. The "can't get penalized" can go up one level next to "receives back the stake". A withdrawable validator is not slashable.

Updated