Explaining the liveness guarantee


#1

One aspect of Ethereum Serenity that seems to surprise people who follow its progress only casually is the quadratic leak imposed on validators for being offline and missing a slot. For those unwilling or unable to read the Casper FFG paper, I wrote up a quick explanation of how one could arrive at this solution using fairly conventional wisdom from distributed systems.

I am curious what others think of this explanation. Any and all notes/insights/feedback are welcome.

Liveness

One of Serenity’s main goals is to guarantee liveness (i.e. to continue finalizing blocks) in the event of a major internet partition (e.g. World War 3). This liveness guarantee comes at a steep cost, which makes it important to understand the predicament and the possible tradeoffs.

The CAP Theorem

The CAP Theorem for distributed systems tells us that:

You can’t simultaneously guarantee more than two of the following:

  • Consistency: Every read receives the most recent write or an error
  • Availability: Every request receives a (non-error) response – without the guarantee that it contains the most recent write
  • Partition Tolerance: The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network

The Assumption

By viewing the argument through the lens of the CAP Theorem, we can deduce the rationale for the inactivity leak by accepting the following assumption:

No network can guarantee message delivery because the network itself is not safe from failures (e.g. client disconnects, data center outages, connection loss).

Partition Tolerance

Since message delivery cannot be guaranteed, the logical thing to do is to tolerate prolonged message loss. This is equivalent to Partition Tolerance.

Sidenote: Think of the World War 3 scenario as a dysphemism for prolonged message loss between groups of validators.

With Partition Tolerance as a hard requirement, we are now limited to tradeoffs between Consistency and Availability.

World War 3

In the World War 3 scenario, where the network is severed, the validators are split into two partitions. From Casper FFG, we know that in order for both partitions to continue finalizing blocks, we need a two-thirds majority of validators to be online in both partitions. This is obviously not possible; however, we can prevent the chain from stalling forever if we are willing to explore a compromise between our Availability and Consistency guarantees.
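To spell out why both partitions cannot hold a two-thirds majority at once, here is a short derivation. The symbol p (the fraction of the total validator stake that ends up in one partition) is notation introduced for this sketch, not something taken from the Casper FFG paper.

```latex
p \ge \tfrac{2}{3} \quad\text{and}\quad 1 - p \ge \tfrac{2}{3}
\;\Longrightarrow\; p + (1 - p) \ge \tfrac{4}{3}
\;\Longrightarrow\; 1 \ge \tfrac{4}{3}
```

The last inequality is false, so at most one partition can reach two-thirds of the original stake, and in an even split neither can.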

The Compromise

This is accomplished by introducing an inactivity leak that drains the deposits of unresponsive validators each time a slot is missed, until the remaining validators in each partition become the supermajority.

At this point, blocks in both network partitions can begin to finalize; however, if the network partition is later healed, we are left with two valid and separate networks.
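The mechanism can be sketched as a toy simulation. This is a simplified model, not the Serenity spec: the function name, the `leak_quotient` parameter, and the exact penalty rule (a per-epoch penalty that grows linearly with time since finality, so the cumulative loss grows roughly quadratically) are assumptions made for illustration.

```python
SUPERMAJORITY = 2 / 3


def epochs_until_supermajority(online_fraction: float, leak_quotient: int) -> int:
    """Count epochs until online validators hold >= 2/3 of the remaining stake.

    Toy model (assumption, not the spec): in epoch t since the last
    finalized block, every offline validator loses t / leak_quotient of
    its current balance. Online validators are left untouched.
    """
    if not 0 < online_fraction <= 1:
        raise ValueError("online_fraction must be in (0, 1]")
    if leak_quotient <= 0:
        raise ValueError("leak_quotient must be positive")

    online = online_fraction          # stake that keeps attesting
    offline = 1.0 - online_fraction   # stake that is leaking away
    epochs = 0
    # At epoch == leak_quotient the offline stake reaches zero,
    # so the loop always terminates.
    while online / (online + offline) < SUPERMAJORITY:
        epochs += 1
        offline -= offline * epochs / leak_quotient
    return epochs


if __name__ == "__main__":
    # leak_quotient=1000 is an arbitrary illustrative value.
    for fraction in (0.7, 0.5, 0.4):
        print(fraction, epochs_until_supermajority(fraction, leak_quotient=1000))
```

In this model each partition runs the same loop from its own point of view, treating the validators on the other side as offline. That is exactly why both sides can eventually finalize on their own.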


#2

Assuming each partition contains 50% of the active validators, it would take ~13 days before it is mathematically possible for either partition to begin finalizing blocks. Even that would require 100% of the validators in each partition to attest to the same blocks, which is very unlikely. Realistically, the earliest we could start finalizing blocks would be after ~17 days.

I guess the assumption would be that if there were any WW3 event where the network was partitioned, it would be resolved in less than 17 days.
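Independent of the exact timing, the stake threshold behind "mathematically possible" can be worked out directly. A minimal sketch, assuming only that finality needs two-thirds of the remaining stake and that every online validator attests; the function name is made up for this example.

```python
from fractions import Fraction

SUPERMAJORITY = Fraction(2, 3)


def max_offline_retention(online_fraction: Fraction) -> Fraction:
    """Largest share of their original stake the offline validators may
    still hold while the online validators make up a supermajority.

    With online stake p and offline stake (1 - p) * r, solving
    p / (p + (1 - p) * r) >= s for r gives r <= p * (1 - s) / (s * (1 - p)).
    """
    p = Fraction(online_fraction)
    if not 0 < p <= 1:
        raise ValueError("online_fraction must be in (0, 1]")
    if p == 1:
        return Fraction(1)  # nobody is offline, nothing needs to leak
    r = p * (1 - SUPERMAJORITY) / (SUPERMAJORITY * (1 - p))
    return min(r, Fraction(1))  # capped: no leak needed above 2/3 online


if __name__ == "__main__":
    print(max_offline_retention(Fraction(1, 2)))
    print(max_offline_retention(Fraction(1, 4)))
```

For a 50/50 split this evaluates to 1/2, meaning the validators seen as offline must lose half of their deposit before the other half can finalize. If only a quarter of the validators stay online, the result is 1/6.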


#3

I feel that 50/50 network partitions/splits are massively overrated as a threat. When has this historically happened, anywhere, and not been resolved soon? What would even be a coherent story by which two parts of the world stay coherent internally but communication between them is not possible? The incentives for maintaining global communication are massive, and there’s no reason why parts of the world that are still capable of maintaining communication internally would not be able to figure something out to talk to each other within a week or two. The thing that’s much more likely, whether in normal life, or due to government censorship, or due to a war, is nodes going offline, either because something happens to their operators or because their operators get cut off from the entire internet.

The inactivity leak is primarily there to deal with this “3/4 of the network goes offline at the same time” risk.


#4

Nice post, I love plain English. :slight_smile:

It will take some time for this to start happening (around 2 weeks, AFAIK). If that massive partition isn’t resolved within that period, the split becomes final/irreversible.

Totally agree.

I was always inclined towards consistency in general, but this is an extremely strong point.

Btw, I don’t quite understand why the term “liveness” (instead of “availability”) is almost always used on Ethresearch and elsewhere? My understanding was that “safety” and “liveness” are terms mainly related to FLP Impossibility? Or it simply doesn’t matter? :slight_smile:


#5

Thanks for the feedback!

This is a good suggestion…I am going to add some more detail about the partition, how long until the chain can begin finalizing again, etc.


#6

Good point…I think this example will resonate with people much better than a doomsday scenario. My goal is essentially to help people understand that these decisions were the end result of a logical and pragmatic train of thought. I attempted to address this in my sidenote where I call the WW3 scenario a dysphemism, but I think I need to augment this with some version of your explanation.