Cross Shard Locking Scheme - (1)

I'm going to try to re-express this in my own framework.

Suppose there is a transaction that needs to access resources R[1] ... R[n] on shards S[1] ... S[n] (we'll say for simplicity one resource per shard; the generalization to multiple resources in one shard is obvious), where the "home shard" is S[1]. First, the transaction needs to be included in every S[i], which calls the LOCK opcode with arguments (T, R[i], S[1]), for some transaction identifier T. This locks resource R[i], preventing its state from being modified, and targets the lock to S[1]. Every shard's state maintains a data structure {shard: locks}, storing for every other shard the set of locks that are targeted to that shard.
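For concreteness, here is a minimal Python sketch of that per-shard structure (the `Lock`/`LockTable` names and the in-memory representation are my own illustrative assumptions, not part of any spec):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Lock:
    tx_id: bytes       # transaction identifier T
    resource: str      # resource R[i] held on this shard
    target_shard: int  # home shard S[1] the lock is targeted to

class LockTable:
    """Per-shard {shard: locks} structure: for every other shard, the set of
    active locks targeted to that shard."""
    def __init__(self):
        self.locks_by_target = {}      # {target shard: set of Lock}
        self.locked_resources = set()  # resources whose state is frozen

    def lock(self, tx_id, resource, target_shard):
        """LOCK opcode: freeze `resource` and target the lock to `target_shard`."""
        if resource in self.locked_resources:
            raise ValueError("resource is already locked")
        self.locked_resources.add(resource)
        self.locks_by_target.setdefault(target_shard, set()).add(
            Lock(tx_id, resource, target_shard))
```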

It is understood that in order to execute the state at block N of any shard S[i], the client needs to have available the state roots of block N-10 of every other shard, and must also download the entire currently active set of locks targeted to S[i] from every shard (this can be trivially downloaded and authenticated against the state roots of the other shards). Now, when a transaction whose hash is T executes, it can read all locked resources R[1] ... R[n], perform some computation, and then suggest new contents for the state of each resource. The shard creates a "release lock" message, targeted to each shard, which also specifies what to set the resource's new state to; additionally, the shard stores in its own internal state a record saying that this instance of the resource lock cannot be used again. Whenever this release message is included into any other shard via a Merkle proof, that shard sets the new state of the resource to the given value, and the lock is deleted.
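Continuing the same sketch, applying a release-lock message might look as follows; the `Release` shape is an assumption, and the Merkle-proof check against the home shard's state root is stubbed out as a boolean:

```python
from dataclasses import dataclass  # continuing the LockTable sketch above

@dataclass(frozen=True)
class Release:
    tx_id: bytes
    resource: str
    new_value: bytes  # suggested new contents for the resource

def apply_release(table, state, release, home_shard, merkle_proof_ok):
    """Apply a "release lock" message included via Merkle proof from the home
    shard: set the resource to its new state and delete the lock."""
    if not merkle_proof_ok:
        raise ValueError("release must be authenticated against the home shard")
    state[release.resource] = release.new_value            # write new contents
    table.locked_resources.discard(release.resource)       # resource unfrozen
    table.locks_by_target.get(home_shard, set()).discard(  # drop the lock record
        Lock(release.tx_id, release.resource, home_shard))
```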

This does require any node executing a block on a shard to be aware of more data from other shards, but it does seem to work.


An alternative mechanism, where execution can depend on state roots only (reminder: for simplicity I'm assuming that state roots can be used to Merkle-prove logs, so we don't need to worry about state and receipt roots separately), is as follows (a code sketch follows the list):

  1. The resource must be locked on every shard, and the lock generates a simple log that specifies the target shard and transaction ID.
  2. The receipts proving that all locks have been made need to be included into the target shard.
  3. The transaction on the target shard issues logs that contain the new values for the locked resources.
  4. The values can be unlocked and set to their new values on the original shard by including the receipts generated by the target shard.
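A rough sketch of steps 1-4, with hypothetical log shapes and `shard` objects that expose `lock`/`unlock` and a `state` mapping (all assumptions):

```python
def lock_and_log(shard, tx_id, resource, target_shard):
    """Step 1: lock the resource and emit a log naming the target shard and tx ID."""
    shard.lock(resource)
    return {"type": "LOCKED", "tx": tx_id, "resource": resource, "target": target_shard}

def execute_on_target(lock_logs, compute_new_values):
    """Steps 2-3: once receipts proving all locks are included on the target
    shard, run the transaction and emit logs carrying the new values."""
    assert all(log["type"] == "LOCKED" for log in lock_logs)
    new_values = compute_new_values()
    return [{"type": "SET", "resource": log["resource"], "value": value}
            for log, value in zip(lock_logs, new_values)]

def apply_set_log(shard, set_log):
    """Step 4: back on the original shard, including the target shard's receipt
    unlocks the resource and sets its new value."""
    shard.state[set_log["resource"]] = set_log["value"]
    shard.unlock(set_log["resource"])
```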

This keeps the model of what cross-shard knowledge executors are required to have the same as it was before, and pushes the work of actually performing the cross-shard operation to users.

I actually like this; it seems fairly simple.


This is similar to the method used in OmniLedger. This works well for applications where the user is locking state that only they can modify, such as spending a UTXO, because if they lock a UTXO that belongs to them but then do nothing, the user is only harming themselves. But if the user locks an object that can be accessed by anyone, such as a hotel ticket that can be bought by anyone, then a user who locks the ticket and does nothing prevents anyone else from buying it. Pushing the cross-shard operations to clients means that we also rely on the client being honest to guarantee liveness. In 2PC-style protocols (i.e. S-BAC), the liveness property relies on the shard being honest instead.


Not necessarily. Remember that whoever wants to buy the hotel ticket next can just submit all of the required transactions to unlock the object themselves. Though there definitely are complexities around gas accounting here that need to be understood much better.

What happens when there is a deadlock then, and two conflicting locks are held on two different shards? We either need to trust that the clients will resolve this on their own, or have a timeout for locks to deal with bad clients, as proposed in the original post, which means no one can buy that hotel ticket for the duration of the timeout.

Indeed there are questions about gas accounting, i.e. should we charge for failed transactions? If we don't charge for failed transactions, then an attacker can keep locking an object for free, preventing anyone else from spending it, under this model.

A timeout is not, strictly speaking, necessary if we use a wound-wait approach, which could be executed by the shards. The shards should be resolving deadlocks, even if they are not doing most of the work of acquiring locks.
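For reference, the classic wound-wait rule is tiny; a sketch, assuming each transaction carries a timestamp (e.g. the block height at which it first acquired a lock; lower = older):

```python
def on_lock_conflict(requester_ts, holder_ts):
    """Wound-wait: an older requester wounds (aborts) the younger holder; a
    younger requester waits. Waiting only ever happens on older transactions,
    so no wait cycle, and hence no deadlock, can form."""
    if requester_ts < holder_ts:
        return "WOUND"  # abort the holder; the requester takes the lock
    return "WAIT"       # requester queues behind the older holder
```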

Is that similar to the wound-wait mechanism proposed? There is still a question of what happens if T_j never releases its locks in case 2, no?

If the shards should be resolving the deadlock, that begins to look like 2PC. :wink:


What happens when there is a deadlock then, and two conflicting locks are held on two different shards?

How is that possible? Control of what a resource is locked to is based on the shard on which the resource is located, so you can't have two contradictory locks.

Yes, I would definitely want to borrow that part from your scheme. It's clever. :wink: However, I will say that I think the benefit of wound-wait is that not all locks need to be acquired in the same round.

Suppose two transactions T_1 and T_2 might both need resources A and B. T_1 has a lock on A and T_2 has a lock on B in different shards.
We need to find a way to resolve these deadlocked transactions.
One way is to say: all locks are sent to all shards, so deadlock can be avoided. However, that would reduce the benefits of scaling, as data for all transactions would be sent to all shards. So instead, we just send lock data to the relevant parties, but then we need a way to resolve deadlock. I suggested a wound-wait approach.

Let's say you have a transaction T1 that needs access to two resources, located on two different shards, such that it needs to do the following operations:

  1. LOCK (T1, R[0], S[0])
  2. LOCK (T1, R[1], S[1])

Then let's say another transaction T2 comes in that wants access to the same resources:

  1. LOCK (T2, R[0], S[0])
  2. LOCK (T2, R[1], S[1])

Then let's say this is what actually gets executed:

  1. LOCK (T1, R[0], S[0])
  2. LOCK (T2, R[1], S[1])

Now, neither T1 nor T2 can proceed, hence deadlock.
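This interleaving is easy to reproduce in a toy model where each shard simply grants whichever LOCK it sees first (names are illustrative):

```python
holders = {("S0", "R0"): None, ("S1", "R1"): None}  # (shard, resource) -> holder

def try_lock(shard, resource, tx):
    """Grant the lock only if the resource is currently free on that shard."""
    if holders[(shard, resource)] is None:
        holders[(shard, resource)] = tx
        return True
    return False

assert try_lock("S0", "R0", "T1")      # T1's lock lands first on S[0]
assert try_lock("S1", "R1", "T2")      # T2's lock lands first on S[1]
assert not try_lock("S1", "R1", "T1")  # T1 is now blocked on S[1]
assert not try_lock("S0", "R0", "T2")  # T2 is now blocked on S[0]: deadlock
```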


Ah, I see. I think something like two-phase commit should solve this, right? So, expanding my proposal from above (a sketch of the per-shard bookkeeping follows the list):

  1. All desired resources R[1] ... R[n] on shards S[1] ... S[n] must be locked, and each lock generates a log that contains the transaction ID, target shard and the complete list of resources and shards needed; the tuple (tx ID, target shard, resource list) is called the "lock ID".
  2. The log in shard S[i] needs to be included in all shards S[1] ā€¦ S[n] except S[i] itself.
  3. Suppose that any shard S[i] receives an attempt to lock a resource, with lock ID L, but that resource is already locked. Then, shard S[i] issues a log that can be used in other shards to repeal any lock with lock ID L.
  4. Suppose that a shard S[i] discovers that it has received a notification of a lock of all resources R[1] ā€¦ R[n], all with the same lock ID, and it does not see that any of those resources have been unlocked because of a conflict. Then, it issues a log confirming the locks, which can then be read by the target shard.
  5. If every shard S[1] ā€¦ S[n] issues a confirmation log, then the target shard can execute the transaction, and issue logs containing the new state of every resource.
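A sketch of the per-shard bookkeeping this implies, modeling a lock ID as the tuple (tx ID, target shard, tuple of resources); the log shapes and the `ShardView` name are my assumptions:

```python
class ShardView:
    def __init__(self):
        self.locked = {}       # resource -> lock ID (this shard's own locks)
        self.seen_locked = {}  # lock ID -> resources known to be locked
        self.repealed = set()  # lock IDs that hit a conflict somewhere

    def try_lock(self, resource, lock_id):
        """Steps 1 and 3: lock and log, or emit a repeal log on conflict."""
        if resource in self.locked:
            return {"type": "REPEAL", "lock_id": lock_id}
        self.locked[resource] = lock_id
        self.seen_locked.setdefault(lock_id, set()).add(resource)
        return {"type": "LOCKED", "resource": resource, "lock_id": lock_id}

    def receive_log(self, log):
        """Step 2: logs from the other shards are included here via Merkle proof."""
        if log["type"] == "REPEAL":
            self.repealed.add(log["lock_id"])
        else:
            self.seen_locked.setdefault(log["lock_id"], set()).add(log["resource"])

    def maybe_confirm(self, lock_id):
        """Step 4: confirm once every resource in the lock ID is known to be
        locked and no conflict has been observed."""
        tx_id, target_shard, resources = lock_id
        if lock_id not in self.repealed and \
                self.seen_locked.get(lock_id, set()) == set(resources):
            return {"type": "CONFIRM", "lock_id": lock_id}
        return None
```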

Yes, that's almost how Sharded Byzantine Atomic Commit works, but these corner cases are accounted for. :slight_smile:


Ok, reading that paper now.

My current understanding is: if an attempt is made to lock a resource on some shard with some lock ID, and this attempt fails because the resource is already locked, then this issues a log that can be used to unlock any resources in other shards that were locked under the same lock ID. If a single shard receives notifications of logs from all shards, then it knows that no shard could have issued a log that would repeal any of the other locks, and so it can safely publish a confirm message.


That sounds about right. I would add that a failed attempt to lock a resource on some shard with some lock ID specifically means that some shard has produced a log locking one of its resources to some shard ID, but another shard has produced a log locking another resource, required by the same lock ID, under a different lock ID, thus deadlock. This is something that could happen if, for example, these two shards produce a block at the same time with conflicting locks. These two shards, upon receiving a log from each other showing that they have conflicting active locks, should release the locks in their respective lock IDs (abort the transaction).

Given that shards have an honest majority, that would guarantee liveness of the protocol. To reduce the chance of repeated aborts (e.g. due to a malicious peer that is carefully timing the sending of conflicting transactions to different shards), we can perhaps add some non-enforced 'etiquette' consensus rules: for example, an implicit ordering can be applied to the set of shards, and if there were two previously aborted conflicting transactions, the shards will try again, sequencing first the transaction that was locked by the shard with the highest order.
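One possible (purely illustrative) encoding of that etiquette rule:

```python
def retry_order(aborted, shard_rank):
    """After an abort, re-sequence the conflicting transactions so the one
    locked by the highest-ranked shard goes first. `aborted` is a list of
    (locking_shard, tx_id) pairs; `shard_rank` is the implicit ordering."""
    return sorted(aborted, key=lambda pair: shard_rank(pair[0]), reverse=True)
```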


Folks - I would like to support what Mustafa is saying ...

The fundamental decision is whether the master that manages all of the locks is a smart contract that lives on one of the shards or a user.

If it is a smart contract, then its operation is guaranteed. This means that:

  • when all the slaves tell the master they have locks
  • the master will tell all the slaves to commit, and
  • the slaves will commit.

The beauty of all of this is that all parties are smart contracts, so the operation of each of them is guaranteed. So there is literally no way to get to a problem like a stale lock, unless the source code of one of the smart contracts is screwed up. This makes it much, much better than regular database transactions, where a particular database can crash, which can lead to a stalled protocol, and then you have to introduce timeouts.

In the protocol above there are no timeouts and no need for them.

If the user manages the locks, then, as Mustafa is saying, the user can simply withhold messages, because he is a bad guy or because he crashed. In this case the entire system gets stuck, and the only way to resolve this is to introduce timeouts, which are evil things for concurrent systems, because you then need to constantly think about one timeout interacting with another timeout, etc.

Also, for the system above, the only abstraction that needs to exist in the EVM is a notion of asynchronously and reliably sending messages from one party to another.

In fact, this system will be the most reliable distributed locking system in the history of the universe, because particular databases in distributed database systems can crash, while shards can't totally crash, so it will be a very useful product.

The way it needs to work is that there are two master smart contracts, M1 and M2.

  1. Both smart contracts send messages to resources A and B to acquire locks.

  2. M1 acquires lock on A and M2 acquires lock on B

  3. M1 gets a message that locking B failed, and M2 gets a message that locking A failed.

  4. Both transactions fail: M1 sends a message to A to unlock, and M2 sends a message to B to unlock.

So there is no deadlock.

To summarize, the simplest protocol is:

  1. The user sends a message to the master smart contract to initiate the transaction.

  2. The master smart contract sends a message to slave contracts on all shards to lock resources.

  3. Each slave contract either locks the resource and replies "OK" to the master, or fails to lock and replies "Sorry, already locked" to the master.

  4. If the master gets all OKs, the master tells all slaves "proceed" and they proceed with the transaction.

  5. If the master gets at least one "Sorry", then it tells all slaves "abort".

In this procedure, the message complexity is O(N), where N is the number of slaves. The master can live on any of the shards or on the main chain. Since all messages are reliable, the system will never deadlock unless one of the shards is Byzantine.
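A sketch of that five-step flow, collapsing the asynchronous messaging into synchronous calls for readability (`send` stands in for reliable cross-shard message delivery; all names are assumptions):

```python
def master(slaves, send):
    """Steps 2, 4 and 5 from the master's point of view."""
    replies = [send(slave, "LOCK") for slave in slaves]  # step 2: request locks
    decision = "proceed" if all(r == "OK" for r in replies) else "abort"
    for slave in slaves:                                 # step 4 or 5: broadcast
        send(slave, decision)
    return decision

def slave_handle(state, msg):
    """Step 3: grant the lock or report a conflict; later commit or release."""
    if msg == "LOCK":
        if state.get("locked"):
            return "Sorry, already locked"
        state["locked"] = True
        return "OK"
    if msg == "proceed":
        state["committed"] = True   # proceed with the transaction
    if msg == "abort":
        state["locked"] = False     # release the lock
```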

It's worth noting that at this point I consider yanking to be a superior alternative to all of these locking techniques. So the process becomes:

  1. User sends a message to resource A to yank it to the same shard as resource B
  2. Resource A arrives at the same shard.
  3. (i) user does an intra-shard tx involving both A and B, or (ii) someone else uses resource A in some way, including possibly yanking it to another shard, whichever comes first

There's no concept of "locked" and "unlocked"; every resource is at all times either "locked to" a particular shard (by being located inside of that shard) or "in transit" (unusable by any shard).
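A minimal sketch of that two-state view of a yankable contract (field and receipt names are illustrative):

```python
class Yankable:
    def __init__(self, shard):
        self.shard = shard       # "locked to" this shard by residing in it
        self.in_transit = False  # while True, unusable by any shard

    def yank(self, to_shard):
        """Step 1: remove the contract from its shard and emit a transit receipt."""
        assert not self.in_transit, "already in transit"
        self.in_transit = True
        return {"type": "YANK", "to": to_shard}

    def arrive(self, receipt):
        """Step 2: including the receipt on the destination shard makes the
        contract usable there (step 3), by anyone, including for another yank."""
        self.shard = receipt["to"]
        self.in_transit = False
```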

"As an example, a hotel booking contract might work by having a function, reserve(), that reserves a hotel room, instantiates a contract that represents the hotel room, and that contract then contains a move_to_shard(uint256 shard_id) function which allows anyone to yank the contract to another shard."

Interesting ... I think it can be reformulated in terms of locks.
What you mean by yanking is equivalent to creating a shared lock (a sketch follows the list below).

  1. Alice sends a message to the hotel booking contract to lock a room. The hotel booking contract creates a lock receipt.

  2. The receipt is shared, in the sense that anyone (not only Alice) can use it.

  3. Then both Alice and Bob can send the lock receipt to the train booking contract to book a train. The good thing is that even if Alice abandons the process after locking a room, Bob can use her lock receipt to either book a train or release the lock on the room back to the hotel contract. So there is no deadlock.
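A sketch of the shared (bearer) receipt idea; the receipt shape is an assumption, and the point is only that nothing in it ties redemption to Alice:

```python
def lock_room(hotel_state, room):
    """Step 1: lock the room and emit a receipt that names the room, not Alice,
    so anyone can redeem it (step 2)."""
    assert hotel_state.get(room) != "locked", "room is already locked"
    hotel_state[room] = "locked"
    return {"type": "ROOM_LOCKED", "room": room}

def book_train(train_state, receipt):
    """Step 3a: Alice, or Bob using her receipt, books the train."""
    train_state["booked_with"] = receipt["room"]

def release_room(hotel_state, receipt):
    """Step 3b: or anyone can send the receipt back to release the room."""
    hotel_state[receipt["room"]] = "free"
```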

A potential problem, though, is that since Alice performs two actions, Eve can interfere. For instance, Alice books a room, and Eve immediately uses Alice's receipt to buy a refundable train ticket, so Eve essentially does a DoS on Alice. Another problem is that some of the code needs to run in Alice's client, so developers will essentially have an additional piece to code.

A modified version, which does not have this problem, would work like this:

  1. Alice sends a message to the hotel contract with a pass-through request for a train ticket.

  2. The hotel contract locks the room, and sends a cross-shard message to the train contract to buy a ticket.

  3. The train contract either issues a ticket or rejects. It then replies to the hotel contract.

  4. When the hotel contract receives the message from the train contract, it either completes the reservation or rejects.

This is very much equivalent to yanking, imho, but does not involve users running any glue code. A good thing about this is that it involves only two messages.
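A sketch of this pass-through flow, with the hotel contract as the orchestrator so no user-side glue code is needed (`train.request_ticket()` stands in for the cross-shard message and its reply):

```python
def hotel_passthrough(hotel_state, train, room):
    """Steps 1-4: lock the room, ask the train contract, then complete or
    roll back based on its reply."""
    hotel_state[room] = "locked"        # step 2: lock the room first
    reply = train.request_ticket()      # step 2: cross-shard request
    if reply == "TICKET":               # step 3: the train contract issued one
        hotel_state[room] = "reserved"  # step 4: complete the reservation
        return "BOOKED"
    hotel_state[room] = "free"          # step 4: reject and release the lock
    return "REJECTED"
```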

In any case, any locking scheme can be implemented as a library on top of asynchronous messaging ... so the EVM may just need to implement asynchronous message send/receive calls.