After some time I have returned to this topic, looked at the paper again, and agree with eezcjkr that we have to hash the max-keys as well if insertion & deletion proofs should be supported in addition to membership & non-membership proofs.
As in many applications the keys are hashes, including them in the node hash will double the space required for proofs (1x key + 1x node hash per proof node).
So I tried to understand the intuition of the algorithm a bit better, with the goal of reducing the space requirements for insert & deletion proofs. Here is a description from another perspective:
Suppose we have a classical Sparse Merkle Tree and want to convert it into a Compact Sparse Merkle Tree. We can do this by replacing every subtree (non-leaf node) where exactly one of its children is a non-default node (non-empty node/leaf) by that non-default child, and repeating this until no more replacements are possible. This eliminates all default nodes (unless the whole tree is empty) and each leaf ends up in its corresponding subtree. The remaining tree nodes are then augmented by setting on each a key (or max-key) property to max(left.key, right.key). If I'm not mistaken, this results in the same tree as if all the nodes had been inserted via the proposed algorithm (or did I describe something different?).
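To make the compaction rule concrete, here is a small sketch that builds the compacted, max-key-augmented tree directly from a set of distinct keys (the names and the recursive formulation are my own, not from the paper):

```python
# Sketch: build the compacted tree from a set of distinct keys by
# mimicking the subtree-replacement rule described above.

KEY_BITS = 8  # small keys for illustration; 256 in practice


class Leaf:
    def __init__(self, key):
        self.key = key


class Node:
    def __init__(self, left, right):
        self.left = left
        self.right = right
        # max-key augmentation: max(left.key, right.key)
        self.key = max(left.key, right.key)


def build(keys, bit=KEY_BITS - 1):
    """Compact the conceptual sparse tree over `keys`.

    Splitting on `bit` mirrors one level of the classical SMT; a side
    with no keys is a default subtree, so the other side replaces the
    node (the subtree-replacement rule).
    """
    if len(keys) == 1:
        return Leaf(keys[0])
    left = [k for k in keys if not (k >> bit) & 1]
    right = [k for k in keys if (k >> bit) & 1]
    if not left:
        return build(right, bit - 1)  # replace node by its non-default child
    if not right:
        return build(left, bit - 1)
    return Node(build(left, bit - 1), build(right, bit - 1))
```

For example, `build([0b0010, 0b0110, 0b1010])` produces a root whose max-key is `0b1010`, with the two keys sharing the `0b0..0` prefix grouped in the left subtree.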
The alternative construction I came up with works as follows. Starting with the classical Sparse Merkle Tree, we augment all non-leaf nodes with a property called height, which is the height of the subtree it is the root of (height(leaf) = 0, height(node) = height(node.left)+1 = height(node.right)+1). Now we do the same compaction as before, but we no longer need to add the max-key property.
Note that we only need n bits to encode the height of a tree containing keys with 2^n bits (the extra information is much smaller: 1 byte for 32-byte keys).
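The same construction sketch as above, adapted to the height-augmented variant (again my own names, not the paper's): interior nodes keep only the height they had in the uncompacted Sparse Merkle Tree, and no max-key is stored.

```python
# Sketch: height-augmented compaction. An interior node at height h in
# the uncompacted SMT splits its keys on bit h-1; compaction skips the
# heights where one side would be a default subtree.

KEY_BITS = 8  # small keys for illustration; 256 in practice


class HLeaf:
    height = 0

    def __init__(self, key):
        self.key = key


class HNode:
    def __init__(self, height, left, right):
        self.height = height  # height of this subtree before compaction
        self.left = left
        self.right = right


def build_h(keys, height=KEY_BITS):
    """Compact the conceptual SMT over distinct `keys`, keeping heights."""
    if len(keys) == 1:
        return HLeaf(keys[0])
    bit = height - 1  # a node at height h splits on bit h-1
    left = [k for k in keys if not (k >> bit) & 1]
    right = [k for k in keys if (k >> bit) & 1]
    if not left:
        return build_h(right, height - 1)  # drop node with a default child
    if not right:
        return build_h(left, height - 1)
    return HNode(height, build_h(left, height - 1), build_h(right, height - 1))
```

For `build_h([0b0010, 0b0110, 0b1010])` the root ends up at height 4 (the keys first differ at bit 3), illustrating how the stored heights skip the sparse levels.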
To traverse this tree, at each node we look at the bit at position n in the key we are looking for (counting from the LSB), where n is the height-1 of the current node. If that bit is 0 we go left, and if it is 1 we go right (similar to a classical Sparse Merkle Tree, except that we can skip sparse parts, but have to record heights as a trade-off). If we reach a leaf, we check whether lookup key == leaf.key.
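The traversal can be sketched in a few lines (node shape as described above: leaves carry `.key`, interior nodes carry `.height`, `.left`, `.right`; the hand-built example tree is mine):

```python
# Sketch of the height-based traversal: at each interior node, test bit
# height-1 of the lookup key (counting from the LSB).

class HLeaf:
    def __init__(self, key):
        self.key = key


class HNode:
    def __init__(self, height, left, right):
        self.height, self.left, self.right = height, left, right


def contains(root, key):
    node = root
    while isinstance(node, HNode):
        bit = node.height - 1  # n = height - 1
        node = node.right if (key >> bit) & 1 else node.left
    return node.key == key  # reached a leaf: compare keys


# tiny hand-built example holding the keys 0b0010, 0b0110, 0b1010
tree = HNode(4, HNode(3, HLeaf(0b0010), HLeaf(0b0110)), HLeaf(0b1010))
```

A lookup for an absent key still ends at some leaf (e.g. `contains(tree, 0b1110)` walks right to the leaf `0b1010`), and the final key comparison turns that into a non-membership answer.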
I claim that these two are equivalent, because the decision to go left or right depends on a single bit in the lookup key, whose position is the height-1 of the node and is derivable from the left & right max-keys:
(Bit patterns of keys; || = concatenation, x = shared prefix of l & r)
l = node.left = x||0||z (x & z sequences of bits)
r = node.right = x||1||y (x & y sequences of bits)
k = lookup key = a||b||c (a & c sequences of bits, b single bit)
distance(l,k) = bxor(x,a)||b||bxor(z,c)
distance(r,k) = bxor(x,a)||not(b)||bxor(y,c)
Note: both start with bxor(x,a), and thus this part has no influence on which number is bigger. The log was omitted, as it also has no influence on which number is bigger. It follows that:
if b > not(b) then bxor(x,a)||b||bxor(z,c) > bxor(x,a)||not(b)||bxor(y,c)
if b < not(b) then bxor(x,a)||b||bxor(z,c) < bxor(x,a)||not(b)||bxor(y,c)
Meaning the decision to go left or right depends only on the bit b, and the position of bit b can be derived from l & r: to be more specific, it is distance(l,r) rounded down, i.e. floor(log2(bxor(l,r))), which is the position of the first non-shared bit, which in turn is the height-1 of the node (as in a Sparse Merkle Tree each bit of a key corresponds to one level in the tree).
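As a sanity check of this equivalence claim, here is a small exhaustive test over 4-bit keys, comparing the max-key-distance rule (as I understand it from the paper, with the log omitted so raw xors are compared) against the height/bit rule:

```python
from itertools import product


def go_right_by_distance(l, r, k):
    # max-key traversal: descend toward the child key closer to k;
    # comparing raw xors is equivalent to comparing log2-distances,
    # since log is monotone
    return (r ^ k) < (l ^ k)


def go_right_by_height(l, r, k):
    # height-based traversal: test the first bit where l and r differ,
    # i.e. bit floor(log2(bxor(l, r))) = height - 1
    bit = (l ^ r).bit_length() - 1
    return bool((k >> bit) & 1)


# exhaustive check over all 4-bit key triples with l != r
for l, r, k in product(range(16), repeat=3):
    if l == r:
        continue
    bit = (l ^ r).bit_length() - 1
    if (l >> bit) & 1:  # orient so l carries the 0 at the first differing bit
        l, r = r, l
    assert go_right_by_distance(l, r, k) == go_right_by_height(l, r, k)
```

The orientation step mirrors the derivation above (l = x||0||z, r = x||1||y); with that in place the two rules agree on every triple.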