Blob serialisation


#1

TLDR: We propose a serialisation scheme for blobs in collation bodies. Credits to @prestonvanloon for suggesting the main construction and for providing the great illustration.

Construction

Collation bodies (of size 2^n bytes for some n ≥ 5) are partitioned into 32-byte chunks. The first byte of every chunk is the “indicator byte”, and the other 31 bytes are “data bytes”. We call the 5 least significant bits of indicator bytes “length bits” with value ranging from 0 to 31.

Blob data is partitioned across data bytes from left to right, with every chunk holding data for at most one blob. If the length bits are zero then the corresponding chunk is “non-terminal”, with all 31 data bytes holding blob data. Otherwise the chunk is “terminal” marking the end of the current blob, with the data bytes holding as many blob bytes (packed to the left) as specified by the length bits.
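
As a rough sketch of this construction, here is a minimal Python serialiser (the function name, argument shapes and zeroed blob flags are our own choices, not part of the proposal):

    CHUNK_SIZE = 32
    DATA_BYTES = 31  # data bytes per chunk

    def serialise(blobs, body_size):
        # Pack blobs into 32-byte chunks: 1 indicator byte + 31 data bytes.
        body = bytearray()
        for blob in blobs:
            assert len(blob) >= 1
            i = 0
            # Non-terminal chunks: length bits zero, all 31 data bytes used.
            while len(blob) - i > DATA_BYTES:
                body += bytes([0]) + blob[i:i + DATA_BYTES]
                i += DATA_BYTES
            # Terminal chunk: length bits (1..31) count the remaining blob
            # bytes, packed to the left; the 3 blob flags are left at zero.
            rest = blob[i:]
            body += bytes([len(rest)]) + rest.ljust(DATA_BYTES, b"\x00")
        assert body_size % CHUNK_SIZE == 0 and len(body) <= body_size
        # Chunks after the last terminal chunk are ignored by the parser,
        # so padding with zero bytes is safe.
        return bytes(body).ljust(body_size, b"\x00")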

The illustration below shows a 4-chunk collation body serialising two blobs. The first (in blue) has length 32 and the second (in orange) has length 61. The white bytes are the indicator bytes, and the grey bytes are ignored.
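
In text form, the pictured layout follows directly from the rules above (indicator bytes shown in binary, blue being the first blob and orange the second):

    chunk 1: 00000000 | 31 blue bytes                  (non-terminal)
    chunk 2: 00000001 | 1 blue byte, 30 ignored bytes  (terminal, length bits = 1)
    chunk 3: 00000000 | 31 orange bytes                (non-terminal)
    chunk 4: 00011110 | 30 orange bytes, 1 ignored     (terminal, length bits = 30)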

For the purpose of blob delimitation the blob parser ignores:

  • Data bytes of terminal chunks not holding blob data
  • Chunks after the last terminal chunk
  • The 3 most significant bits of indicator bytes

The 3 most significant bits of the indicator bytes of terminal chunks are 3 blob flags. The first blob flag is a SKIP_EVM flag to avoid execution of the blob by the default EVM. The other two flags are reserved for future use.

The default EVM charges gas for blob data proportionally to the number of chunks used.
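
A matching parser sketch, with the ignoring rules and flag extraction made explicit (again illustrative, not normative; the flag bit positions are an assumption):

    def parse(body):
        # Returns a list of (flags, blob) pairs; never throws.
        blobs, current = [], bytearray()
        for pos in range(0, len(body), CHUNK_SIZE):
            indicator = body[pos]
            data = body[pos + 1:pos + CHUNK_SIZE]
            length_bits = indicator & 0b00011111    # 5 least significant bits
            if length_bits == 0:                    # non-terminal chunk
                current += data                     # 3 MSB of indicator ignored
            else:                                   # terminal chunk
                current += data[:length_bits]       # trailing data bytes ignored
                flags = indicator >> 5              # 3 blob flags (first: SKIP_EVM)
                blobs.append((flags, bytes(current)))
                current = bytearray()
        return blobs  # chunks after the last terminal chunk are dropped

Under these sketches a blob of length L occupies ceil(L/31) chunks, which is the quantity the default EVM would meter.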

Remarks

  • The parser never throws.
  • All blobs are terminated, and have 3 flags.
  • Blobs are at least 1 byte and at most 31*2^{n-5} bytes long.
  • Data ignored by the parser can be set arbitrarily, e.g. to squeeze in extra data.
  • 32-byte hashes in blobs can be truncated to 31 bytes for packing into a single chunk, and witnessing with a single Merkle path.
  • The terminal chunk of a blob whose length is a multiple of 31 bytes can hold no blob data.
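
A quick round-trip check of the two sketches above also illustrates the length bound: a 2^n-byte body has 2^{n-5} chunks, each holding at most 31 blob bytes:

    n = 10                                # 1024-byte collation body, 32 chunks
    max_blob = 31 * 2 ** (n - 5)          # = 992 bytes
    body = serialise([b"\xaa" * max_blob], 2 ** n)
    [(flags, blob)] = parse(body)
    assert flags == 0 and blob == b"\xaa" * max_blob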

#2

In the diagram, the first byte being 00011111 is a valid way of saying “the next 31 bytes in this chunk are part of the blob, but it’s terminal”, correct?

I don’t really see any reason why not to do that.


#3

Correct.

Are you asking in regards to the remark “The terminal chunk of a blob whose length is a multiple of 31 bytes can hold no blob data.”? If so, I was just pointing out an edge case/gotcha for developers where there are two ways of serialising the same blob.


#4

What advantages does this scheme have over RLP encoding all blobs and concatenating them (RLP(blob) ++ RLP(blob) ++ ...) or, alternatively, RLP([blob, blob, ...])? RLP encoding seems simpler, especially considering that the blobs themselves will still need to be encoded in some way and RLP seems to be the obvious choice for that. Some differences I can see:

  • Chunks may contain data from multiple blobs. Is there a problem with that? I guess verifying
    Merkle proofs of inclusion gets slightly more complicated, but that shouldn’t be a big deal. To avoid it one could pad with zeros (so RLP(blob) ++ padding ++ ... or RLP([blob, padding, blob, padding, ...])).
  • No space wasted for indicator bytes or post-terminal data.
  • No support for blob flags. We could introduce them by replacing RLP(blob) with RLP([flags, blob]), with the advantage of allowing for an arbitrary number of flags (three seems quite low to me if there will be multiple execution engines)
  • There can be invalid collations. Execution engines could treat those simply as empty collations.

#5

This encoding format preserves the properties that:

  • Everything is a valid encoding of something (needed to cleanly separate questions of availability and validity)
  • You can prove existence of a blob in the blob list by only providing Merkle branches to the blob contents

Justin’s scheme satisfies both properties. RLP satisfies neither. For a particularly egregious example of RLP internal ambiguity, consider something like:

0xb0afaeadacabaaa9a8a7a6a5a4a3a2a1a09f.......83828180

Any suffix of that RLP data is itself valid RLP, so a Merkle proof of membership of a supposed sub-chunk doesn’t actually validate that the chunk is part of the full data.
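
The bytes in the example descend from 0xb0 to 0x80, so each byte 0x80+k, read as a short-string prefix, is followed by exactly k bytes. A small self-contained check of the suffix claim (our own validator, covering only the short-string case that occurs here):

    def is_one_rlp_string(data):
        # True iff `data` is exactly one RLP-encoded byte string.
        if not data:
            return False
        prefix = data[0]
        if prefix < 0x80:       # a single byte below 0x80 encodes itself
            return len(data) == 1
        if prefix <= 0xb7:      # short string: prefix = 0x80 + length
            return len(data) == 1 + (prefix - 0x80)
        return False            # longer forms don't occur in this example

    payload = bytes(range(0xb0, 0x7f, -1))  # 0xb0, 0xaf, ..., 0x81, 0x80
    assert all(is_one_rlp_string(payload[i:]) for i in range(len(payload)))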


#6

How exactly would you compress a pseudo-random 32-byte hash into 31 bytes?


#7

Truncated, not compressed. Just kick off the first byte.
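
Concretely, the truncation is a one-liner (sketch; sha256 here is an arbitrary stand-in for whichever 32-byte hash is being packed):

    from hashlib import sha256

    h32 = sha256(b"some blob reference").digest()  # 32-byte hash
    h31 = h32[1:]                                  # kick off the first byte
    assert len(h31) == 31                          # fits one chunk's data bytes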