Blob serialisation

That would be done by the length bits in the first byte of a chunk, no? If they are non-zero, the value of the length bits gives the number of data bytes in the chunk’s 31-byte payload, e.g. a value of 10 means 10 bytes. The rest of the bytes are ignored.

Right, but aren’t you proposing to remove that and have a single indicator byte for each blob to explain how many chunks it spans and whether or not it should be executed via the EVM? Or what is the optimization?

I edited my comment above, let me know if it’s still not clear.

Any comments, anyone? @prestonvanloon

When I wrote “The first blob flag is a SKIP_EVM flag” I meant that there’s only one SKIP_EVM flag for the whole blob (namely the first flag of the first chunk, which happens to be the first blob flag). The flags in the other chunks are ignored by the parser and can be set arbitrarily, e.g. to squeeze in extra data.
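
To make the layout concrete, here is a minimal Rust sketch of the chunk format as described above. The constants and bit positions are my reading of this thread (5 length bits, 3 flag bits, top bit used as the first flag), not a normative spec:

```rust
const CHUNK_SIZE: usize = 32;
const CHUNK_DATA_SIZE: usize = 31;

/// Serialize one blob into 32-byte chunks: a 1-byte indicator followed
/// by 31 data bytes. Only the first chunk's first flag bit (SKIP_EVM,
/// per the proposal above) is meaningful; the rest are left zero here.
fn serialize_blob(blob: &[u8], skip_evm: bool) -> Vec<u8> {
    let mut out = Vec::new();
    let mut chunks = blob.chunks(CHUNK_DATA_SIZE).peekable();
    let mut first = true;
    while let Some(data) = chunks.next() {
        let terminal = chunks.peek().is_none();
        // Zero length bits mark a non-terminal chunk (all 31 bytes are
        // data); a terminal chunk records its data byte count (1..=31).
        let mut indicator: u8 = if terminal { data.len() as u8 } else { 0 };
        if first && skip_evm {
            indicator |= 0b1000_0000; // first flag of the first chunk
        }
        out.push(indicator);
        out.extend_from_slice(data);
        out.resize(out.len() + CHUNK_DATA_SIZE - data.len(), 0); // zero-pad
        first = false;
    }
    debug_assert_eq!(out.len() % CHUNK_SIZE, 0);
    out
}

fn main() {
    // A 40-byte blob spans two chunks: 31 bytes, then 9 bytes (terminal).
    let encoded = serialize_blob(&[0xab; 40], true);
    assert_eq!(encoded.len(), 2 * CHUNK_SIZE);
    assert_eq!(encoded[0], 0b1000_0000); // first chunk: SKIP_EVM set, length 0
    assert_eq!(encoded[CHUNK_SIZE], 9);  // terminal chunk: 9 data bytes
}
```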

BTW, I’m now not sure the SKIP_EVM flag is a good idea, because it complicates the anti-replay logic needed to prevent non-EVM transactions from being executed as EVM transactions and vice versa.

Good point, didn’t think of that! If we don’t use a SKIP_EVM flag I’ll need to rewrite https://github.com/Drops-of-Diamond/diamond_drops/pull/67.

I’ve just been writing up a way to serialize collation bodies that correspond to the same blob into a BlobBodies struct. However, I now see that this struct would need to contain a blob hash, so when a blob is serialized into chunks its contents should also be hashed, and that hash should then be included in each collation body. A problem with that is that there may be more than one blob in a collation body (though not in a chunk). Additionally, it adds a verification overhead to check that each Body in BlobBodies has the same blob hash. So there would need to be a solution for that, and due to the Verifier’s Dilemma this would be infeasible to do on the blockchain (at least as it currently stands), but it could be done off-chain with Truebit. Of course, the fallback is that there is simply no way to put blobs bigger than a megabyte on the blockchain, which limits the usability of Ethereum, particularly for big data.
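
For concreteness, this is roughly the shape I have in mind (a sketch only; the field names and hash type are illustrative, not finalized code):

```rust
/// One collation body's share of a blob: it commits to the hash of the
/// full blob it belongs to, plus the serialized chunks it carries.
struct Body {
    blob_hash: [u8; 32],
    chunks: Vec<[u8; 32]>,
}

/// All the collation bodies that together make up a single blob.
struct BlobBodies {
    blob_hash: [u8; 32],
    bodies: Vec<Body>,
}

impl BlobBodies {
    /// The verification overhead mentioned above: every Body must
    /// commit to the same blob hash as the containing struct.
    fn verify(&self) -> bool {
        self.bodies.iter().all(|b| b.blob_hash == self.blob_hash)
    }
}
```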

Blob serialization release: https://github.com/Drops-of-Diamond/diamond_drops/releases/tag/v0.3.0-a

Are there any test cases that client implementations can use to conform to the specification? E.g. the Ethereum tests for RLP: https://github.com/ethereum/tests/tree/develop/RLPTests @JustinDrake @vbuterin

@tim you probably already know, but there are lots of unit tests for blob serialization in our repo here. Just clone the repo, install cargo, cd into node, and run cargo test node, or run all the tests as described in the readme. But having a common test suite for sharding would be good. We could potentially use Rust for that, either directly via bindings with Rust’s FFI, or using the C ABI. Converting to Wasm and then to JS is another possibility that has apparently been made rather convenient.
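
As a rough sketch of the C ABI route (the function name and signature here are hypothetical, not from our repo):

```rust
// Build with crate-type = ["cdylib"] in Cargo.toml to get a shared
// library that other languages can link against.
#[no_mangle]
pub extern "C" fn blob_chunk_count(blob_len: usize) -> usize {
    // One 32-byte chunk per 31 bytes of blob data, rounded up.
    (blob_len + 30) / 31
}
```

A shared test harness in Python, say, could then load the compiled cdylib via ctypes and check the same vectors against every client.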

For test data also see https://github.com/ethereum/py-evm/blob/master/tests/core/test_blob_utils.py#L33

There are no standardized JSON tests for this yet.

For tests in Go, see: https://github.com/prysmaticlabs/geth-sharding/blob/master/sharding/utils/marshal_test.go

I think it would be good to look into using protobuf, even though we have already written implementations, and to benchmark it against them. https://developers.google.com/protocol-buffers/docs/overview. It can be used with Python, Go, and Rust, among others.
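
For example, with Rust’s prost crate the round-trip is only a few lines (a sketch assuming prost; the message shape is hypothetical, not from any spec):

```rust
use prost::Message;

// Hypothetical message for a collation body; field names and tags are
// illustrative only.
#[derive(Clone, PartialEq, Message)]
pub struct CollationBody {
    #[prost(bytes = "vec", tag = "1")]
    pub blob: Vec<u8>,
    #[prost(uint64, tag = "2")]
    pub shard_id: u64,
}

fn main() {
    let body = CollationBody {
        blob: vec![0xde, 0xad, 0xbe, 0xef],
        shard_id: 3,
    };
    // Encode to the protobuf wire format and decode it back.
    let encoded = body.encode_to_vec();
    let decoded = CollationBody::decode(encoded.as_slice()).expect("round-trip");
    assert_eq!(body, decoded);
}
```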

I would like to hear more about the background on why Ethereum requires its own encoding strategy rather than protocol buffers or JSON.

I would love to use protobufs (note: I am extremely biased on this.)

JSON is preferable for compatibility, readability, and editability, while Protobuf is preferable for performance.

@prestonvanloon @jamesra1
Since third-party formats are being discussed, it might be worth having a look at CBOR (http://cbor.io/) and its companion schema language CDDL. CBOR is being considered for the libp2p stack as the go-to encoding for datagram packages (https://github.com/multiformats/multigram), potentially in competition with bencode.

We are starting to explore protobufs: https://github.com/prysmaticlabs/geth-sharding/issues/150

We’ll post results here after we can determine if protobufs are a viable solution to data serialization for sharding.

@arnetheduck

Can you guarantee that encoding/decoding to/from any language with protobuf, Thrift, CBOR, avro or another mechanism will not result in inconsistencies, e.g. due to different encodings and different hashes?—https://gitter.im/ethereum/sharding?at=5b0f73f2a7abc8692ef9fd9a

The alternative to RLP would have been using an existing algorithm such as protobuf or BSON; however, we prefer RLP because of (1) simplicity of implementation, and (2) guaranteed absolute byte-perfect consistency. Key/value maps in many languages don’t have an explicit ordering, and floating point formats have many special cases, potentially leading to the same data leading to different encodings and thus different hashes. By developing a protocol in-house we can be assured that it is designed with these goals in mind (this is a general principle that applies also to other parts of the code, eg. the VM). Note that bencode, used by BitTorrent, may have provided a passable alternative for RLP, although its use of decimal encoding for lengths makes it slightly suboptimal compared to the binary RLP.—https://github.com/ethereum/wiki/wiki/Design-Rationale#rlp

Since CBOR is being considered as a candidate for multigram and libp2p, I’ll have another look at it.

There is no requirement that all data formats be uniquely encoded; that is, it is acceptable that the number “7” might be encoded in multiple different ways. — https://tools.ietf.org/html/rfc7049#section-1.1

If we develop a custom mechanism, I think it should use JSON for compatibility, and it should also deterministically encode any given input to the same output, irrespective of language.
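
To illustrate why determinism is the hard part, here is a small Rust sketch of the map-ordering pitfall the RLP rationale mentions (field names are illustrative only):

```rust
use std::collections::HashMap;

fn main() {
    let mut fields = HashMap::new();
    fields.insert("to", "0xabc");
    fields.insert("value", "7");
    fields.insert("nonce", "0");

    // Naive encoding by iteration order: HashMap order depends on a
    // per-process random seed, so the same logical data can serialize
    // differently across runs and implementations, and therefore hash
    // differently.
    let naive: String = fields
        .iter()
        .map(|(k, v)| format!("{}={};", k, v))
        .collect();

    // A canonical encoding must fix the order explicitly (here, by
    // sorting keys); RLP sidesteps the problem entirely by having no
    // map or float types at all.
    let mut pairs: Vec<_> = fields.iter().collect();
    pairs.sort();
    let canonical: String = pairs
        .iter()
        .map(|(k, v)| format!("{}={};", k, v))
        .collect();

    println!("naive:     {naive}");
    println!("canonical: {canonical}");
}
```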

I suppose the discussion should continue here: Discussion: P2P message serialization standard

Interesting replacement for RLP, considering there’s LOTS of 32-byte data in current applications because that’s the “native int width” of the EVM; blob serialization would essentially waste half of the space when encoding such values, since a 32-byte word doesn’t fit in a chunk’s 31 data bytes and so spills into a second chunk, putting 64 bytes on the wire for 32 bytes of payload. It’s close to the pessimal choice for the most common use case.

Maybe I’m mixing up the layers of the technology stack here, and this isn’t really relevant for the EVM and how it’s going to look in ETH2?