Blob serialisation

For tests in Go, see: https://github.com/prysmaticlabs/geth-sharding/blob/master/sharding/utils/marshal_test.go

I think it would be good to look into using protobuf, even though we have already written our own implementations, and to benchmark it against them. https://developers.google.com/protocol-buffers/docs/overview. It can be used from Python, Go, and Rust, among others.

I would like to hear more about the background on why Ethereum requires its own encoding strategy rather than protocol buffers or JSON.

I would love to use protobufs (note: I am extremely biased on this.)

JSON is preferable for compatibility, readability and editability, while Protobuf is preferable for performance.

@prestonvanloon @jamesra1
Since third-party formats are being discussed, it might be worth having a look at CBOR (http://cbor.io/) and its schema friend CDDL. CBOR is being considered for the libp2p stack as the go-to encoding of datagram packets (https://github.com/multiformats/multigram), potentially in competition with bencode.

We are starting to explore protobufs: https://github.com/prysmaticlabs/geth-sharding/issues/150

We’ll post results here once we have determined whether protobufs are a viable solution for data serialization in sharding.

@arnetheduck

Can you guarantee that encoding/decoding to/from any language with protobuf, Thrift, CBOR, avro or another mechanism will not result in inconsistencies, e.g. due to different encodings and different hashes?—https://gitter.im/ethereum/sharding?at=5b0f73f2a7abc8692ef9fd9a

The alternative to RLP would have been using an existing algorithm such as protobuf or BSON; however, we prefer RLP because of (1) simplicity of implementation, and (2) guaranteed absolute byte-perfect consistency. Key/value maps in many languages don’t have an explicit ordering, and floating point formats have many special cases, potentially leading to the same data leading to different encodings and thus different hashes. By developing a protocol in-house we can be assured that it is designed with these goals in mind (this is a general principle that applies also to other parts of the code, eg. the VM). Note that bencode, used by BitTorrent, may have provided a passable alternative for RLP, although its use of decimal encoding for lengths makes it slightly suboptimal compared to the binary RLP.—https://github.com/ethereum/wiki/wiki/Design-Rationale#rlp
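To make the ordering point concrete, here is a minimal sketch in Python. JSON merely stands in for any format that serializes maps in whatever order the language hands them over; the field names `shard` and `period` and the helpers `digest` and `canonical` are made up for illustration and are not part of any proposal in this thread.

```python
import hashlib
import json

# Two dicts holding the same key/value data, built in a different order.
# (Field names are hypothetical, chosen only for this illustration.)
a = {"shard": 1, "period": 42}
b = {"period": 42, "shard": 1}


def digest(data: bytes) -> str:
    """SHA-256 hex digest of the serialized bytes."""
    return hashlib.sha256(data).hexdigest()


# Naive encoding follows insertion order, so the bytes and hashes differ
# even though the two dicts compare equal.
naive_a = json.dumps(a).encode()
naive_b = json.dumps(b).encode()


def canonical(obj) -> bytes:
    """One possible canonical form: sorted keys and fixed separators."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode()


print(a == b)                                        # True
print(digest(naive_a) == digest(naive_b))            # False
print(digest(canonical(a)) == digest(canonical(b)))  # True
```

Sorting keys only addresses the map-ordering issue. It does nothing about the floating-point special cases mentioned in the rationale, so it should not be read as a complete canonicalization scheme.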

Since it’s being considered as a candidate for multigram and libp2p, I’ll have another look at it.

There is no requirement that all data formats be uniquely
encoded; that is, it is acceptable that the number “7” might
be encoded in multiple different ways. — https://tools.ietf.org/html/rfc7049#section-1.1
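As a small illustration of what that sentence means for hashing, the sketch below decodes CBOR unsigned integers (major type 0) by hand, following the initial-byte layout in RFC 7049. The function name `decode_cbor_uint` is my own, and this is a deliberately lenient toy decoder, not a full CBOR implementation.

```python
import hashlib


def decode_cbor_uint(data: bytes) -> int:
    """Decode a single CBOR unsigned integer (major type 0).

    Lenient on purpose: it accepts non-shortest encodings.
    """
    initial = data[0]
    major, info = initial >> 5, initial & 0x1F
    if major != 0:
        raise ValueError("not an unsigned integer")
    if info < 24:
        return info  # value stored directly in the initial byte
    sizes = {24: 1, 25: 2, 26: 4, 27: 8}  # number of following bytes
    if info not in sizes:
        raise ValueError("unsupported additional information")
    n = sizes[info]
    if len(data) < 1 + n:
        raise ValueError("truncated input")
    return int.from_bytes(data[1:1 + n], "big")


# Three different byte strings that all decode to the number 7.
encodings = [bytes([0x07]), bytes([0x18, 0x07]), bytes([0x19, 0x00, 0x07])]
values = [decode_cbor_uint(e) for e in encodings]
hashes = {hashlib.sha256(e).hexdigest() for e in encodings}

print(values)       # [7, 7, 7]
print(len(hashes))  # 3
```

RFC 7049 does describe a canonical form that requires the shortest encoding, but every implementation would have to enforce it when decoding, for example by rejecting the second and third byte strings above, before hashes of the encoded data could be relied on.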

If we develop a custom mechanism, I think it should use JSON for compatibility, and also be consistent in encoding and decoding any input to the same output, irrespective of language.


I suppose the discussion should continue here: Discussion: P2P message serialization standard

Interesting replacement for RLP, considering there’s LOTS of 32-byte data in current applications because that’s the “native int width” of EVM; and blob serialization would essentially waste half of the space when encoding them. It’s like the pessimum choice for the most common use case.
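A rough back-of-the-envelope check of that claim, as a sketch. It assumes the blob format uses 32-byte chunks in which one byte is an indicator and 31 bytes carry data; that layout is my reading of the proposal and is not spelled out in this thread, so treat `CHUNK_SIZE` and `DATA_PER_CHUNK` as assumptions. The RLP byte-string rule is the standard one.

```python
def rlp_encode_bytes(payload: bytes) -> bytes:
    """RLP-encode a single byte string (no lists)."""
    if len(payload) == 1 and payload[0] < 0x80:
        return payload  # a single small byte is its own encoding
    if len(payload) <= 55:
        return bytes([0x80 + len(payload)]) + payload
    length = len(payload).to_bytes((len(payload).bit_length() + 7) // 8, "big")
    return bytes([0xB7 + len(length)]) + length + payload


CHUNK_SIZE = 32      # assumption: blob chunks are 32 bytes
DATA_PER_CHUNK = 31  # assumption: 1 indicator byte + 31 data bytes per chunk


def chunked_size(payload_len: int) -> int:
    """Bytes occupied when the payload is split across whole chunks."""
    chunks = -(-payload_len // DATA_PER_CHUNK)  # ceiling division
    return chunks * CHUNK_SIZE


word = bytes(range(32))  # stand-in for one 32-byte EVM word

print(len(rlp_encode_bytes(word)))  # 33
print(chunked_size(len(word)))      # 64
```

Under those assumptions a single 32-byte word spills into a second chunk and occupies 64 bytes, against 33 bytes in RLP. Whether that matters depends on whether individual words are ever serialized as separate blobs, as opposed to being packed back to back.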

Maybe I’m mixing up the layers of the technology stack here, and this isn’t really relevant for the EVM and what it’s going to look like in ETH2?