Torrents and EIP-4444

parithosh · June 12, 2024, 9:35am

Introduction

EIP-4444 aims to limit the historical data that Ethereum nodes need to store. This EIP has two main problems that require solutions: a format for history archival and methods to reliably retrieve history. The client teams have agreed on a common era file format, solving one half of the problem. The second half, i.e. a method to reliably retrieve history, will likely not rely on a single solution. Some client teams may rely on the Portal network, some on torrents, and others on some form of snapshot storage.

Torrents for EIP-4444

Torrents offer us a unique way to distribute this history. Torrents as a technology have existed since 2001 and have withstood the test of time. Some client teams, such as Erigon, already include a method to sync via torrents that has run in production systems.

To make progress on the torrent approach to history retrieval, the files themselves are required first, so an era file export was made from a geth node running version v1.14.3. To explore the initial idea, pre-merge data was chosen as the target. The merge occurred at block height 15537393, meaning all pre-merge data could be archived by choosing a range of 0 to block 15537393. The era files were then created using the command geth --datadir=/data export-history /data/erafiles 0 15537393.

Once the era files were created, they were verified using the command era verify roots.txt, with the source of the roots.txt file being this. The entire process has been outlined in this PR comment. The verification output was found to be this log message: Verifying Era1 files verified=1896, elapsed=5h21m49.184s

The output era files were then uploaded onto a server and a torrent was created using the software mktorrent. An updated list of trackers was found using the github repo trackerslist. The trackers chosen were a mix of http/https/udp in order to allow for maximal compatibility. The chunk size of the torrent was chosen to be 64MB, which was the max allowed and recommended value for a torrent of this size.
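As a rough illustration of what the 64MB chunk size means for the torrent's metadata, the sketch below estimates the number of pieces and the size of the piece-hash list. It assumes a BitTorrent v1 torrent (one 20-byte SHA-1 digest per piece), reads "64MB" as 64 MiB and "427GB" as decimal gigabytes; the function name is only illustrative.

```python
PIECE_SIZE = 64 * 1024 * 1024   # 64 MiB, i.e. 2**26 bytes (assumed reading of "64MB")
TOTAL_SIZE = 427 * 1000**3      # assumed reading of "427GB" as decimal gigabytes
SHA1_DIGEST_SIZE = 20           # BitTorrent v1 stores one SHA-1 digest per piece


def piece_count(total_bytes, piece_bytes):
    """Number of pieces needed to cover total_bytes (the last piece may be short)."""
    return -(-total_bytes // piece_bytes)  # integer ceiling division


pieces = piece_count(TOTAL_SIZE, PIECE_SIZE)
hash_bytes = pieces * SHA1_DIGEST_SIZE
print(f"{pieces} pieces, about {hash_bytes / 1024:.0f} KiB of piece hashes")
```

Larger pieces keep the .torrent file small, at the cost of coarser-grained verification: a single corrupted byte forces a full piece to be fetched again.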

The result of this process is now a torrent of size 427GB. This torrent can be imported with this magnet link and a torrent client would be able to pull the entire pre-merge history as era files.

Tradeoffs

There are of course some tradeoffs with torrents, as with many of the other EIP-4444 approaches:

Torrents rely on a robust set of peers to share the data. There is, however, no way to incentivise or ensure that this data is served by peers.
A torrent client would need to be included in the client releases, and some client languages might not have a torrent library.
Torrents would de facto expect the nodes to also seed the content they leech. This would increase node network requirements if they choose to store history.
The JSON-RPC response needs to take into account that the node may not have the data to return a response, in case the user decides not to download pre-merge data.
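The last tradeoff could look roughly like the sketch below: a handler that returns an explicit JSON-RPC error when a pre-merge block is requested on a node that has not fetched that history. The function, its parameters, the error message, and the choice of error code are all hypothetical; the code is simply taken from the range that JSON-RPC 2.0 reserves for implementation-defined server errors, not from any agreed standard.

```python
import json

MERGE_BLOCK = 15537393        # last block of the pre-merge range used in the post
HISTORY_PRUNED_CODE = -32000  # hypothetical: from JSON-RPC's -32000..-32099 server-error range


def get_block_response(request_id, block_number, has_premerge_data, lookup):
    """Build a JSON-RPC 2.0 response, failing explicitly when history is absent."""
    if block_number <= MERGE_BLOCK and not has_premerge_data:
        return {
            "jsonrpc": "2.0",
            "id": request_id,
            "error": {
                "code": HISTORY_PRUNED_CODE,
                "message": "pre-merge history not available on this node",
            },
        }
    return {"jsonrpc": "2.0", "id": request_id, "result": lookup(block_number)}


# Usage with a stand-in lookup function instead of a real database
resp = get_block_response(1, 100, has_premerge_data=False,
                          lookup=lambda n: {"number": hex(n)})
print(json.dumps(resp, indent=2))
```

An explicit error lets callers distinguish "this node does not hold that history" from "this block does not exist".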

Conclusion

A client could potentially include this torrent into their releases and avoid syncing pre-merge data by default, which could then be fetched via torrent if a user requests it (perhaps with a flag similar to --preMergeData=True). The client could also hardcode the hash of the expected data, ensuring that the data retrieved matches what they expect.
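A minimal sketch of the "hardcode the hash" idea follows. Which hash a client would actually pin is not settled in the post, so this uses a plain SHA-256 over a single file purely as a stand-in; the function names and the demo file are hypothetical.

```python
import hashlib
import hmac
import os
import tempfile


def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large era files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_pinned(path, pinned_hex):
    """Compare the file's digest with the value shipped in the client release."""
    return hmac.compare_digest(sha256_file(path), pinned_hex.lower())


# Demo with a small temporary file standing in for downloaded data
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"stand-in era file contents")
    demo_path = tmp.name

PINNED = hashlib.sha256(b"stand-in era file contents").hexdigest()
ok = matches_pinned(demo_path, PINNED)
bad = matches_pinned(demo_path, "00" * 32)
os.remove(demo_path)
print(ok, bad)
```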

Instructions for re-creating torrent:

Sync a geth node using the latest release
Stop the geth node and run geth --datadir=/data export-history /data/erafiles 0 15537393 to export the data into a folder called /data/erafiles (Warning: this will use ~427GB of additional space)
Use the mktorrent tool or the rutorrent GUI to create a torrent. Choose the /data/erafiles/ folder as the source for the data. Next, obtain the latest open trackers from this github repository. Choose a healthy mix of udp/http/https trackers and choose the chunk size of the torrent to be 64MB.
The tool should output a .torrent file. The GUI will also allow you to copy a magnet link if that is required
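For the mktorrent step, the sketch below only assembles a command line and does not run it. It assumes the -l (piece length as a power of two), -a (announce URL) and -o (output file) options as found in mktorrent 1.x, so it is worth checking against mktorrent -h on your system. The tracker URLs and the output filename are placeholders, not the trackers used for the published torrent.

```python
import shlex


def mktorrent_command(source_dir, output_file, trackers, piece_exponent=26):
    """Assemble a mktorrent invocation; 2**26 bytes corresponds to 64 MiB pieces."""
    cmd = ["mktorrent", "-l", str(piece_exponent), "-o", output_file]
    for url in trackers:
        cmd += ["-a", url]  # one -a per tracker
    cmd.append(source_dir)
    return cmd


# Placeholder trackers: substitute entries from an up-to-date tracker list
TRACKERS = [
    "udp://tracker.example.org:1337/announce",
    "https://tracker.example.org/announce",
]

cmd = mktorrent_command("/data/erafiles", "erafiles.torrent", TRACKERS)
print(shlex.join(cmd))
```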

Instructions for download and verification of torrent data:

Download the torrent data in a torrent client of your choice using this magnet link: link
Clone the latest release of geth and install the dependencies
Run make all in the geth repository to build the era binary
Fetch the roots.txt file with the command: wget https://gist.githubusercontent.com/lightclient/528b95ffe434ac7dcbca57bff6dd5bd1/raw/fd660cfedb65cd8f133b510c442287dc8a71660f/roots.txt
Run era verify roots.txt in the folder to verify the integrity of the data

imkharn · June 12, 2024, 5:27pm

“there is however no way to incentivise or ensure that this data is served by peers”

Why would a client team use torrent? Incentivized data storage exists and better ones are being developed. Additionally more efficient ones exist where only a fraction of the data is stored on each node with essentially the same chance of the data being available.

arnetheduck · June 13, 2024, 6:08am

In order to be able to verify the data in the torrent, one needs access to roots.txt which actually is an accumulator similar to the one found in the consensus beacon state.

For verification purposes, it would be best if this file was included in the torrent as that would allow consumers of the data to hardcode a single hash-of-the-accumulator which in turn can be used to verify the era1 file contents, reducing the number of moving parts further down to a single hash - this hash could then be distributed together with other mainnet metadata, for example in the mainnet repository that establishes configuration parameters.

kdeme · June 13, 2024, 7:45am

this hash could then be distributed together with other mainnet metadata

And there is EIP-7643 which defines this hash in order to be able to verify the whole list of roots.
It would be good to have it in the mainnet metadata indeed.

Also the individual roots are specified in EIP-7643.

parithosh · June 13, 2024, 8:43am

Could you elaborate on what incentivized data storage method could be used instead?

The essential issue with incentivising data storage is that the user would then have to pay for data access, which isn’t currently the paradigm if you run your own node (other than hardware costs ofc)

parithosh · June 13, 2024, 8:49am

I like the approach of having the EIP-7643 roots available as mainnet metadata (although I am not sure EL teams currently subscribe to the mainnet repo the same way CL devs do, maybe this will get them to start). Additionally, getting verification data from a different source than the source data seems like a good idea.

arnetheduck · June 13, 2024, 9:13am

The only verification data you need in this case is the hash from EIP-7643, namely 0xec8e040fd6c557b41ca8ddd38f7e9d58a9281918dc92bdb72342a38fb085e701.

This hash allows you to verify proofs.txt which in turn allows you to verify the rest of the era archive (strictly, proofs.txt shouldn’t be needed either, but I suspect it’s convenient to have it to make the process of finding errors in the files more fine-grained) - there exists no benefit whatsoever of getting this file from a separate source (as long as you have the above hash) - it belongs inside the torrent.

This is also the problem with the idea of using a torrent, which is why it’s somewhat unattractive to clients: there is a structural mismatch in the verification mechanism used - the hashes of the torrent itself don’t line up with hashes used in ethereum, which is why we need proofs.txt to begin with - this makes partial verification somewhat involved (because as you’re downloading the files in the torrent, you need to verify both the torrent hashes and the ethereum proofs).

That said, I do believe there’s utility in socially coordinating around a single torrent file for convenience of testing, if nothing else

parithosh · June 13, 2024, 4:06pm

Hmm, yeah, I’d also tend to agree that the proofs.txt won’t be needed at all. Wouldn’t we just rely on the inherent integrity offered by the torrent in that case? Changing any byte in a torrent should invalidate the whole torrent, so as long as we agree that the torrent values were checked before the torrent was included in a client release - we wouldn’t need any additional verification.
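The integrity property referred to here comes from per-piece hashes. The sketch below, assuming BitTorrent v1 semantics (a SHA-1 digest per fixed-size piece), shows that a single flipped bit is detected in the piece that contains it. The piece size is tiny for demonstration and the helper names are illustrative.

```python
import hashlib


def piece_hashes(data, piece_size):
    """SHA-1 digest of each fixed-size piece, as stored in a v1 .torrent file."""
    return [hashlib.sha1(data[i:i + piece_size]).digest()
            for i in range(0, len(data), piece_size)]


def corrupted_pieces(data, expected, piece_size):
    """Indices of pieces whose digest differs from the expected list."""
    actual = piece_hashes(data, piece_size)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]


original = bytes(range(256)) * 4         # 1024 bytes of stand-in data
expected = piece_hashes(original, 256)   # four pieces of 256 bytes each

tampered = bytearray(original)
tampered[300] ^= 0x01                    # flip one bit inside the second piece
print(corrupted_pieces(bytes(tampered), expected, 256))
```

Note that this check only ties the data to the torrent's own metadata, not to any Ethereum-native commitment.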

arnetheduck · June 13, 2024, 7:24pm

We ideally want to reach a place where “most” pieces of the ecosystem use the same verification method - both portal and era1 natively use the proofs.txt verification that works with native ethereum algorithms for hashing etc - torrent adds an alien component to the security in the form of its own verification - this is not great, because it doesn’t allow connecting it to era1, portal and other pieces.

My point with proofs not being needed strictly is that the one hash in EIP-7643 combines all the hashes in proofs.txt - you don’t need proofs.txt unless you want to verify smaller chunks of data - adding proofs.txt to the torrent allows you to verify each era separately instead of having to verify all 400+GB at a time.

agree that the proofs.txt won’t be needed at all

this is not quite what I meant - I mean it’s not strictly necessary from a security point of view, but it is very useful to have inside the torrent file for convenience of verification.
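The two-level structure described in this post (one pinned hash commits to a list of per-era roots, and each root commits to one era file) can be sketched as below. Loud caveat: this uses flat SHA-256 as a swapped-in stand-in. It is not the hashing that era1 files or EIP-7643 actually use, and every name here is hypothetical.

```python
import hashlib


def h(data):
    return hashlib.sha256(data).digest()


def commit(roots):
    """Stand-in for the single pinned hash that covers the whole list of roots."""
    return h(b"".join(roots))


def verify_roots(roots, pinned):
    """Step 1: check the downloaded list of roots against the one pinned hash."""
    return commit(roots) == pinned


def verify_era(era_bytes, index, roots):
    """Step 2: check one era file on its own against its entry in the list."""
    return h(era_bytes) == roots[index]


# Stand-in data: three tiny "era files" and their roots
eras = [b"era-0 contents", b"era-1 contents", b"era-2 contents"]
roots = [h(e) for e in eras]
pinned = commit(roots)

print(verify_roots(roots, pinned))              # list matches the pinned hash
print(verify_era(eras[1], 1, roots))            # a single era verified in isolation
print(verify_era(b"tampered bytes", 1, roots))  # a damaged era is pinpointed
```

With only the pinned hash, a failure says that something is wrong somewhere in the archive; with the list of roots alongside the data, the failing era can be identified and re-fetched on its own.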

imkharn · June 13, 2024, 8:04pm

Noting that I have not surveyed the incentivized file storage field since 2019… these schemes usually work by sending out challenges to people who are supposed to be online. They are required to produce the data sample or they get slashed. Considering blockchain data is public (relative to people storing personal and corporate files), it may need to be an encrypted fragment so that they can’t pass the challenge by looking up the data online. The most efficient decentralized data storage (as measured by the amount of data duplication redundancy needed for a given percent chance of losing data) is probably still Storj; see “Understanding File Redundancy: Durability, Expansion Factors, and Erasure Codes” in the Storj Docs.
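A generic challenge-response spot check of this kind might look like the sketch below. It is not the protocol of Storj or of any specific network: the verifier here holds a full reference copy, which real schemes avoid, and nothing in it addresses the concern about public data being looked up elsewhere. The chunk size and all names are assumptions.

```python
import hashlib
import hmac
import secrets

CHUNK = 32  # bytes sampled per challenge (arbitrary for this sketch)


def make_challenge(data_len):
    """Pick a random offset and a fresh nonce so answers cannot be precomputed."""
    return secrets.randbelow(data_len - CHUNK + 1), secrets.token_bytes(16)


def respond(stored, offset, nonce):
    """Prover: hash the nonce together with the requested chunk."""
    return hashlib.sha256(nonce + stored[offset:offset + CHUNK]).digest()


def check(reference, offset, nonce, answer):
    """Verifier: recompute from a reference copy and compare in constant time."""
    return hmac.compare_digest(respond(reference, offset, nonce), answer)


data = bytes(range(256)) * 4   # stand-in for a stored fragment
offset, nonce = make_challenge(len(data))

honest = respond(data, offset, nonce)
missing = respond(bytes(len(data)), offset, nonce)  # prover that lost the data

print(check(data, offset, nonce, honest))
print(check(data, offset, nonce, missing))
```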

Regarding covering the cost of the data storage, I see 2 options:

Force. Enshrined so nodes are required by threat of slashing to store a fraction of the old data. Challenges constantly spot check.
Paid. The inefficiency from overduplication of data already exists. Consider that nodes might already be willing to pay a small amount to free up hard drive space. Since data availability needs to be forced either way, this is more of an extension of the force suggestion, whereby nodes can declare how much space they want to give to the network. If they offer above/below par, some financial value is transferred.

parithosh · June 14, 2024, 9:28am

Okay, that makes sense - I’ll have to recreate the torrent file with the proofs.txt, I can do so and update the post!

parithosh · June 14, 2024, 9:33am

Thank you for the links!
My main criticism of that approach is implementation complexity. Maybe the protocol adds such logic in the future, but I don’t see us adding such complexity in the base layer for a few more forks (years) at this point.