Specification for the Unchained Index Version 2.0 - Feedback welcome

If the user wants to access traces (or any data older than XXX blocks, for that matter), they will always need access to an archive node. We don’t pull any data from the chain other than appearances. This is by design, in order to minimize the size of our extracted database. The node itself is a pretty good database if it has an index. But all is not lost: with a good index, even a remote RPC endpoint is quite usable. In fact, the number of queries one must make is greatly reduced. Plus, all of our other tools have a built-in, client-side cache, so one need never make a second query of the remote RPC, making it even better. The single most important goal, from the start, was that things worked on small machines. Every byte counts.

Only traces require an archive node. Everything else (block header, transactions, receipts) is available with a full node (block history, transaction history, receipt history).

The single most important goal, from the start, was that things worked on small machines.

If you require an archive node, then this doesn’t work on a small machine as archive nodes don’t work on small machines.

I think your model is generally good, but I’m now quite concerned about the inability to disable the trace requirement. You would catch probably 99% of appearances without traces (just by looking at headers, transactions and receipts), and the disk requirement of a full node is about an order of magnitude lower than the disk requirement of an archive node. While I often complain about the size of Ethereum full nodes and the fact that most users cannot run them, requiring an archive node essentially forces you into having a dedicated server.

The use case I’m most interested in is the ability for a user to get their transaction history (appearance history would be even better) without needing to outsource to a third party like Etherscan or run their own multi-terabyte servers with indexes. I feel like your proposed solution here is really close to providing that, but only if traces can be either disabled or flagged in the index so they can be ignored by the vast majority of people who don’t have Ethereum archive nodes.

Note: I am not a fan of the solution of “just use a third party hosted archive node” as that is a point of centralization that I think we should try to avoid, and I’m also not a fan of “just buy a 4TB drive to store Ethereum state history on” as that puts the solution out of reach of essentially all consumers (even consumers with high end computers).

The above is all doubly true if you want the index to be built and maintained within existing clients. If it only works with an archive node, I think that is a non-starter: the assumption is that almost no one runs an archive node, so the index wouldn’t be useful to all of those people.

Do you have any data on what percentage of appearances would be missed if you dropped the transaction tracing? I would be curious to see that data with “app” contracts filtered out (e.g., ignore Uniswap internal stuff). It would be even cooler if we could somehow figure out how to filter out bots (e.g., MEV bots). My guess is that once you filter out apps and bots, the number of addresses that appear only in traces, and not in the transaction body or events, is vanishingly small and not worth the archive node requirement.

Historical state as well, which is important to us, as “reconciliation of historical state changes” is (one of) our “raisons d’être”.

This is definitional. By small machine, I mean a late-model Mac laptop (M1 or M2) with at least a 4TB hard drive (8TB preferred). Either Reth or Erigon fits on such a machine, as I demonstrated in the above-referenced video.

Because at its base we only index “appearances,” the end user can determine for themselves if they need traces. We certainly need them to create the index in a way that is as complete as possible (because without them, one simply cannot reconcile state changes). But, the end user can choose to ignore appearances that don’t appear in “regular places” (for example, as part of a logged event). Also, our scraper can be very easily modified to run against a node that does not provide traces. It would be a much-reduced index, but it would work perfectly well.

I’m not at all sure this is the case. (We should actually do a study of this, but we’ve not yet done that.) I suspect it’s much lower than that. In fact, the reason we had to “resort” to traces is that nothing reconciled without them. And, I think, this is exactly why automated off-chain accounting (or tax reporting) is in such an abysmal state. People ignore traces and things don’t come into balance (off-chain).

Totally agree here. In fact, so much so, that I made this issue: “chifra scrape - should allow for building an index that ignore traces” (Issue #3408, TrueBlocks/trueblocks-core on GitHub).

Again, I couldn’t agree more. I think in the end what we will find is that the state needs to be sharded, so, in much the same way that we “shard and share” the Unchained Index via chunking and a manifest on IPFS, the state will be chunked, sharded, and shared on some sort of content-addressable store as well. (And, it can use the index to get to the right portion of the state.)

There’s a recent, related post about that here: Trustless access to Ethereum State with Swarm


Two things: (1) I agree; (2) the index can be used by people without an archive node. They would just have to ignore, as not applicable, those “appearances” that we inserted from traces if they couldn’t retrieve traces. It would still work; it would just be less effective.

Two issues from one post. Nice!

Does the index indicate whether a given entry comes from a trace vs somewhere else? Or would you have to attempt to look up the transaction and then assume that if the address wasn’t present it means it was in a trace you don’t have (and not a bug in the index)?

If the index has that information then most of my complaints go away, other than the inability to initially build the index without an archive node, which can be solved by a “trusted syncing/trustless following” model.
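As a purely hypothetical illustration of what "flagged in the index" could look like, the sketch below adds a one-byte source flag to an appearance record. Nothing in this thread says the index stores such a flag; the record layout (two little-endian 32-bit integers for block number and transaction index, plus one flag byte) and all names here are assumptions made for the example.

```python
import struct
from enum import IntFlag


class Source(IntFlag):
    """Hypothetical flags recording where an address was seen."""
    TX = 1      # transaction body (from, to, input)
    LOG = 2     # receipt logs / events
    TRACE = 4   # internal calls, visible only via tracing


# Assumed layout: block number (uint32), transaction index (uint32), flags (uint8).
RECORD = struct.Struct("<IIB")


def pack_appearance(block: int, tx_index: int, source: Source) -> bytes:
    return RECORD.pack(block, tx_index, int(source))


def unpack_appearance(raw: bytes):
    block, tx_index, flags = RECORD.unpack(raw)
    return block, tx_index, Source(flags)


def needs_archive_node(source: Source) -> bool:
    """True when the address was seen in traces and nowhere else."""
    return source == Source.TRACE
```

With a flag like this, a reader without an archive node could skip records where `needs_archive_node` is true instead of fetching the transaction and guessing why the address is absent.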
