EthStorage: Scaling Ethereum Storage via L2 and DA

Motivation

Reusing Ethereum mainnet security and extending Ethereum scalability are the major goals of Ethereum L2 solutions. Most existing L2 solutions such as rollups focus on scaling Ethereum's computation power, i.e., achieving higher transactions per second. Meanwhile, with the popularity of dApps such as NFTs and DeFi, the demand for storing large amounts of data while reusing Ethereum mainnet security has grown dramatically.

For example, one strong storage demand comes from on-chain NFTs, where users own not only the tokens of an NFT contract but also the on-chain images themselves. In contrast, storing the images with a third party (e.g., IPFS or centralized servers) introduces additional trust, which can be easily and frequently broken (e.g., many images of old NFT projects using IPFS are now unavailable).

Another demand comes from the front ends of dApps, which are mostly hosted on centralized servers (with DNS). This means the websites can be easily censored (as happened with Tornado Cash). Further issues include DNS hijacking, website hacking, and server crashes.

By reusing Ethereum mainnet security, all the aforementioned problems can be immediately solved. However, if everything is stored on-chain, the cost will be extremely high - for example, storing 1GB of data using SSTORE will cost 1GB / 32 (bytes per SSTORE) * 20000 (gas per SSTORE) * 10e9 (gas price in wei, i.e., 10 Gwei) / 1e18 (wei to ETH) * 1500 (USD per ETH) ≈ $10M! The cost can be reduced to roughly 1/3 by storing the data as contract code, but it is still far more expensive than other storage solutions (S3/FILECOIN/AR/etc.).
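
As a back-of-the-envelope check, the arithmetic can be reproduced in a few lines of Python; the gas price and ETH price are the same illustrative values as in the text:

    # Rough cost of storing 1 GB on Ethereum L1 via SSTORE, using the
    # illustrative numbers from the text above.
    GB = 1_000_000_000          # bytes
    SLOT_SIZE = 32              # bytes stored per SSTORE
    GAS_PER_SSTORE = 20_000     # gas per fresh storage slot
    GAS_PRICE_WEI = 10 * 10**9  # 10 Gwei
    ETH_PRICE_USD = 1_500       # illustrative ETH price

    total_gas = GB / SLOT_SIZE * GAS_PER_SSTORE
    total_eth = total_gas * GAS_PRICE_WEI / 10**18
    print(f"{total_eth:,.0f} ETH ~= ${total_eth * ETH_PRICE_USD / 1e6:,.1f}M")
    # -> 6,250 ETH ~= $9.4M, i.e., on the order of $10M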

Goals

With L2 and data availability technologies, we believe that we can achieve an Ethereum storage scaling solution with the following goals:

  • Increase the capacity to PB scale or more, assuming each node has a few TB of disk
  • Reduce the storage cost to 1/100 or 1/1000 of SSTORE
  • Offer KV CRUD semantics similar to SSTORE (a few limitations apply, see below)
  • Reuse Ethereum mainnet security for block re-organization, storage cost settlement, and censorship resistance

Solution

How? The current cost of storing a large amount of data on the Ethereum mainnet comes in two parts:

  • Upload cost (calldata)
  • Storage cost (SSTORE)

For the upload cost, with DA, especially danksharding, we expect the upload cost to decrease dramatically in the near future - e.g., the current draft of EIP-4844 charges 1 data gas per byte (i.e., ~128K gas per BLOB given a BLOB size of ~128KB) with a minimum data gas price of 1. The upload throughput can be further increased ~10x by full danksharding, which should further reduce the cost of uploading data.
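
For comparison, the same kind of estimate under the draft EIP-4844 parameters quoted above (these are draft numbers and may change):

    # Upload cost per BLOB under the draft EIP-4844 parameters quoted above.
    BLOB_SIZE = 128 * 1024           # ~128KB per BLOB
    DATA_GAS_PER_BYTE = 1            # draft: 1 data gas per byte
    MIN_DATA_GAS_PRICE_WEI = 1       # draft minimum data gas price

    data_gas_per_blob = BLOB_SIZE * DATA_GAS_PER_BYTE  # ~131,072 data gas
    cost_eth = data_gas_per_blob * MIN_DATA_GAS_PRICE_WEI / 10**18
    print(data_gas_per_blob, cost_eth)  # at the price floor, negligible vs. SSTORE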

For the storage cost, our solution is to build a permissionless L2 data retention network. The L2 network contains the following components:

  • a storage contract deployed on Ethereum mainnet, which offers KV CRUD semantics such as put()/get()*/delete() besides verify() (a minimal interface sketch is given below). The storage contract does not store the full values of the keys - only commitments (e.g., KZG commitments of BLOBs) are stored, while the corresponding data is made available thanks to DA. The storage contract will also accept proofs of storage and efficiently verify that the data is stored by the data nodes in the L2 data retention network. (get()* is only available on data nodes, see below)
  • data nodes run a special Ethereum client (a modified version of geth plus a consensus client), which synchronizes the latest state of Ethereum (and thus all commitments of the values of the KV pairs). Further, the data nodes serve the following additional functions:
    • accept the configuration of which parts (i.e., shards) of the BLOBs to host
    • synchronize the BLOBs of interest by joining the L2 data retention network
    • copy the BLOBs from the Ethereum mainnet DA network whenever the corresponding commitments in the storage contract are updated/appended
    • generate proofs of storage, submit them to Ethereum mainnet, and collect the storage fee
    • serve storage_contract.get() via the JSON-RPC eth_call() method

Note that running a data node is completely permissionless - as long as a data provider has sufficient disk space, it can run a node, synchronize the BLOBs from the L2 network, copy the BLOBs from DA whenever the commitments on L1 change, and prove the retention to L1.
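
To make these KV semantics concrete, below is a minimal, non-normative sketch (in Python, for readability) of what the contract-side interface could look like. The class and field names are illustrative assumptions; the key point, as described above, is that only commitments and metadata live on-chain, while the full BLOBs live in the L2 data retention network.

    # Illustrative sketch only (not the actual contract): on-chain state keeps
    # commitments/metadata per key; full BLOB data is stored off-chain by the
    # data nodes of the L2 data retention network.
    from dataclasses import dataclass

    @dataclass
    class KVMeta:
        kv_idx: int        # index of the KV entry
        commitment: bytes  # e.g., KZG commitment of the 128KB BLOB

    class StorageContractSketch:
        def __init__(self):
            self.meta = {}  # key -> KVMeta (on-chain metadata)

        def put(self, key: bytes, blob_commitment: bytes) -> None:
            # Store/update only the commitment; the BLOB itself is posted as DA
            # data and picked up by data nodes watching this contract.
            idx = self.meta[key].kv_idx if key in self.meta else len(self.meta)
            self.meta[key] = KVMeta(idx, blob_commitment)

        def delete(self, key: bytes) -> None:
            self.meta.pop(key, None)

        def verify(self, key: bytes, claimed_commitment: bytes) -> bool:
            # Used when checking proofs of storage against the on-chain commitment.
            return key in self.meta and self.meta[key].commitment == claimed_commitment

        # get() is intentionally absent on-chain: reads are served by data nodes
        # via eth_call (see "Other Questions" below), since L1 never stores the
        # full values.

This mirrors why get()* is starred in the list above: values are never read back from L1 state, only verified against it.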

Benefits vs Existing Solutions

Besides reusing Ethereum mainnet security, EthStorage can offer the following additional benefits:

  • Rich storage semantics (KV CRUD). FILECOIN/AR mostly work for static files, which lack efficient update/delete operations - i.e., users have to pay twice to update existing data. Thanks to DA and smart contracts, EthStorage can offer full KV CRUD semantics similar to SSTORE.
  • Programmability. The storage can be programmed by smart contracts, which easily enables new features such as multi-user access control or data composability.
  • Atomicity of application logic and storage logic. Current dwebs using ENS generally require two steps: 1, uploading the data to an external storage network; 2, storing the contenthash on ENS. With DA and EVM, EthStorage can complete both the application logic and the storage logic in a single transaction, which is more friendly to users (and is also the norm in Web2 applications, e.g., Twitter/FB/etc.).
  • Zero onboarding cost. EthStorage is built on top of Ethereum and the storage cost is paid in ETH, so storage operations can be done with just an ETH wallet like Metamask - users do not have to learn a new token/wallet/address.

Key Problems to Solve

  • Proof of storage on large dynamic datasets with data redundancy: Enabling data redundancy in a decentralized way is a key challenge, especially when the dataset changes constantly.
  • Efficient recurrent storage payment: Submitting a valid proof of storage on-chain will reward the prover (or data providers) with a storage fee (in ETH). We need an efficient storage rental/payment model to ensure sufficient redundancy of the storage data perpetually.
  • Rare data discovery and token incentives to encourage auto re-replication: When some of the data nodes go offline, we need to encourage other nodes to join the data retention network and replicate the data.
  • Efficient on-chain verification: We will explore techniques, especially zero-knowledge proofs, to reduce the cost of verification.

Other Questions

  • Q: What is the difference with DA?
    A: Current Ethereum DA is expected to expire the data after a few weeks or months (see the Proto-Danksharding FAQ). EthStorage is expected to store the data permanently under some well-known assumptions on storage cost (e.g., storage cost relative to the ETH price keeps dropping every year). Further, EthStorage offers full KV CRUD semantics on-chain (note that read is only available on data nodes).

  • Q: What is the access protocol to read from EthStorage (similar to ipfs://)?
    A: The data can be retrieved by calling eth_call on a data node, which will search for and retrieve the corresponding BLOBs in the L2 network. Further, from an end-user perspective, we could use the web:// access protocol to browse the BLOBs hosted by a smart contract, whose content can be dynamic.
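
For instance, assuming a data node exposes the standard JSON-RPC endpoint and the storage contract has a get(bytes32 key) view method (the endpoint URL, contract address, and ABI below are hypothetical placeholders), a read from Python could look like:

    # Hypothetical example: reading a value through a data node's eth_call.
    # The RPC URL, contract address, and ABI below are placeholders only.
    from web3 import Web3

    w3 = Web3(Web3.HTTPProvider("https://data-node.example.org"))  # a data node, not an ordinary L1 node
    storage = w3.eth.contract(
        address="0x0000000000000000000000000000000000000000",      # placeholder address
        abi=[{
            "name": "get", "type": "function", "stateMutability": "view",
            "inputs": [{"name": "key", "type": "bytes32"}],
            "outputs": [{"name": "data", "type": "bytes"}],
        }],
    )

    key = Web3.keccak(text="my-nft-image.png")
    data = storage.functions.get(key).call()  # served from the node's locally hosted BLOBs
    print(len(data), "bytes")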


Hello Qi,

I think that your proposal can produce a useful system indeed.

That said, I don’t see an analysis of the issues with it. I would be curious to learn what you consider future roadblocks here, or items for future research.

For one, I know that Filecoin has trouble serving hot-storage items. IIUC this is partially because its SNARKs are quite expensive. On this note, I don’t see an analysis of plausible proof techniques, their performance, and how each of them impacts the hot storage capabilities of the system. Such a system needs strong confidence that the data exists and can be served back in time. An analysis of this part of the design would be useful.

Cheers!


Thanks Qi, this is really interesting. I like that you’re exploring different pricing models rather than going with a permanent storage model (which is plaguing the L1).

Seems to me that this can be a helpful tradeoff - I essentially can pay less but have less redundancy of my data. Maybe there are 100 replicas of my data instead of 400K on L1. Not unlike sharding but for only the storage resource. This seems reasonable given it’s probably a 1/n honest trust assumption.

A couple of questions

  1. Do you expect the new price of storage to be roughly replica count / L1 validator count? Or do the savings scale more than just with the number of replicas?
  2. As a full node, how do I validate the state posted to the L1 by the L2? If I don’t store the data myself, seems like I have to trust someone. I suppose they can gossip witnesses but I’m not able to validate every transaction unilaterally
  3. How do you view the tradeoffs between this approach and the weak stateless model? If we end up with a centralized handful of block builders, do we still gain much scalability?

Many thanks for your comments, George! Indeed, the problem of proving storage of a large dynamic dataset with sufficient redundancy in the network (e.g., 30~50 replicas) via smart contracts is probably the most challenging problem here. The current solution of EthStorage is inspired by both proof of replication from FILECOIN and proof of random access from AR.

To prove the exact replication of the data, i.e., to prove n physical replicas of the data in the network, the basic idea of FILECOIN is to ask each data provider to store a sealed version of the raw data as

D_i = seal(D, pk_i)

where D is the raw data, pk_i is the key of the i-th data provider, and D_i is the sealed replica. The sealing procedure is very time-consuming, taking about 6 hours on a high-performance machine for 32GB of data. Then, a time-limited interactive random challenge is performed so that pretending to store multiple physical replicas by computing them from the raw data on demand is infeasible.

In the EthStorage case, we cannot use the FILECOIN solution directly since the raw data changes constantly. Instead, we use the idea of approximate proof of replication. First, for each BLOB stored in the network, we ask each data provider to store a physical replica as

D_i^{(j)} = D^{(j)} \oplus mask(j, commitment(D^{(j)}), pk_i)

where D^{(j)} is the j-th 128KB BLOB (with zero padding), and mask() is a power-intensive function, e.g., multiple rounds of Ethash, that can nonetheless be computed in sub-seconds (so that the physical replica can be updated in a timely manner).
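
A minimal sketch of this encoding, with repeated SHA-256 standing in for the power-intensive Ethash-style mask() (the actual mask function, round count, and commitment scheme are design parameters that are not fixed here):

    # Sketch of the approximate proof-of-replication encoding: provider i stores
    # a masked copy of BLOB j. Repeated SHA-256 stands in for the Ethash-style
    # mask(); the real function, rounds, and commitment scheme will differ.
    import hashlib

    BLOB_SIZE = 128 * 1024  # 128KB, zero-padded

    def mask(j: int, commitment: bytes, pk_i: bytes, rounds: int = 64) -> bytes:
        seed = hashlib.sha256(j.to_bytes(8, "big") + commitment + pk_i).digest()
        out, counter = b"", 0
        while len(out) < BLOB_SIZE:
            h = seed + counter.to_bytes(8, "big")
            for _ in range(rounds):  # repeated hashing as the "expensive" part
                h = hashlib.sha256(h).digest()
            out += h
            counter += 1
        return out[:BLOB_SIZE]

    def encode_replica(raw_blob: bytes, j: int, commitment: bytes, pk_i: bytes) -> bytes:
        blob = raw_blob.ljust(BLOB_SIZE, b"\x00")
        m = mask(j, commitment, pk_i)
        # D_i^{(j)} = D^{(j)} XOR mask(j, commitment(D^{(j)}), pk_i)
        return bytes(a ^ b for a, b in zip(blob, m))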

Now, for each Ethereum epoch (32 * 12 = 384s), given a random seed from prevRANDAO, we ask each data provider to perform a large number of random samplings N_s, with each sampling corresponding to a hash candidate (as in PoW). If a hash candidate satisfies the difficulty d (stored in the storage contract), then a valid proof of storage is found and the data provider can submit the proof to the Ethereum mainnet and collect the storage rental fee. The difficulty parameter will be adjusted on-chain similarly to the Ethash difficulty adjustment algorithm, but with a much larger expected submission interval T_b, which can be a couple of hours.
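
A simplified sketch of one epoch of this sampling (the candidate hash, sample layout, and difficulty encoding below are illustrative choices; only the overall loop mirrors the description above):

    # Per-epoch random sampling over locally stored masked replicas (sketch).
    # A candidate is a valid proof of storage when its hash is below a target
    # derived from the on-chain difficulty d.
    import hashlib, random

    SAMPLE_SIZE = 4096  # bytes read per sample (illustrative)

    def try_epoch(prev_randao: bytes, pk_i: bytes, replicas: list, n_samples: int, target: int):
        """Return (sample_index, proof_hash) if a valid proof is found, else None."""
        rng = random.Random(prev_randao)           # seed derived from prevRANDAO
        for s in range(n_samples):                 # N_s samplings per 384s epoch
            j = rng.randrange(len(replicas))       # which masked BLOB replica to open
            off = rng.randrange(len(replicas[j]) - SAMPLE_SIZE)
            chunk = replicas[j][off:off + SAMPLE_SIZE]
            h = hashlib.sha256(prev_randao + pk_i + s.to_bytes(8, "big") + chunk).digest()
            if int.from_bytes(h, "big") < target:  # satisfies difficulty d
                return s, h                        # submit to the storage contract
        return None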

A couple of further comments on the idea:

  • Given difficulty d, we could estimate the number of physical replicas as d / T_b / N_s
  • Parameters such as N_s and the mask function mask() should be chosen such that the cost of honest physical storage (including power cost + storage device lifetime cost) is much lower than the cost of computation on demand (i.e., raw data storage cost + computing mask() on demand + computation device lifetime cost). I have a simple calculator using typical Ethash mining data (a parameterized sketch of such a calculator follows this list).
  • Further, the mask() function should be easily and efficiently verifiable on-chain using ZKP. This area is open to research, and the current plan is to use a modified Ethash with the Poseidon hash to generate the initial DAG.
  • A further extension can allow a data provider to store only a part of all BLOBs, i.e., some shards of the BLOBs. This can be done by partitioning all BLOBs into non-overlapping shards (e.g., 4TB or 8TB in size so that a commodity disk can store at least one), with each shard having independent mining parameters (difficulty, prevhash, prev_submission_time, etc.) in the storage contract.
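
The calculator mentioned above is not reproduced here; the following is only a parameterized sketch of the comparison it performs, and every numeric input is a placeholder rather than actual Ethash mining data:

    # Sketch of the "honest storage vs. on-demand recomputation" comparison.
    # All numeric inputs are placeholders, not the actual Ethash mining data.
    def annual_storage_cost(tb, usd_per_tb_drive=20.0, drive_life_years=5.0,
                            idle_watts_per_tb=1.0, usd_per_kwh=0.10):
        hardware = tb * usd_per_tb_drive / drive_life_years
        power = tb * idle_watts_per_tb * 24 * 365 / 1000 * usd_per_kwh
        return hardware + power

    def annual_compute_cost(mask_calls_per_year, joules_per_mask=50.0,
                            usd_per_kwh=0.10, gpu_usd=1000.0, gpu_life_years=3.0):
        power = mask_calls_per_year * joules_per_mask / 3.6e6 * usd_per_kwh
        hardware = gpu_usd / gpu_life_years
        return power + hardware

    # Honest storage of an 8TB shard vs. recomputing mask() for every sample:
    samples_per_epoch = 1_000_000
    epochs_per_year = 365 * 24 * 3600 / 384
    print(annual_storage_cost(8.0))                                  # tens of USD/year
    print(annual_compute_cost(samples_per_epoch * epochs_per_year))  # orders of magnitude higher

With placeholder inputs like these, honest storage is cheaper by several orders of magnitude, which is the property the parameter choice is meant to guarantee.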

I am happy to know if you have further questions or suggestions :slight_smile:

Hi Lydia, many thanks for your questions. Please see below for my answers.

I would expect the new price of storage not to depend on the L1 validator count but on the gas price and the replica count. The EthStorage costs on Ethereum mainnet mainly consist of two parts: 1, the put() method of the storage contract to store a BLOB commitment and other metadata; and 2, the mine() method for verifying a proof of storage on-chain.

For the put() method, the charge should only be gas price * constant gas per BLOB's metadata (commitment, kvIdx, etc.) + a continuously-discounted upfront storage cost, where the upfront storage cost is a preset parameter chosen to sustain an expected number of replicas (under some assumptions on profit margin, storage lifetime/power cost, etc.).
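
One way to read the "continuously-discounted upfront storage cost" is an exponential-decay payout schedule, where a one-time upfront payment funds a perpetual stream of shrinking rewards; the decay form and the numbers below are my assumptions for illustration, since the exact discount function is not fixed here:

    # Sketch of a continuously-discounted payout: a one-time upfront payment F
    # funds yearly rewards that decay at rate r, so total payouts never exceed F.
    # The exponential-decay form and the numbers are assumptions, for illustration.
    def yearly_payout(upfront_eth: float, year: int, decay_rate: float = 0.05) -> float:
        return upfront_eth * decay_rate * (1 - decay_rate) ** year

    F = 0.01  # upfront storage cost for one BLOB, in ETH (placeholder value)
    print(sum(yearly_payout(F, t) for t in range(200)))  # approaches F; the deposit is never exceeded

If the storage cost denominated in ETH drops faster than the payout decays (the permanence assumption mentioned earlier), the shrinking rewards can still cover the physical cost of keeping the target number of replicas online.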

For the proof of storage verification cost, the amortized cost of verifying the storage of each BLOB should be very small, considering the proof is succinct (thanks to ZKP) and submitted infrequently (the average proof submission interval is targeted at a couple of hours).

Yes, that is the goal of the proof of storage part (see my response to George's question above), where the network can show proof on-chain that the data has been replicated with the expected number of physical replicas. Further, such physical replicas must match the latest BLOBs posted on L1 (since we have the commitments of all BLOBs). Note that the availability of the storage data is a 1/N honest assumption, so we can safely trust the on-chain storage proof as long as it shows that there are sufficient replicas in the network.

This is a very good question, especially given the verkle tree development in the Ethereum roadmap, which we are closely watching. Note that one goal of EthStorage is to support ~PB capacity assuming each data node/provider can permissionlessly join the storage network with a few TB (4TB or 8TB in the current design). Further, if we relax the CRUD semantics (e.g., put() and remove() with additional proofs) so that we only maintain a single commitment on-chain for all BLOBs (instead of one commitment per BLOB), the capacity should be theoretically unlimited (or limited only by the ETH supply). This may form a more decentralized storage network vs. a centralized handful of block builders, each of which may need PB capacity to generate verkle tree witnesses.

Hope this can answer your questions. Feel free to let me know if you have more :slight_smile: