## Scenarios

- Existing imbalance of data stored / gas spent across shards
- Migration of data from one shard to another for other technical/business reasons
- New data from off-chain sources / Eth1

For the purpose of solving this problem we will use dType (a Decentralized Type System on Ethereum 2.0) and the fact that it now has a `count` function for stored items. The `count` function is not ideal, because it does not measure the actual storage cost, but it gives a good enough approximation for the current purpose.
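For illustration only (the function name is hypothetical, not part of the dType API), a shard's load could be approximated by summing the item counts reported by the dtype contracts stored on it:

```python
def estimated_shard_load(item_counts):
    """Approximate a shard's data load as the sum of the item counts
    reported by each dtype contract's count function.
    Note: this ignores per-item storage size, so it is only a proxy."""
    return sum(item_counts)

# Made-up example counts for three dtype contracts on one shard
print(estimated_shard_load([120, 45, 300]))
```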

## Problem Statement

Given a number of shards `shard_count`, each loaded with a certain `shard_load`, and a number of dType storage contracts `dtype_count`, each with a certain `dtype_load`, we need to find a way to balance the loads across shards.

The result is a list of shards, each with a list of dtype IDs that should be added to that shard and the final data load of that shard.
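Concretely, the inputs and output can be pictured as follows (the values are illustrative, not taken from the notebook):

```python
# Inputs: (id, load) pairs, as produced by enumerate() in the notebook code
shard_loads = [(0, 120), (1, 45), (2, 300)]   # (shard_id, shard_load)
dtype_loads = [(0, 900), (1, 300), (2, 150)]  # (dtype_id, dtype_load)

# Output shape: one tuple per shard:
#   (shard_id, added_data_load, [dtype_ids to add to that shard])
```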

## Solution

The Python code for this solution is below; it can also be read at https://github.com/pipeos-one/dType/blob/f28bc63377f0565b1809c4dd4842242ce25dbd73/docs/research/Data_Load_Balancing_of_Shards.ipynb.

```python
import random

shard_count = 20
dtype_count = 50
# Increase average_coef if there are not enough shards for all dtypes
average_coef = 1.3
max_shard_load = 400
max_dtype_load = 2000

# Initialize random load values for shards and dtypes
shard_loads_initial = list(enumerate([random.randrange(i, max_shard_load) for i in range(shard_count)]))
dtype_loads_initial = list(enumerate([random.randrange(i, max_dtype_load) for i in range(dtype_count)]))

shards = [[] for i in range(shard_count)]
next_index_s = 0
next_index_dt = 0
last_index_dt = len(dtype_loads_initial) - 1
last_index_s = len(shard_loads_initial) - 1

# Sort loads: ascending for shards, descending for dtypes
shard_loads = sorted(shard_loads_initial, key=lambda tup: tup[1])
dtype_loads = sorted(dtype_loads_initial, key=lambda tup: tup[1], reverse=True)

# Calculate the average load per shard
average_load_shard = (sum(i[1] for i in dtype_loads) + sum(i[1] for i in shard_loads)) / shard_count
average_load_shard *= average_coef
print('average_load_shard', average_load_shard)

# Move heavier-than-average dtypes onto the least loaded shards
for i, dload in dtype_loads:
    if dload >= average_load_shard:
        shards[next_index_s].append(i)
        next_index_s += 1
        next_index_dt += 1

# Pair the heaviest dtypes with the lightest shards
# and add as many light dtypes on top as possible
for i, dload in dtype_loads[next_index_dt:]:
    if last_index_s < next_index_s:
        print('Needs more shards. Increase average_coef')
        break
    # Add the next heaviest dtype to the next lightest shard
    shards[next_index_s].append(i)
    # Add as many light dtypes as the average_load_shard permits
    load = shard_loads[next_index_s][1] + dload + dtype_loads[last_index_dt][1]
    while last_index_dt > next_index_dt and load <= average_load_shard:
        shards[next_index_s].append(dtype_loads[last_index_dt][0])
        last_index_dt -= 1
        load += dtype_loads[last_index_dt][1]
    next_index_s += 1
    next_index_dt += 1
    if next_index_dt > last_index_dt:
        break

print('(shard_index, shard_load, dtype_indexes)')
final_shards = [(shard_loads[x][0], sum([dtype_loads_initial[dtype_index][1] for dtype_index in shards[x]]), shards[x]) for x, _ in enumerate(shards)]
print('final_shards', final_shards)
```
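As a point of comparison (not part of the notebook), the pairing strategy above is close in spirit to the classic longest-processing-time (LPT) greedy heuristic: assign each item, heaviest first, to the currently lightest bin. A minimal self-contained sketch, with `balance_lpt` being a hypothetical helper name:

```python
import heapq

def balance_lpt(shard_loads, dtype_loads):
    """Greedy LPT sketch: assign each dtype (heaviest first) to the
    currently lightest shard. Inputs are plain load lists; the list
    indexes serve as shard/dtype IDs."""
    # Min-heap of (current_load, shard_id, assigned_dtype_ids)
    heap = [(load, sid, []) for sid, load in enumerate(shard_loads)]
    heapq.heapify(heap)
    for did, dload in sorted(enumerate(dtype_loads), key=lambda t: -t[1]):
        load, sid, assigned = heapq.heappop(heap)
        assigned.append(did)
        heapq.heappush(heap, (load + dload, sid, assigned))
    # Return (final_load, shard_id, dtype_ids), ordered by shard id
    return sorted(heap, key=lambda t: t[1])

# Two empty shards, three dtypes: loads end up balanced at 5 and 5
print(balance_lpt([0, 0], [5, 3, 2]))
```

Unlike the notebook version, this sketch has no `average_coef` cutoff, so it never refuses to place a dtype; it simply minimizes the running maximum load greedily.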

## Conclusions

We may need more data about gas/storage costs for this algorithm to be effective. Extended usage of dType will also improve the outcome.