A solution proposal to Vitalik's challenge: how to hide pigeonhole collisions (useful for Verkle trees, erasure codes, many ethereum contrstuctions, etc)

Vitalik recently tweeted this problem

Cryptography problem: make an invertible, easy-in-both-directions function from [0…2^256 - 1] to [0…2^256 - 2] (or more generally, -k for small k) where the exceptions to the invertibility are computationally infeasible to find.

Vitalik’s motivating use cases

A really efficient solution could actually significantly simplify Verkle trees, erasure codes and many other constructions in ethereum, by making 32-byte chunks one-to-one mappable to F_p elements.

The author attempts a solution with efficient verification, but one that requires preprocessing, and therefore is only as infeasibly hard to crack as the preprocessing How to hide pigeonhole collisions in 1 tweet

Expanded Solution:

Let function f1(x)=S1(x) yield a set of pigeon hole placements like f1(x) for any pigeon x
Let x be an integer, an index, into the range of interest, e.g. [0…2^256]
Let S1(x)=the sum of prime factors P1(x) for large integer L1(x)
S1 is a set of such sums
L1 is a set of such large integers
Let function f2(x)=S2(x)-k for other large integer L2(x) in set L2
k = the number of extra “pigeons”, which is 1 in Vitalik’s example
Let integers set D=S2-S1, so that D(x)=S2(x)-S1(x) for P2(x) L2(x) and P1(x) L1(x)
D is the a set of integers to shift pigeon positions from f1 to f2 and vice versa
C is a set of integers we add to D to resolve unwanted collisions into unused holes for any pigeon x
Let F(D,C,f1,f2,x) be the easily bi-invertible function from f1 to f2 and vice versa
Assign very large integers to sets L1 L2 at random
Let boolean b = (f1==f2+D-C-k) = collision if b is true
For easy verification of b, we can simply give away the prime factor keys

We sum up the prime factors in a large number set L to get sums S for pigeon hole placement set f.

As extra unwanted collisions can occur, we can resolve these into unused holes and keep track of this in C(x) for each D(x)

Keeping prime P’s a secret makes it hard for attackers to verify S and delta D.

We can make verifying S & D easy by just giving these P prime keys away for sets P1 P2. This reduces the complexity to mainly just multiplying primes in P to verify that they give L with S and D. C(x) can be checked for any collision offset to add to delta D(x).

A more verbose blog is linked here

-Gary Bernstein

I’m sorry, I don’t follow the notation. Can you explain what is x, which objects are numbers, functions, or sets of numbers?

1 Like

Vitalik’s problem sounds very hard to me, though I’m not much of a cryptographer. Pseudorandom permutations already seem difficult to construct. If you started there and modified it, it seems like your program would have to somehow encode 256 bits specifying which number gets double-pigeoned, which doesn’t seem good, unless maybe you explicitly base it on N without revealing its factors N=pq.

Under “Expanded Solution” I now explain the objects. Please LMK if any Qs. Thanks!

Please see “Explanded Solution” for set C which contains any offsets needed to resolve unwanted collisions

I’m afraid I don’t understand this solution either. Let me give my own thinking on this problem with some discussions I had with some other people at UIUC.

This problem seems very connected to the PPAD complexity class. You can read more about that class on the wikipedia page, but the key connection I see is that the “hard problem” in the class is the “End-Of-the-Line” problem.

G is a (possibly exponentially large) directed graph with every vertex having at most one predecessor and at most one successor. G is specified by giving a polynomial-time computable function f(v ) (polynomial in the size of v ) that returns the predecessor and successor (if they exist) of the vertex v . Given a vertex s in G with no predecessor, find a vertex t≠s with no predecessor or no successor. (The input to the problem is the source vertex s and the function f(v )). In other words, we want any source or sink of the directed graph other than s .

After bringing this up in a slack channel, James Hulett suggested that the critical question here is not whether EOL is hard, but whether it’s “hard on average”. This is because there is always an adversary that hard codes the colliding pigeons, so we have to think about making an efficiently-sampleable distribution of EOL instances for which the colliding pigeons are hard to find. This paper apparently gives an explicit distribution for this, but I’m not familiar with it.

The best solution that I could come up with that I think a general ethresearch audience might understand was the following (which is admittedly not good enough to be useful, but maybe it’s a starting point for understanding EOL better). It uses the formalism of the Sperner’s lemma PPAD complete problem.

In this picture each triangular cell corresponds to some number in the range [0, 2^256) with 2^256-1 at the top. The left-hand side is colored red, the right hand side is colored green, and the interior points are colored randomly red or green pseudorandomly (say by the SHA256 hash of the numbers in the adjacent cells). A cell that has red-red-red or green-green-green vertices maps to itself. A cell that has red-red-green or red-green-green maps to the cell that you move to by keeping red on your right and green on your left. For nodes on the interior, one can compute images and preimages by evaluating the greenness or redness of the nearby vertices.

The issue is what’s to be done about the lower edge, which is colored blue. I think we can safely say that it’s hard to determine which cell is the end of the line starting with the top. But what about the other red-green-blue cells? One possibility is to map such a cell to the green-red-blue cell on the bottom that starts that chain, making a loop. But then an adversary could potentially grind to find a long chain, making it hard for an honest node to compute the map. A bit of random walk theory leads me to the conjecture that there is \approx 1/\sqrt n chance that a particular walk ends up taking more than n steps, meaning that if the honest users are limited to n hash computations, an adversary would probably only have to do O(n^{3/2}) to break the scheme.


BoltonBailey, I’m wondering about the exact connection of End-of-the-Line with Vitalik’s almost-bijection problem.

I am thinking the vertices are the integers 0…2^256-1 and there is an edge (u,v) if f(u) = v.

Since f is hopefully almost-a-bijection, the graph should look like a vast majority of little 2-cycles, with a small number of other vertices which are the violators and are supposed to be hard to find. ([Edit] No, this is wrong. f is hopefully almost-a-permutation, i.e. a single path through the graph with no repeats. I was picturing f and f^-1 together.)

But in End-of-the-Line, I think we are given one of the violators to begin with and are asked to find another, which I think makes it a bit different. In Vitalik’s problem we aren’t given any hint of where to start looking. (Edit: oh, another difference is that in my model, a vertex can have multiple predecessors if f(u) = v and f(w) = v. But End-of-the-Line disallows this.)

I also wasn’t clear on the Sperner connection. Partially because there a path has to go only to neighboring cells, whereas our function f could map anything to anything…

1 Like

But in End-of-the-Line, I think we are given one of the violators to begin with and are asked to find another, which I think makes it a bit different. In Vitalik’s problem we aren’t given any hint of where to start looking. (Edit: oh, another difference is that in my model, a vertex can have multiple predecessors if f(u) = v and f(w) = v. But End-of-the-Line disallows this.)

In Vitalik’s problem, the “start of the line” is 2^256-1. This is a node which is supposed to have no predecessor, since the image of the function is supposed to be [0, 2^256-2].

The possibility that some vertices have multiple predecessors might be helpful for intuition, but we are supposed to have an inversion function f^{-1} : [0, 2^{256} - 1) \to [0, 2^{256}). So presumably given f, we could define f^* such that f^*(x) = f(x) if f^{-1}(f(x)) = x and f^*(x) = \emptyset otherwise, which puts us back in the End-of-the-Line case where no node has two predecessors.

The Sperner example I gave is just one possible approach to a solution, you might be able to come up with another approach with a different f where the potential mappings are different. I just decided to look for approaches using Sperner because I knew it was connected to this End-of-the-Line problem.

1 Like

Nvm, it seems a solution might break info entropy theory, so it’s not possible.

The above proposal is limited in how hard k pigeons are to find.

The harder to find, the more time to takes to verify a hole, or the more randomizing info that needs to be kept secret, or passed to the user/attacker.

For 2^256, or some virtually unbounded size, it’s not feasible