Random data compression -- starting to think it's possible, tbh

Ahoy-hoy Ethereum Researchers,

For the past few months I've been exploring the potential of compressing random data, and I'm requesting feedback on my current approach. I will describe how my algorithm works and, towards the end, where my research currently stands (the code for all of this is around 95% complete).

I want to state that I don't believe random data can be compressed directly – HOWEVER, and this is important – it might be possible to approximate random data and compress that approximation.

Here is how my compression algorithm works:

  1. A 1024-byte buffer (8192 bits) is split into 17-bit chunks.

  2. Each chunk is 'stored' in a Golomb-coded set (GCS) and prefixed with an offset ID (e.g. chunk 1, chunk 2, …). Space required = 706 bytes.

  3. For every offset in the GCS, a list of candidates is generated by combining the offset ID with every possible 17-bit value.

  4. Within each candidate list the chunk's offset is recorded, creating a list of 482 chunk offsets. The first bit of every chunk is also used to make the candidate lists smaller (482 bits, rounded up to 61 bytes, for this bit filter).

  5. The chunk offsets are grouped into pairs [x, y], and the pairs are grouped into sets of four [[x, y], …]. There are 61 such quad-sets.

  6. A modified proof-of-work (PoW) algorithm is run for every quad-set:

    1. For every nonce in a 4-byte range, calculate standard proof-of-work for each pair.
    2. Add it to the total cumulative PoW for that nonce.
    3. Keep a list of nonces sorted by total cumulative PoW.

The nonces require 244 bytes to store (61 quad-sets × 4 bytes) and 61 bytes have been used for the bit filter. That only leaves 13 bytes (1024 − 706 − 61 − 244), but as I'll show later – it may be enough.
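
To make step 6 concrete, here is a rough sketch of the per-quad-set search. It's only an illustration under my own simplifying assumptions (SHA-256 as the hash, leading zero bits of each pair hash as its PoW score, and a made-up 2-byte serialization of the pair offsets), so don't read it as the exact code in the repo. Also, scanning a full 4-byte nonce range in pure Python like this would be far too slow in practice:

from hashlib import sha256

def pow_score(pair, nonce):
    # PoW score for one [x, y] offset pair under a nonce: the number of
    # leading zero bits of SHA-256(x || y || nonce). The 2-byte offset and
    # 4-byte nonce encodings are assumptions for this sketch.
    data = pair[0].to_bytes(2, 'big') + pair[1].to_bytes(2, 'big')
    data += nonce.to_bytes(4, 'big')
    digest = sha256(data).digest()
    return len(digest) * 8 - int.from_bytes(digest, 'big').bit_length()

def search_quad_set(quad_set, nonce_range=2 ** 32, keep=8):
    # Scan the nonce range, score each nonce by the cumulative PoW of the
    # four pairs, and keep only the best-scoring nonces.
    best = []  # (cumulative_pow, nonce)
    for nonce in range(nonce_range):
        total = sum(pow_score(pair, nonce) for pair in quad_set)
        best.append((total, nonce))
        best.sort(reverse=True)
        del best[keep:]
    return best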

Decompression

Decompressing the data involves reconstructing the quad-sets and using the offsets therein to dereference the original data inside the GCS. The decompression algorithm is based on the following observations:

  1. There is a certain number of zero-prefix bits associated with the nonces that have the best possible cumulative PoW.
  2. We don't know ahead of time where these prefix bits may fall among the pair hashes. But since there are multiple results, it's possible to impose a rule-based filter, e.g. all hashes must have at least 4 zero-prefix bits (a minimal sketch of this check follows the list).
  3. This allows a heuristic filter to be used for brute-forcing the quad-sets, using the 4-byte nonce and the number of candidates for each chunk.
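
To be clear about terms, "zero-prefix bits" just means the number of leading zero bits of a pair hash. Here is a minimal sketch of the rule-based check from observation 2, assuming 32-byte SHA-256 digests and a hypothetical pair_digests list holding the four pair hashes produced by a candidate nonce:

def zero_prefix_bits(digest):
    # Number of leading zero bits of a hash digest.
    return len(digest) * 8 - int.from_bytes(digest, 'big').bit_length()

def passes_rule(pair_digests, min_zero_bits=4):
    # Observation 2 as a concrete rule: every pair hash in the candidate
    # quad-set must start with at least min_zero_bits zero bits.
    return all(zero_prefix_bits(d) >= min_zero_bits for d in pair_digests)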

The filtering approach will return a list of possibilities for each quad-set. The offset among these possibilities will be saved in metadata, along with the nonce. There are still 13 bytes remaining, so the last challenge is to encode every offset in less than 13 bytes. At first inspection this doesn't seem possible, because if we assume a worst-case scenario of 11 bits per offset, that totals 83 bytes. But the nonce list contains patterns, and that may be key to solving this problem.

On average it requires a certain number of attempts to find a high-value PoW nonce, so most of the kept nonces tend to be high-value. A simple scheme that divides all of the nonces by the lowest nonce and saves the quotient, remainder, and minimum nonce saves an additional 60 bytes. By the way: all of the parameters chosen so far are there either to prevent exponential growth of the result sets – allowing the code to finish – or for space-saving / CPU-efficiency reasons. They are not arbitrary.

Here is the problem

This is where my research stops and I’m left with a bunch of unanswered questions. What range of offsets can we expect for all the quad-set candidates after filtering them? Can they be compressed like the nonces can? Can the bit filter be compressed? Is filtering every quad-set even possible? Can the bit filter be eliminated? What else am I missing?

Due to the large amount of compute resources required for filtering, I have only confirmed successful recovery of a single quad-set. My private compute cluster is only 96 cores at the moment, which isn't enough to continue this research (it would probably take a month to finish one run of the algorithm). I believe that I'm close to a working algorithm, but I lack the resources to prove this. If any of you think this will work, or have access to a ton of CPU cores (around 512, if not more), I think I can prove my approach will work.

All feedback welcome.

PS: Here is the current code just to give you a more precise idea: https://github.com/robertsdotpm/rand

My code is already mostly complete and shows how to marshal cryptographically random data into a GCS in less space and use a series of nonce puzzles to recover it.

The code points to my Spark cluster and isn't finished yet, though, so don't bother trying to run it.

This sentence ("On average it requires a certain number of attempts to find high-value PoW for the nonces") immediately strikes me as being false. Finding a correct nonce doesn't require a roughly constant amount of work, it requires a random amount of work. For example, here are some sample lowest nonces for the PoW problem "find me the lowest number such that the first byte of hash(data + nonce byte 0 + nonce byte 1) is zero":

from hashlib import sha256

def hash(x): return sha256(x).digest()
def check(data, nonce): return hash(data + nonce.to_bytes(2, 'big'))[0] == 0

def mine(data):
    # Return the lowest 2-byte nonce whose hash starts with a zero byte.
    for i in range(65536):
        if check(data, i): return i
>>> mine(b'cow')
100
>>> mine(b'dog')
123
>>> mine(b'horse')
698
>>> mine(b'pig')
553
>>> mine(b'donkey')
226
>>> mine(b'monkey')
98
>>> mine(b'unicorn')
273
>>> mine(b'human')
308

So there are no clever patterns here that you can take advantage of.

Hello Vitalik,

You are definitely right about "finding the correct nonce for a given prefix" being random. If there's a fixed 4-byte field and miners are searching for a nonce, then it could be anywhere from 0 to (2 ** 32) - 1 inclusive. But the algorithm here is a little different because it searches the full range no matter what. Trying more nonces means increasing the chances of finding a 'good' nonce.

So we should expect to have found more good-value nonces by the time the search reaches a high nonce value than a low one. We can't say exactly what the prefixes will look like or exactly what the nonce will look like. But after searching the full range, it's more likely for high-value nonces to relate to good-value prefixes than if we were to only search a smaller range.

I don’t know how well I’m describing this but here’s a list of nonces from this algorithm:

[1365061047, b'\xb7-]Q', 0.25, 0.015625],
[1516047276, b'\xac\x0b]Z', 2.0, 0.125],
[1921799507, b'SU\x8cr', 2.0, 0.125],
[2198046991, b'\x0f\x89\x03\x83', 2.0, 0.125],
[2247385947, b'[c\xf4\x85', 2.0, 0.125],
… etc

This is what the list of nonce candidates looks like for a single quad-set after filtering it. Here we see that the nonces are all high-value. Because there are multiple candidates anyway, we can force the nonces to be within a certain range, which makes the following nonce compression possible:

# Sort from largest nonce first.
nonce_infos = sorted(nonce_infos, key=lambda k: k[0], reverse=True)

# Divisor = the smallest nonce; the quotient bit size comes from the largest nonce.
d = nonce_infos[-1][0]
q = nonce_infos[0][0] // d
q_bits = q.bit_length()

# Find the remainder bit size (remainders are taken against the divisor d).
r_bits = 0
for nonce_info in nonce_infos:
    r = nonce_info[0] % d
    if r.bit_length() > r_bits:
        r_bits = r.bit_length()

print(q_bits)
print(r_bits)
print(32 - (q_bits + r_bits))  # Bits saved per nonce.

print(d)
print(d.bit_length())  # 4 bytes for the divisor.

# 9 bits for quotient
# 9 bits for remainder
It's a very, very simple algorithm for compressing the nonces, because they all seem to fall within a similar range. But the amount of space it saves in this context is quite extreme: you go from 244 bytes to roughly 138 + 4 bytes. I think there should be more optimisations like this.
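
For completeness, the decode side of this scheme is trivial: assuming each nonce is stored as its quotient and remainder with respect to the smallest nonce d (as described above), each nonce is rebuilt as q * d + r. A minimal sketch:

def decode_nonces(d, quot_rem_pairs):
    # Rebuild each nonce from its stored (quotient, remainder) pair and
    # the shared divisor d (the smallest nonce, stored in 4 bytes).
    return [q * d + r for q, r in quot_rem_pairs]

E.g. with d = 1365061047, the second nonce in the list above comes back as decode_nonces(1365061047, [(1, 150986229)]) == [1516047276].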

I'm even thinking that, with a hash-chain-like construction between the quad-sets, it may be possible to remove the final calibration offsets completely (the quad-set offsets that will be needed after using the nonce puzzles to filter quad-set candidates) – in which case this algorithm would actually work as it already is, with only minor code changes required (it already uses hash chains).

Thanks for taking the time to read this so far, btw.

Right, but that also means that if you then try to “mine” for the nonce a second time to get the same nonce when decompressing, then you’re likely to not get the same nonce again.

More generally, I agree that if any magic tricks of this type worked, then you could compress random data, but that’s precisely why none of them do.

Agreed, and this is why I save information on the hash prefixes and record the quad-set offset, used --after-- decompression / filtering, among the remaining bytes.

My heuristic for filtering currently looks like this (a rough sketch of the check follows the list):

  1. Choose a nonce where all zero prefixes for the pair hashes are < 8 and > 0.
  2. The second and third hash should have at least 3 zeros.
  3. Record the exact number of zeros in the first three hashes. Save this to the output format.
  4. Skip all results that don't satisfy those conditions when filtering (this quickly cuts through over 1000 billion candidates – for some definition of 'quickly' – and keeps the candidate set more targeted!)
  5. You’re left with a set of candidates for a quad-set.
  6. Record the offset of the correct result among the list of candidates. Save this to the output format.
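
In code, the per-candidate check behind steps 1 to 4 might look roughly like this. It's a sketch with assumed names: pair_digests holds the four pair hashes produced by a candidate nonce, and recorded_zeros is the list of exact zero counts saved in step 3:

def zero_prefix_bits(digest):
    # Number of leading zero bits of a hash digest.
    return len(digest) * 8 - int.from_bytes(digest, 'big').bit_length()

def candidate_passes(pair_digests, recorded_zeros):
    zeros = [zero_prefix_bits(d) for d in pair_digests]
    # Step 1: every pair hash has a zero prefix that is > 0 and < 8.
    if not all(0 < z < 8 for z in zeros):
        return False
    # Step 2: the second and third hashes need at least 3 zero bits.
    if zeros[1] < 3 or zeros[2] < 3:
        return False
    # Steps 3 and 4: the zero counts of the first three hashes must match
    # what was recorded in the output format during compression.
    return zeros[:3] == list(recorded_zeros)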

I still don't know if this metadata can be expressed compactly within the bytes that remain, or if the filtering approach will work for all quad-sets. But I feel that there are still avenues to make this work, given that there's still tons of space left and I've only just started optimising the final decompression stage. I've been able to recover the first quad-set for my test data. But I still don't know enough about what data to expect during the filtering stage to know if I'll be able to fit it all in.

I think there are still other improvements to be made too, when tying the quad-sets together in a chain, which might eliminate the need for the offsets in filtering step 6. But I haven't done much work in that area yet.

Edit: forgot to add – there's already plenty of space left for the final offsets ((2 ** 14) - 1 = 16383) and from what I've seen the offsets will only need up to 8 bits each. It would probably be easy to modify the heuristic algorithm to only use 1 or 2 fixed prefixes and compress all the remaining data neatly so it fits. Then you're left with a working random data compression algorithm for data over 1 KB – 1 KB being the minimum size that is compressible with this algorithm.