# Hashgraph Consensus Timing Vulnerability

“They are not mathematicians or computer scientists” is an attempt to appeal to authority.

I wanted to use Hashgraph for Near, and read all the proofs very carefully, as well as had an implementation that I stress tested, and I am personally convinced that it does properly order transactions. I also understand the algorithm very well, so if you want to try to break it, I would be happy to answer any questions.

Note that before a round received can be assigned, at least one round needs to pass, and a round requires 2/3 of participants to observe that 2/3 of participants observed a witness message for the round (so called “strongly seeing”), and computing this property for each message cannot be done faster than O(n^2) of compute. Since a consensus for a particular transaction takes on the order of O(n log n) messages, it takes O(n^3 log n) compute on each node to reach a consensus for a single transaction (this gets amortized when you have multiple, so ordering k transactions takes way less than O(k n^3 log n), but I don’t have an exact estimate handy). This computation cost is why I personally ended up not using Hashgraph. But it somewhat should address your concern with the complexity – Hashgraph is by no means cheap.

Again, disclaimer: I’m attesting to the validity of the Hashgraph consensus algorithm, but do not endorse the Hedera Hashgraph protocol in general.

2 Likes

First of all I must say that kladkogex communication style appears very annoying to me, almost like a textbook version of a stereotype Russian troll. This is unfortunate as the topic is super interesting IMHO. Personally I would like to see a valid counterexample to the hashgraph consensus, so I decided to get into the discussion despite the annoying communication style.

Ok, first of all we need to clarify what we want to achieve and I propose the following:

Proof that hashgraph consensus is not a byzantine fault tolerant atomic broadcast algorithm under the assumptions made in the hashgraph paper.

Do you agree? (Please no moral, hate talk, accusations bla, bla… Its hard enough to go that road with you anyway. Stay scientific, ok?)

Ok, how do we do that? BFT atomic broadcast, which hashgraph claims to be, is defined by the following four axioms:

1.) Validity: if a correct participant broadcasts a message, then all correct participants will eventually receive it.

2.) Uniform Agreement: if one correct participant receives a message, then all correct participants will eventually receive that message.

3.) Uniform Integrity: a message is received by each participant at most once, and only if it was previously broadcast.

4.) Uniform Total Order: the messages are in a total order in the mathematical sense; that is, if any correct participant receives message 1 first and message 2 second, then every other correct participant must receive message 1 before message 2.

The task is then to find a concrete counterexample to one of these axioms. This boils down to finding a time-series of concrete hashgraphs (for users A(t),B(t),C(t),…) which violates one of the previous points after we run the algorithms of the paper on it.

Now kladkogex claimed in the initial post to have found such a situation. Unfortunately his example missed the actual hashgraph. So kladkogex can you please rewrite your example into an actual series of hashgraphs for users A,B,C,…, so everyone can run the algorithms on it (by hand) and see the contradiction? After all you made very strong accusations and according to the rules of science, which you brought stubborn into the game, the burdon of proof is on you. As you accused craigdrabiktxmq as being mathematically imprecise it’s now your turn to make your also mathematically imprecise proposed counterexample precise, so we all can see that it works.

I think its enough to write the appropriate hashgraph related to your counterexample with pen&paper and then send a picture of it here for discussion.

P.S: Leemon Baird was a professor of mathematic by the time he wrote the paper. He wrote >40 other math papers. This of course does not mean, that it is correct. Personally I like the style of the hashgraph paper.

However, I’m not a fan of Hedera, not the least! I think they are like the pest for decentralized payment and I would love to see them fall. But we have to do it properly. Its David against Goliath and David can not afford to rank high on the crackpot index, if he claims any chance in this battle.

7 Likes

Atomic Broadcast and Weak Synchrony?
I will repeat what is already written and referenced about the hashgraph timing:
“The validity is dependent on the (directly or indirectly) outgoing edges of the transaction, which represents the nodes that have validated it. A scale-out throughput can be achieved if the acquirement of the complete graph is not obligated for all nodes. Then, the validity of the transactions is compromised due to its dependency on the validators”.
A Scale-out Blockchain for Value Transfer with Spontaneous Sharding, Zhijie Ren and four more authors. Link here: https://arxiv.org/pdf/1801.02531.pdf

Actually my counterexample is not off point. If Bob holds a transaction submitted to Bob, that happens before the transaction reaches the network. This is something no consensus mechanism can prevent. If the receiving node fails to send transactions on to the rest of the network, there is no action that can be taken by the network - any network.

In Hedera this is easily defended against. Submit your transaction to multiple nodes. The first transaction gossiped to the network will be accepted and ordered. The subsequent transactions are detected as duplicates and rejected.

This paper refutes the ability of a Hashgraph network to scale at O(N) where N is the number of nodes in the network. Hashgraph does not claim an O(N) scale out as a function of the number of nodes. The performance charts in the Hedera white paper clearly show that this isn’t the case.

Here is the original paper, where hashgraph took the idea of virtual voting from. It is based on the old idea of Lamport clocks and gives a pretty clear analysis of the run time of this family of algorithms:

It is a shame and IMHO plain evil that this Leemon bird-person did not mentioned his sources properly. SEction 4 clearly defines virtual voting as votes deduced from the causality structure in the graphs that occure if messages acknowledge other messages.

Roughly speaking, the local run time is O(n^3), because every user will have at least one witness nodes in each round (more under forking [called mutations in the original paper] ), each of these witness nodes execute a consensus algorithm step for every of the n witnesses in all previous rounds that have their fame not decided, yet. This is O(n^2) for a single round. Estimations from bubble-hashgraph (test implementation on GitHub ) then suggest that has to be done roughly for n previous rounds giving roughly O(n^3) runtime.

What is optimal is network complexity. This is the case with all virtual voting based algorithms from the Moser Smith family. But hashgraph is fare less optimal then Mosers original approach when it comes to local computation complexity (O(n^3))

Therefore for everyone who is not bothered by the non-locality of these algorithms (The number of nodes or amount of stake must be known), they should really implement one of Mosers original algorithms instead. Much better performance on the local process. IMHO Nakamoto is superior as it does not depend on the number of user or any overall staking value but is FULLY LOCAL

1 Like

Vow :-))

The Hashgraph bots got really personal, touching on my nationality.

I am not a Russian troll simply because I am Ukrainian. Having said this, I know great scientists and people from Russia. I am a US citizen, who was named Directors Fellow at Los Alamos National Lab (US) and who received Otto Hahn Medal for the best PhD research in Germany. I am also author of more than 20 publications in top scientific journals - something that the Hashgraph whitepaper can never achieve.

If my style is annoying to snake oil sellers I am more than happy about that I stand by the $10,000 offer, lets make it$20,000 having the interest to this message thread.

To determine super majority for strongly seen you just add up the stake of the voting nodes including their proxy stake and divide it with all 50bln coins available. It must be more than 2/3.

On the first page of the Hashgraph whitepaper it is made clear that Hashgraph is NOT deterministic (because of the undeterministic nature of the gossip protocol I guess) , therefore Hashgraph being ABFT doesn’t contradict the FLP theorem.

Having said that, according to my understanding your concern is easy to address. Deferring a transaction by a certain node (either because of network issues or due to malicious behaviour) is just to be considered as a byzantine failure of one node. As long as 2/3 of the community agrees upon the order of the transactions, the consensus order is determined with finality. If a transaction later comes in with a considerably different timestamp, it just doesn’t matter. Otherwise really just one crash or malicious node would make the network lose its liveness or even its consistency.
So the goal is not to have a 100% agreement of all nodes about each and every transaction timestamp. At that level you can have great differences in perception due to errors or malicious activity. The point is that as long as not more than 1/3 of all nodes (or stake) have a problem and therefore 2/3 of them can agree upon the order the transactions, the consensus is reached with finality even if one third of the nodes are sleeping for a year before starting to take part in the consensus again.

I had a similar issue in a sharded setup which I discussed for a long time with Hedera executives and technical stuff. In that case it was much harder to sort out the problem, but in the end we could also find a solution by choosing the right kind of events on the functional level to gossip out as transactions in the example application I made up to illustrate the problem. The clue was to use events directly generated by the users and not events calculated by the application representing its state (e.g. user 1 pulled the trigger is a good event, whereas user 2 has died because of the bullet is a bad one.)

In this case, I guess for mathematicians it is clear that the algorithm and the proof are correct. With a machine generated Coq proof further enhancing the level of certainty provided by a peer reviewed paper you would have a hard time to prove otherwise. According to Leemon Baird no computer generated math proof has ever been falsified so far. So, I guess you need to dig into it a lot deeper if you want to achieve this for the first time in history for the Hashgraph algorithm.

I had a look at the paper, but I cannot see it refuting Hashgraphs scale out capability. At the point where the paper cites the Hashgraph paper, Hashgraph is thrown together with IOTA and other DAG algorithms and the whole group of algorithms gets explained on the basis of IOTA. Apart from the fact that Hashgraph is also a DAG it has very little common with the Tangle from IOTA. To be concrete, the edges in the DAG of Hashgraph have nothing to do with the validation of transactions as opposed to the Tangle of IOTA.

So the paper seems to be very unspecific and shallow for me in this respect. Or did I overlook something?

Given these discussions I think it would be really important to have an official statement or paper from Leemon Baird himself on the formal scalability figures of Hashgraph - both concerning communication cost per transaction and local computational cost per transaction. Also the question why exactly hashgraph can be seen as undeterministic should be answered, so that the contradiction to the FLP theorem can be sorted out with precise arguments.

There has been some confusion regarding Hashgraph in this thread, which I believe I should be able to clarify to some extent. At the same time I would like to emphasize that I’m not endorsing Hedera in any way, and actually I’m surprised that none of the Hedera’s employees bothered to answer all these important questions that have been asked here.

First of all, even though (as many of us agree here) the whitepaper is terribly written, there is no doubt that mathematically the protocol sketched there is correct, i.e., it achieves safety and liveness. However, there are some caveats…

First of all, regarding communication complexity, in the “optimistic setting” (i.e., if you assume no adversary, perfect randomness etc.) this would be O(n \log n) per node, per round. This is pretty good, yet comes at the expense of bad latency (when compared to partially synchronous protocols like PBFT or HotStuff). Furthermore, as mentioned, this assumes no adversary. In the adversarial setting, dishonest nodes may fork events and consequently can build so called “Fork Bombs” (see https://arxiv.org/abs/1908.05156 Section H) and essentially make the communication complexity infinite.

Agreed 100%. This is a shame. What is even more of a shame is that this paper is not referenced in the whitepaper but it is mentioned in the patent itself https://patents.justia.com/patent/10318505 under “Other references” and still the patent went through. This means that it was spotted by the patent examiner, yet was disregarded anyway. Of course, the examiner/s are not the ones to blame here, as they work under very tight timeframes, and cannot always spot such things. In fact, figuring out that mathematically, this work of Moser et al. and Hashgraph are virtually the same requires deep understanding and expertise in the field of BFT systems.

Random gossip is not quite enough to overpower FLP impossibility (intuitively – the adversary can still arbitrarily manipulate the structure of the graph). Hashgraph’s source of randomness are actually local coin flips performed by all the members independently in the Consensus algorithm. While this is enough to overcome FLP, what is proved in the whitepaper is only that the consensus is “eventually achieved”, yet going a little deeper into the proofs one can see that the probability of achieving consensus per one round is roughly 2^{-2n/3} and thus the only provable guarantee provided in this paper is an exponential number of rounds to terminate. One can also show that in the ABFT model this is not only the upper bound, but is achieved (see https://arxiv.org/abs/1106.5170). Note that this is under a strong adversarial setting – however, this is the model Hashgraph is claimed to work within.

2 Likes

Hey Damian,

I think you got the probability to reach consensus per round wrong. The number you gave is the probability to reach agreement in a coin flip round. But coin flip round do rarely occure. In fact the hedera test net suggest that coin flip rounds almost never occur in Hederas actual network (This of course, would drastically change in a permissionless setting, where there are sufficient attacks. But no matter what they say Hedera IS NOT permissionless). Leemon explizitly said in his paper, that he could have implemented a stronger coin, but the coin is just a theoretical possability that almost never happens in reality. Therefore the algorith ist faster, if one just uses such a “fully local” coin…

The real computational burden, is that essentially (up to forking) every veritual witness message has to execute consensus on every vertial witness message in a previous round. That is very inefficient and not necessary. It leads to the slow local run time.

Can you explain, why the network complexity is O(n log(n))? Is n the number of users, or the amount of bits in the overall network? What is that number? I mean, of course you have to send the transactions themself, there is no way around it. Then what is interesting is the additional amount of information which should be O(1) or maybe O(log(m)\$ where m is the existing number of messages overall.

But more important: GOOD CATCH ON THE PATENT!

So at least after the patent approval, Hedera must have known, that the original paper exists, because its in the patent. Amazing. That’s proof that they tell porky-pies to the programmers and investors, who have putted hard work and money into their believe in Hedera. Even today they never mention this anywhere.

=> So please everybody spread the word. Tell people about Moser’s paper. I know most people in Ethereum do not see Hashgraph as an actual threat, because its not fully local (depend on the number of participant/stake), but even then. Don’t let theses Machiavellist get away with it.

The more interesting question is: Can this be made into an attack vector against Hedera and a solid strategy for patent revocation?

Ethereum foundation, or other free software foundations, might give out a grand (or grands) to lawyers for bringing this to court or at least to some patent committee.

I mean, come on! Ethereum foundation spend quite bit of money on less promising projects, so this should be something to consider. In any case it would set clear boundaries for what the decentralization movement can tolerate and what not.

However from the perspective of a systems engineer, one should not consider the patent as especially important, simply because the hashgraph is not that important. The marketing/branding was the work of a genius though, but in reality everyone who want to build around the hashgraph but can’t do it in the US, due to the patent, should use one of Mosers algorithm instead. They perform at least as good as hashgraph and sometimes better. A while ago I implemented them to compare them with the hashgraph SDK. They are faster and easier to implement anyways. Moreover they are around for moren then 25 years and the paper is peer reviewed.

There is a large gap between theory and “practice”. It is clear that in the tests the coin rounds will not appear because the tests are not run in an adversarial setting (i.e. no node is cheating, no-one is DDOS-ing anyone else, etc.). In fact, it is not possible to reliably “model an attack” in a test, and thus statements of the kind “In tests it works super fast, but we can only prove exponential-time bounds” are always a little shady.

What I’m saying is that in the ABFT model (which means, roughly that there are malicious nodes in the network and the adversary can arbitrarily delay messages) one can give a valid strategy for the adversary such that:

• The consensus never happens in a round that is not a coin round
• In a coin round, the consensus happens with probability roughly 2^{-2n/3}, i.e., exponentially small

Of course, one might argue that this model is very harsh and thus not very practical, but that’s a different discussion – there are some convincing arguments that this is still the best way to model Internet (see for instance the discussion in HoneyBadgerBFT https://eprint.iacr.org/2016/199.pdf).

That’s correct. However, let me comment on that too. At the time of writing Hashgraph’s whitepaper there was no ABFT implementation of “shared coin” known that would work without a trusted dealer. All of us know here that this then essentially disqualifies such a protocol for the kind of applications it is meant for. Very recently a paper https://arxiv.org/abs/1908.05156 was released (that I’m a part of) that gives the first fully asynchronous and without-a-trusted-dealer implementation of a shared coin. Still using it as a black box in Hashgraph is far from obvious.

1 Like

Sure! By n I mean the total number of nodes, i.e., participants in the closed committee that run the Hashgraph protocol. Further, O(n \log n) is the amount of metadata (i.e. excluding transactions) that is required to “advance to the next round”. The concept of rounds might not be super clear here (it comes from the theory of BFT systems), but at least it is also called a “round” in the whitepaper. What “a round” essentially means is that to advance from round r to round r+1 you need to hear some messages from all (or more precisely \geq 2n/3) other nodes that have reached round r.

This number comes from the fact that the nodes need to emit roughly \log n layers of events so that everyone reaches a new round (I can explain more upon request, since this might be still too handwavy…). So there are (again, this is in the optimistic case) O(n \log n) events per round, each of which contains a constant number of hashes and signatures, hence I count each as O(1).

Now, each event also contains some transactions – the overhead per transaction is now very tricky to estimate. This boils down to counting how many times (on average) a single transaction is input to different events. If every transaction happens to be input only once (meaning that there is perfect coordination between nodes behind the scenes who should input which transaction) then each transaction only incurs O(1) communication per node. But avoiding duplicate transactions is very tricky, as it can lead to censorship and can also increase latency to run this “coordination in the background”.

It is fair to say that per one round the optimistic (i.e., no adversary) communication complexity of Hashrgaph is:

O(n \log n + T\cdot N_{dup}),

where T is the number of transactions input in this round and N_{dup} is the average number of copies of a single transaction that is input. Again, it is very hard to keep N_{dup} small (constant), as this might lead to transaction censorship and large validation latency. Now one can see that if there is only a small number of transactions then the size of metadata actually majorizes the communication cost coming from transactions, in which case the overhead is not constant. Also, as mentioned in one of my previous posts: in the adversarial setting (so the real one) there is a fork spam attack that can drive communication complexity to infinity.

Ok. That is fair enough…

In any case, I think it was MIcal, not you who gives the first fully asynchronous and without-a-trusted-dealer implementation of a shared coin, based on cryptographic sorting (Micali: byzantine agreement made trivial.). His algorithm is not asynchronous but the “less common coin” is…

This is a little bit of a digression, bu just to clarify: Micali’s “Frequently Magic Coin” is a nice idea, but I’m not aware of any asynchronous implementation of it. The problem is that in the asynchronous setting, the adversary can force half of the honest nodes to hear the “lowest hash node” and the other half to not hear it (since the adversary can delay messages), and hence they will never decide on the same “lowest hash node”.

Also this gives a weak common coin only – i.e., only with probability 1/3 this actually outputs a common value. Such a weak randomness source was known for a long time (yet was much more complicated and communication intensive) see the work of Canetti and Rabin (https://dl.acm.org/citation.cfm?id=167105). For blockchain applications you need a strong common coin that always (with probability 1) outputs a common random value.

There is a BLS-based strong common coin that we use at SKALE. Basically a BLS threshold signature of a known message is a common random number

Yes, good point, that is the most common technique, introduced by Cachin in his work https://link.springer.com/article/10.1007/s00145-005-0318-0

The idea is quite powerful and at the same time rather simple, and I believe this is what one should use for reliable randomness generation in blockchain. However, the nontrivial thing is how to generate the threshold keys in a trustless way.

This problem of generating keys has a name: DKG (Distributed Key Generation) and has a long history of research behind it. Summarizing very briefly: essentially all the past approaches work in the synchronous network model (such as Gennaro et al. https://link.springer.com/article/10.1007/s00145-006-0347-3) which is generally agreed to be ill-suited for blockchain applications. The only results that go beyond synchroncity are:

1. “DKG in the Wild” by Kate et al. https://eprint.iacr.org/2012/377.pdf – it offers a partially synchronous solution, yet is quite heavy: O(n^3) communication per node
2. Aleph protocol that I’m a coauthor of https://arxiv.org/abs/1908.05156 – offers full asynchroncity, O(1) rounds and O(n^2) communication per node
3. This work https://eprint.iacr.org/2019/1015 by Kokoris et al. – full asynchronicity but latency up to n rounds and O(n^3) communication per node.