Lots of PoS based BFT algorithms are claiming to have very high rates of throughput*, combined with low latency. Much more so than what one finds in deployed Nakamoto consensus chains.

What critical tradeoff or assumption is made in these algorithms which unlocks this capacity improvement?

*I mean high rates on the base layer, not about enabling sharding or any L2 solution.

With POW, when a block is mined (miners)
Compute the new block ASAP => hash the hell out of it => Be the first to propagate it
With POS, when a block is mined
Proposers compute the new block => Validators validate and vote it => it gets part of the canonical chain

With POW, you are incentiviced to make that time to hash the block as long as possible to have a chance to find the solution to the puzzle. So you need the computation of the new block to be as fast as possible.
With POS you can take all that time to compute transactions, since it doesnâ€™t matter how much time you spend as long as you vote in time and your vote gets propagated.

There are some gains to be had wrt more strict timing of events (not relying upon the random distribution of blocks via pow) but these are generally minor.

I am afraid I really donâ€™t understand what you are saying, even though your descriptive premise of POS/POW seems about right, in particular what does this mean?

you are incentiviced to make that time to hash the block as long as possible to have a chance to find the solution to the puzzle

This was my intuition also, however, I am seeing claims about 100s of validators, across multiple continents, sub 10s finality and 100s of transactions per second. This just does not seem to be something I would expect normal Nakamoto type chains to be able to handle, but perhaps I am wrong.

PoW chains can theoretically be configured to target any arbitrary block time but the block time is inversely proportional to the orphan rate (rate of unincluded valid blocks due to network latency). The higher the orphan rate, the lower the economic incentives to mine because more work is left unrewarded, the lower the network hashrate, and therefore the lower the security of PoW.

Ethereum solved this problem via the introduction of â€śuncle blocksâ€ť, basically issuing an extra reward to orphaned blocks in addition to included blocks. This is how Ethereum can reach a target block time of 15 seconds instead of Bitcoinâ€™s 10 minutes. This does however force the community of ETH tokenholders to subsidize the cost of security through extra inflation. If we decrease the block time even further, sub-10s as you proposed, we would have to issue rewards for more orphaned blocks and subsidize the security of the network via more inflation, driving ETH value down over time.

To answer your original question about high throughput BFT consensus algorithms (e.g. Tendermint), byzantine fault-tolerant finality is based on the assumption that at least 2/3+1 of preset validator nodes will remain online and honest at all times. Because the number of validator nodes is preset, finality is achieved once a required quorum of validator signatures is present. There is no need for further block confirmations because of the 2/3+1 assumption.
Additionally, since weâ€™re already making the 2/3+1 assumption, non-validator nodes generally do not have to run full nodes or download full blocks to verify the validity of the chain because they rely can query information about the blockchain from the validators based on the assumption that 2/3+1 of them are honest.

Believe it or not, PoW algorithms can also be fast. The reason why PoW is slow is network and not PoW itself.

At Skale we know how to do a PoW which will have 1000 tps or more.

What is hard to do with PoW is fast finalization, because you need to wait for a larger number of blocks to be secure. High throughput PoW is totally possible, fast finalizing PoW - not really, because it is forkful