Two observations from the inference serving side (I used to work as a machine learning engineer at Hugging Face; @omarespejel on X):
1. The refund mechanism leaks more than refund values.
The server doesn’t just observe C_max - C_actual. In production LLM inference, each request exposes a rich feature vector: output token count, time-to-first-token (which correlates with input length and KV cache state; vLLM’s automatic prefix caching is the canonical example), generation latency, and, if the server uses speculative decoding, the draft-model acceptance rate, which varies systematically by prompt domain and task type (see “The Disparate Impacts of Speculative Decoding,” arXiv:2510.02128). Over N requests, straightforward clustering on these features can re-link anonymous requests to the same user, even with perfect nullifier unlinkability. This is traffic analysis, the same class of attack as flow correlation in Tor (DeepCorr achieves ~96% correlation accuracy from flow metadata, arXiv:1808.07285).
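A toy illustration of the linkage attack. All feature values and scaling constants below are made up; the point is only that nearest-neighbor linkage on serving metadata clusters requests by user with no identifiers involved.

```python
# Hypothetical sketch: re-linking "anonymous" requests via per-request
# serving metadata. Feature vectors are (output_tokens, ttft_ms,
# tokens_per_sec); the numbers are illustrative, not measured data.
import math

requests = [
    # user A: long coding prompts -> high output counts, warm prefix cache
    ("a1", (850, 120, 42.0)),
    ("a2", (910, 115, 41.5)),
    ("a3", (880, 118, 42.3)),
    # user B: short chat prompts -> low output counts, cold cache
    ("b1", (60, 310, 55.0)),
    ("b2", (75, 295, 54.2)),
    ("b3", (70, 305, 54.8)),
]

def dist(u, v):
    # scale each feature so no single dimension dominates the distance
    scale = (1000.0, 400.0, 60.0)
    return math.sqrt(sum(((a - b) / s) ** 2 for a, b, s in zip(u, v, scale)))

def nearest(label, feats):
    # link a request to its closest neighbor among all other requests
    others = [(l, f) for l, f in requests if l != label]
    return min(others, key=lambda lf: dist(feats, lf[1]))[0]

links = {l: nearest(l, f) for l, f in requests}
# every request links to another request from the same user
print(all(src[0] == dst[0] for src, dst in links.items()))  # True
```

With real traffic the features are noisier, but the attacker also gets far more than three dimensions and far more than N=3 requests per user.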
This isn’t theoretical. vLLM’s chunk-based prefix caching had a documented timing side channel (CVE-2025-46570, GHSA-4qjh-9fv9-r85r), where cache-hit timing differences achieved an AUC of 0.99 with 8-token prefixes, enough to verify whether two requests share context. Patched in vLLM 0.9.0, but the fundamental issue is architectural: any shared-cache inference server leaks request similarity through timing unless explicitly mitigated.
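A toy model of why that oracle is so clean (the TTFT numbers below are illustrative, not measured vLLM data): a cache hit skips prefill for the shared chunks, so the hit and miss TTFT distributions barely overlap and the AUC saturates.

```python
# Illustrative TTFT samples (ms) for requests whose prefix is cached
# vs. not. A hit skips prefill for shared chunks, a miss pays full prefill.
hits   = [18, 20, 19, 21, 17]
misses = [95, 102, 88, 110, 99]

def auc(pos, neg):
    # AUC = probability a random "miss" TTFT exceeds a random "hit" TTFT
    wins = sum(m > h for m in neg for h in pos)
    return wins / (len(pos) * len(neg))

print(auc(hits, misses))  # 1.0 -- the distributions are fully separable
```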
2. We can eliminate the refund circuit entirely, and probably should.
Instead of the server issuing signed refund tickets for C_max - C_actual (which requires the ZK circuit to verify server signatures and sum refund accumulators), have the user commit to an output token budget T_out from a small set of fixed classes (e.g., 256 / 512 / 1024 / 2048 tokens). The server generates up to T_out tokens and charges a flat price(T_in_class) + price(T_out). Users select from the same fixed set of input-length classes. Each (input class × output class) cell provides k-anonymity; every request in a cell looks identical from a billing perspective.
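A minimal sketch of the fixed-class billing. The class boundaries and prices are placeholders, not a proposal for actual values:

```python
# Flat pricing over fixed (input class x output class) cells.
# All boundaries and price units below are made-up placeholders.
IN_CLASSES = [256, 1024, 4096]          # max input tokens per class
OUT_CLASSES = [256, 512, 1024, 2048]    # committed output budget T_out
PRICE_IN = {256: 1, 1024: 3, 4096: 10}  # flat price units per input class
PRICE_OUT = {256: 2, 512: 4, 1024: 8, 2048: 15}

def to_class(n, classes):
    # round up to the smallest class that fits; reject oversize requests
    for c in classes:
        if n <= c:
            return c
    raise ValueError("request exceeds largest class")

def quote(t_in, t_out_budget):
    # price depends only on the cell, never on tokens actually generated
    return (PRICE_IN[to_class(t_in, IN_CLASSES)]
            + PRICE_OUT[to_class(t_out_budget, OUT_CLASSES)])

# two requests with very different actual usage land in the same cell
print(quote(300, 900), quote(1000, 1024))  # 11 11
```

Because `quote` never sees the actual output length, there is no per-request value for the server to sign, refund, or misreport.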
No refund, no variable signal, no server-signed tickets, no accumulator, no refund summation circuit. The protocol gets dramatically simpler. The trade-off is ~20-40% cost overhead from unused token budget, but inference costs are dropping fast enough that this is tolerable, and on the privacy and complexity axes it’s a strict improvement.
This also resolves a trust assumption in the current design: with variable-cost refunds, the server reports C_actual and the anonymous user cannot dispute without deanonymizing. A malicious server can under-report refunds to extract surplus. With flat pricing per class, there’s nothing to misreport.
For the remaining timing side channels (TTFT, generation latency), quantized input classes plus padded output mean the server sees approximately the same resource profile for every request in a given cell. To fully close the TTFT channel, inputs should be padded to their class boundary as well, since actual input lengths still vary within a class.
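A sketch of the output-padding half, assuming a hypothetical pad sentinel that the client strips (in practice the server would pace emission so wall-clock time is also constant per cell):

```python
# Pad every response to its committed budget T_out so that emitted token
# count is identical for every request in a cell. PAD is a hypothetical
# sentinel; a real protocol would pick something the client filters out.
PAD = "<pad>"

def serve(real_tokens, t_out):
    out = list(real_tokens[:t_out])    # truncate at the committed budget
    out += [PAD] * (t_out - len(out))  # always emit exactly t_out tokens
    return out

resp = serve(["Hello", ",", " world"], 8)
print(len(resp))  # 8 -- same length whether the answer was 3 tokens or 8
```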