The Origins of MEV: Systematic Attribution of Arbitrage Opportunity Creation at Scale

This paper is authored by the MEV-X research team together with academic collaborators from MIPT, HSE University, and Skoltech.

This paper formalizes the MEV opportunity attribution problem: given an executed atomic arbitrage transaction T_{arb} that extracts profit \Pi, which preceding transaction created the price disbalance enabling that profit? We design and evaluate four attribution methods (bot-data-driven, simulation-based, coefficient-based, and Shapley-based) and apply them to 360,026 atomic arbitrage events on Polygon (March 2026, $334,799 in extracted value). Central finding: 96.7% of atomic arbitrage opportunities trace back to a single source transaction, consistent with the hypothesis that in competitive MEV markets, searchers extract value immediately when opportunities arise rather than waiting for multi-transaction sequences. MEV creation is highly concentrated: a small subset of protocols accounts for most opportunities, and concentrated-liquidity AMMs dominate despite not having the highest trading volume.

arXiv: 2604.27979


TLDR

  • The MEV literature has focused on extraction: how value is captured. The creation side (which on-chain transactions generate the conditions bots exploit) has no systematic treatment. We formalize it and build tooling to attribute it at scale.

  • We define a value distribution \{\phi_i\} over candidate source transactions satisfying \sum_{T_i \in C} \phi_i + \phi_{base} = \Pi, where \phi_{base} = \mathcal{M}(S_0, T_{arb}) is the profit attributable to pre-block state. EVM determinism makes this tractable: replaying a block without a candidate transaction yields an exact counterfactual, not a statistical estimate.

  • Single-source hypothesis validated: 96.7% of arbitrage events have one transaction accounting for >70% of positive Shapley value. Only 83 events (3.3%) show genuinely tied multi-source attribution.

  • Four methods, different trade-offs: bot-data (94.2% accuracy vs triangulated ground truth, 38.4% coverage, 8 ms/event), simulation (91.7%, 99.1%, 12.3 ms/event), Shapley MC (ground truth, 2.1 s/event). The coefficient method achieves 77.2% agreement with simulation at 0.8 ms/event. Recommended workflow: coefficient for screening, simulation as primary, Shapley when methods disagree or multi-source attribution is suspected.

  • MEV creation is concentrated: the top 1% of arbitrageurs capture 80% of extracted value; the top 1% of opportunity-creating transactions generate a similar proportion of MEV opportunities. Uniswap V3 (58.0% of opportunity transactions), Algebra (29.6%), and Uniswap V4 (28.9%) lead, despite Uniswap V2 having higher overall trading volume.

  • New efficiency metric: each opportunity-creating transaction attracts only ~1.6 successfully executed arbitrage transactions on average, which reflects standard multi-bid practice rather than excessive competition. This ratio is only measurable via attribution.


1. The attribution problem

MEV research has focused on extraction mechanisms: how searchers capture value through arbitrage, liquidations, and sandwich attacks, and on mitigation. The creation side is underspecified. Which transactions generate the conditions extractors exploit? Which protocols generate the most arbitrage opportunities? Which users unintentionally create value for bots?

Without attribution, these questions have no systematic answer. Existing work either measures aggregate extraction (Qin et al., 2022) or locates approximate sources by scanning preceding swaps (Torres et al., 2024), but neither quantifies individual causal contributions to extracted value.

We formalize the problem as follows. A block B is an ordered sequence (T_1, \ldots, T_n) with sequential state evolution S_k = \Sigma(S_{k-1}, T_k). Let \mathcal{M}(S, T_{arb}) be the profit from executing T_{arb} in state S. Total extracted profit is \Pi = \mathcal{M}(S_{k-1}, T_{arb}). The candidate source set C consists of all transactions T_i preceding T_{arb} that interact with the same liquidity pools. We seek \phi_i for each candidate T_i \in C satisfying:

\sum_{T_i \in C} \phi_i + \phi_{base} = \Pi, \quad \phi_{base} = \mathcal{M}(S_0, T_{arb})

where \phi_{base} is the profit attributable to pre-block state. A source transaction may precede T_{arb} within the same block or up to D = 100 earlier blocks.

The key property that makes this well-posed is determinism: EVM state transitions are fully reproducible from an archive node. “What would \Pi be had T_i not executed?” is computable, not estimated. This distinguishes blockchain attribution from systems attribution work that operates under probabilistic assumptions.

We focus on atomic arbitrage, formally defined following Vostrikov et al.: transactions with at least two swaps (N \geq 2), non-negative net balance change for each asset (\Delta(A) \geq 0), and positive profit after fees \tau and prioritization bids \beta (\text{Profit} = \sum_{A} \Delta(A) \cdot P(A) - \tau - \beta > 0). This is the most frequent and fully observable MEV category, with the entire causal chain contained in the on-chain record.
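The three conditions above can be sketched as a simple predicate. This is a minimal illustration, not the paper's implementation: the `TxSummary` record and its field names are hypothetical, and it assumes the per-asset balance changes, prices, fees, and bid have already been extracted from the transaction trace.

```python
from dataclasses import dataclass


@dataclass
class TxSummary:
    """Hypothetical per-transaction summary; field names are illustrative."""
    num_swaps: int                    # N
    balance_deltas: dict[str, float]  # asset -> net balance change Delta(A)
    prices: dict[str, float]          # asset -> price P(A) in the accounting unit
    fees: float                       # tau
    bid: float                        # beta


def arbitrage_profit(tx: TxSummary) -> float:
    """Profit = sum_A Delta(A) * P(A) - tau - beta."""
    gross = sum(delta * tx.prices[asset] for asset, delta in tx.balance_deltas.items())
    return gross - tx.fees - tx.bid


def is_atomic_arbitrage(tx: TxSummary) -> bool:
    """Check N >= 2, Delta(A) >= 0 for every asset, and Profit > 0."""
    if tx.num_swaps < 2:
        return False
    if any(delta < 0 for delta in tx.balance_deltas.values()):
        return False
    return arbitrage_profit(tx) > 0
```

A transaction that ends with a negative balance in any asset fails the second check even if its net profit is positive, which is what separates atomic arbitrage from an ordinary profitable trade.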


2. Four attribution methods

We implement four methods spanning the accuracy/cost spectrum. Simulation, coefficient, and Shapley are retrospective: they analyze finalized blocks. Bot-data operates in real time on pending transactions and is used as external validation rather than primary attribution.

2.1 Simulation-based attribution (primary)

Counterfactual replay to isolate the causal impact of specific transactions on arbitrage profitability. Three phases:

Phase 1 — Candidate filtering. Eliminate transactions that do not interact with any pool in the arbitrage route.

Phase 2 — Binary search for the edge transaction. Binary search identifies T_{edge}: the boundary beyond which the profit of T_{arb} drops below 5% of \Pi. The search begins within the current block and extends backwards up to D blocks. This gives O(\log |C|) simulations for this phase.

Phase 3 — Backward impact calculation. Traverse backwards from T_{arb} to T_{edge}, computing the marginal impact of each transaction:

\text{Imp}_i = \Pi(T_i) - \Pi(T_{i-1})

where \Pi(T_i) is the profit available if T_{arb} were executed immediately after T_i. The source is the transaction with maximum positive impact, ties broken by proximity to T_{arb}:

T_{src} = \arg\max \text{Imp}_i \quad \text{over } T_i \in [T_{edge}, T_{arb}]


Simulation-based attribution pipeline. (1) Filter transactions by pool intersection (yellow). (2) Binary search backwards to find edge transaction T_{edge} where profit drops below 5% threshold (blue). (3) Compute marginal impacts via backward pass; select source transaction with maximum impact (green).

In our dataset, 99.3% of attributable opportunities have T_{edge} within 7 blocks of T_{arb}, making D = 100 a conservative bound.
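The three phases can be sketched on a toy model. This is a minimal sketch under stated assumptions, not the paper's implementation: `profit_after(i)` is a hypothetical callable standing in for an archive-node replay (the profit of T_{arb} if executed immediately after candidate i, with i = -1 meaning before the first candidate), the marginal impact of a candidate is taken as the profit just after it minus the profit just before it, and the binary search assumes that below-threshold positions form a contiguous prefix of the candidate list.

```python
from typing import Callable, Iterable, List, Optional, Set, Tuple


def filter_candidates(txs: Iterable[Tuple[str, Set[str]]], route_pools: Set[str]) -> List[str]:
    """Phase 1: keep transactions touching at least one pool on the arbitrage route."""
    return [tx_id for tx_id, pools in txs if pools & route_pools]


def find_edge(profit_after: Callable[[int], float], n: int, total: float,
              threshold: float = 0.05) -> int:
    """Phase 2: smallest index e with profit_after(e) >= threshold * total.

    Assumes total > 0 and that the predicate is monotone in the index,
    so O(log n) calls to profit_after suffice.
    """
    lo, hi = 0, n - 1  # profit_after(n - 1) == total, so hi always qualifies
    while lo < hi:
        mid = (lo + hi) // 2
        if profit_after(mid) >= threshold * total:
            hi = mid
        else:
            lo = mid + 1
    return lo


def attribute_source(profit_after: Callable[[int], float], n: int,
                     threshold: float = 0.05) -> Tuple[Optional[int], float, int]:
    """Phase 3: backward pass from the last candidate down to the edge."""
    total = profit_after(n - 1)
    edge = find_edge(profit_after, n, total, threshold)
    best_idx, best_imp = None, 0.0
    for i in range(n - 1, edge - 1, -1):
        imp = profit_after(i) - profit_after(i - 1)
        if imp > best_imp:  # strict '>' keeps the candidate closest to T_arb on ties
            best_idx, best_imp = i, imp
    return best_idx, best_imp, edge


# Toy profits: before any candidate, then after candidates 0..3 (values are made up).
profits = [0.0, 0.0, 0.1, 10.0, 9.0]
profit_after = lambda i: profits[i + 1]
source, impact, edge = attribute_source(profit_after, n=4)
print(source, edge)  # 2 2
```

In the toy data, candidate 3 has a negative impact (it consumes part of the opportunity) and candidate 2 is selected as the source. When no candidate has a positive impact the sketch returns `None` for the source index.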

2.2 Shapley-based attribution (ground truth)

Cooperative game theory applied to fair attribution when multiple transactions may jointly contribute. The candidate set C forms a cooperative game with value function V(S) = profit achievable after executing exactly the transactions in S. Shapley value:

\phi_i = \sum_{S \subseteq C \setminus \lbrace T_i \rbrace} \frac{|S|!(|C|-|S|-1)!}{|C|!} \left(V(S \cup \lbrace T_i \rbrace) - V(S)\right)

This satisfies efficiency (\sum \phi_i = \Pi - \phi_{base}), symmetry, dummy, and additivity.

Exact computation is O(2^{|C|}), feasible only for |C| < 20. For |C| \geq 20, we use Monte Carlo sampling: draw N random permutations of C and estimate each Shapley value as the average marginal contribution across permutations. The default is N = 1000; estimates stabilize within 5% of asymptotic values after approximately 500 samples.
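Both estimators can be sketched in a few lines. This is an illustrative sketch, not the paper's code: the value function here is a hypothetical lookup table rather than a block replay, and the player names and profit figures are made up. The toy game has one opportunity creator "A", one competing arbitrageur "B" that only consumes profit when "A" is present, and a pre-existing baseline of 1.0.

```python
import itertools
import math
import random
from typing import Callable, Dict, FrozenSet, Sequence

ValueFn = Callable[[FrozenSet[str]], float]


def exact_shapley(players: Sequence[str], value: ValueFn) -> Dict[str, float]:
    """Exact Shapley values by enumerating every coalition (O(2^n) evaluations)."""
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for r in range(n):
            weight = math.factorial(r) * math.factorial(n - r - 1) / math.factorial(n)
            for subset in itertools.combinations(others, r):
                s = frozenset(subset)
                phi[p] += weight * (value(s | {p}) - value(s))
    return phi


def monte_carlo_shapley(players: Sequence[str], value: ValueFn,
                        num_samples: int = 1000, seed: int = 0) -> Dict[str, float]:
    """Average marginal contribution over random permutations of the players."""
    rng = random.Random(seed)
    totals = {p: 0.0 for p in players}
    order = list(players)
    for _ in range(num_samples):
        rng.shuffle(order)
        coalition: FrozenSet[str] = frozenset()
        prev = value(coalition)
        for p in order:
            coalition = coalition | {p}
            cur = value(coalition)
            totals[p] += cur - prev
            prev = cur
    return {p: t / num_samples for p, t in totals.items()}


# Hypothetical value table: V(S) = profit of T_arb when exactly S has executed.
TABLE = {
    frozenset(): 1.0,            # phi_base: profit from pre-existing state
    frozenset({"A"}): 11.0,      # A creates the opportunity
    frozenset({"B"}): 1.0,       # B alone changes nothing
    frozenset({"A", "B"}): 7.0,  # B consumes part of what A created
}
value = lambda s: TABLE[frozenset(s)]

exact = exact_shapley(["A", "B"], value)
print(exact)  # {'A': 8.0, 'B': -2.0}
```

The exact values sum to 6.0, which equals V(C) minus the baseline V(empty set), matching the efficiency property. Each sampled permutation telescopes to the same total, so the Monte Carlo estimates also sum to 6.0 regardless of the number of samples.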

2.3 Coefficient-based attribution (fast screening)

The K-value method: compute the price multiplier coefficient k representing theoretical profitability of the arbitrage cycle at infinitesimal size. Attribute the opportunity to the transaction maximizing the marginal coefficient change:

\delta k_i = k(S_i) - k(S_{i-1})

Requires only pool reserve data from transaction logs: O(1) per candidate, with no archive node replay needed. The method does not account for liquidity depth or slippage, so it degrades for large-volume opportunities and is suited to initial screening only.
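A minimal sketch of the idea follows. The exact formula for k is not given above, so this sketch assumes constant-product pools and takes k as the product of fee-adjusted marginal exchange rates around the cycle (k > 1 meaning an infinitesimal trade is profitable). The `Pool` tuple layout, function names, and reserve figures are hypothetical.

```python
from typing import List, Sequence, Tuple

# One hop of the route: (reserve_in, reserve_out, fee), oriented along the cycle.
Pool = Tuple[float, float, float]


def cycle_coefficient(route: Sequence[Pool]) -> float:
    """Assumed form of k: product of marginal rates (1 - fee) * reserve_out / reserve_in."""
    k = 1.0
    for reserve_in, reserve_out, fee in route:
        k *= (1.0 - fee) * reserve_out / reserve_in
    return k


def coefficient_source(snapshots: Sequence[Sequence[Pool]]) -> Tuple[int, List[float]]:
    """snapshots[0] is the route state before the first candidate;
    snapshots[i + 1] is the state after candidate i.
    Returns the candidate with the largest delta k, plus all deltas.
    """
    ks = [cycle_coefficient(s) for s in snapshots]
    deltas = [ks[i + 1] - ks[i] for i in range(len(ks) - 1)]
    best = max(range(len(deltas)), key=lambda i: deltas[i])
    return best, deltas


# Toy two-hop cycle; reserve values are made up for illustration.
balanced = [(1000.0, 1000.0, 0.003), (1000.0, 1000.0, 0.003)]
after_swap = [(900.0, 1110.0, 0.003), (1000.0, 1000.0, 0.003)]    # candidate 0 skews hop 1
after_second = [(900.0, 1110.0, 0.003), (1010.0, 990.0, 0.003)]   # candidate 1 narrows the gap
best, deltas = coefficient_source([balanced, after_swap, after_second])
print(best)  # 0
```

Because k is evaluated at infinitesimal size, two candidates that move the marginal price equally receive the same delta even if one leaves far less depth to trade against, which is the limitation noted above.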

2.4 Bot-data attribution (external validation)

Uses execution logs from production MEV searchers operated by the affiliated company as a proxy for ground truth. An RL agent (GNN encoder + MLP value head, trained via PPO on historical Polygon data, <10 ms inference) monitors pending transactions and, for each candidate T_i, computes the optimal arbitrage route and maximum bid yielding positive expected profit. When the bot submits a bid on a route triggered by T_i, this is interpreted as evidence that T_i was identified as the primary opportunity creator at submission time.

Coverage is limited to 38.4%: the bot’s real-time mempool visibility misses last-in-block arbitrage by design. Bot data reflects searcher intentions at submission time, not retrospective causal analysis of finalized blocks, which is why it serves as validation rather than primary attribution.

Note: the bot-data component relies on proprietary bidding infrastructure and will not be released as part of the open artifacts.


3. Empirical results

Datasets. Large-scale analysis: blocks 83,770,001–84,820,000 on Polygon, 360,026 atomic arbitrage events, $334,799 extracted (USD at execution-time oracle prices; internal computation in MATIC). Method comparison: blocks 82,546,747–82,567,395 (~12-hour window, February 4, 2026), 2,526 atomic arbitrage events, used for exhaustive Shapley computation and all four-method comparison. Ground truth coverage in the February dataset: bot consensus 38.4%, exact Shapley 12.0% (candidate sets with fewer than 20 transactions, where exact computation is feasible), 23% attributed to pre-block state (opportunities pre-existing before the block, accounting for less than 4% of total profit). Implementation: modified Geth archive node, Rust for performance-critical components, Python for statistical analysis; 32-core cluster (Intel Xeon Platinum, 128 GB RAM).

Method | Metric | Coverage | Mean time/event
Bot-data | 94.2% accuracy* | 38.4% | 8 ms
Simulation | 91.7% accuracy* | 99.1% | 12.3 ms
Coefficient | 77.2% agreement with simulation** | 88.4% | 0.8 ms
Shapley (exact) | 100% accuracy* | 98.1% | ~5 min
Shapley (MC, 1k) | 100% accuracy* | 98.1% | 2.1 s

*Accuracy measured against triangulated ground truth (bot consensus + exact Shapley + manual review of 200 stratified events).

**Not evaluated independently against ground truth; measured as agreement with simulation output.

Simulation processed the full March 2026 dataset (360,026 events) in ~80 hours, consistent with 12.3 ms/event.

3.1 Shapley case study: block 82,563,006

Figure shows Shapley attribution for a representative event. Four candidate transactions interact with the arbitrage route: a non-arbitrage swap (index 129), two competing arbitrageurs (indices 130, 168), and the executed arbitrage (index 199). The non-arbitrage swap at index 129 receives the largest positive attribution (+32.58 MATIC) because it created the price disbalance. The competing arbitrageurs receive negative Shapley values: positive values indicate opportunity creation, while negative values indicate profit consumption by competing arbitrageurs.


Shapley attribution for an arbitrage event (block 82,563,006). Positive values indicate opportunity creation; negative values indicate profit consumption by competing arbitrageurs. The non-arbitrage transaction at index 129 is the primary source (+32.58 MATIC).

3.2 Monte Carlo convergence

For |C| \geq 20, exact Shapley is computationally intractable. Figure shows convergence for transaction 0xb1f2a5bb... in block 82,554,874, which has 16 candidates (|C| < 20, so exact values are available and shown as dashed lines). Four representative candidates with varying Shapley values: one near-zero, one exactly zero, two negative. Estimates stabilize within 5% of exact values after approximately 500 samples, validating the default of 1,000 samples.

Monte Carlo Shapley convergence for transaction 0xb1f2a5bb.. (block 82,554,874). Four subplots show convergence for candidates with varying Shapley values (one near-zero, one exactly zero, two negative). Exact Shapley values shown as horizontal dashed lines; Monte Carlo estimates (mean ±1 std over 100 runs) shown as points with shaded regions. Estimates stabilize within 5% after ~500 samples.

3.3 Single-source hypothesis

Figure shows a complex case from the same block (transaction 0xb1f2a5bb, block 82,554,874): 17 transactions with non-zero Shapley values, including the arbitrage transaction itself. Attribution is still dominated by a single source.

Across the February 2026 dataset (2,526 events): 83 events (3.3%) exhibit tied maximum Shapley values indicating genuine multi-source creation. Among these, 71 have 2 tied sources, 8 have 3, and 2 events have 4 or 6 tied sources. 42 of 83 (50.6%) are “blind” last-in-block arbitrage where all positive contributors share equal Shapley values; the remaining 41 involve cascades of interdependent arbitrageurs. For the dominant 96.7% single-source cases, one transaction accounts for >70% of total positive Shapley value.
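The classification rule can be sketched as follows. This is a minimal sketch: the 70% threshold comes from the text, while the function names, the tie tolerance, and the example Shapley vectors are hypothetical.

```python
from typing import Dict, List


def dominant_share(shapley: Dict[str, float]) -> float:
    """Largest positive Shapley value as a fraction of total positive Shapley value."""
    positives = [v for v in shapley.values() if v > 0]
    total = sum(positives)
    return max(positives) / total if total > 0 else 0.0


def is_single_source(shapley: Dict[str, float], threshold: float = 0.70) -> bool:
    """Single-source if one transaction holds more than `threshold` of positive value."""
    return dominant_share(shapley) > threshold


def tied_sources(shapley: Dict[str, float], rel_tol: float = 1e-9) -> List[str]:
    """Transactions whose Shapley value ties the (positive) maximum."""
    top = max(shapley.values())
    if top <= 0:
        return []
    return [tx for tx, v in shapley.items() if abs(v - top) <= rel_tol * top]


# Made-up Shapley vectors for illustration.
concentrated = {"swap": 8.0, "competitor": -2.0, "minor": 1.0}
tied = {"swap_a": 3.0, "swap_b": 3.0, "competitor": -1.0}
print(is_single_source(concentrated), is_single_source(tied))  # True False
```

Negative values are excluded from the denominator, so a competing arbitrageur that consumes profit does not dilute the creator's share.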

Shapley attribution for complex arbitrage (block 82,554,874, transaction 0xb1f2a5bb..). Seventeen transactions connected by non-zero Shapley values. Despite multiple participants, attribution remains dominated by a single source

3.4 Concentration and protocol distribution

Concentration (February 2026 dataset, Figure 6): the top 1% of arbitrageurs capture 80% of extracted value; the top 1% of opportunity-creating transactions generate a similar proportion of MEV opportunities. Each opportunity-creating transaction attracts only ~1.6 successfully executed arbitrage transactions on average, reflecting standard multi-bid practice rather than excessive competition. This ratio is only measurable via attribution.


MEV concentration: accumulated MEV value by percentile of top arbitrageurs (blue) vs. opportunity-creating transactions (orange). Top 1% of each group accounts for 80% of extracted value, yet the executed arbitrage-to-opportunity ratio remains ~1.6:1. Analysis based on 220,262 opportunity-creating transactions from February 2026 dataset.
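Both statistics are straightforward to compute once each arbitrage has been attributed to a source. The sketch below is illustrative only: the function names and the input layout (a list of per-entity values, and a list of (arbitrage, source) pairs) are assumptions, and the numbers are synthetic.

```python
from typing import Hashable, Sequence, Tuple


def top_share(values: Sequence[float], fraction: float = 0.01) -> float:
    """Share of the total held by the top `fraction` of entries (at least one entry)."""
    ordered = sorted(values, reverse=True)
    k = max(1, int(fraction * len(ordered)))
    total = sum(ordered)
    return sum(ordered[:k]) / total if total else 0.0


def arbs_per_opportunity(attributions: Sequence[Tuple[Hashable, Hashable]]) -> float:
    """Executed arbitrage transactions per distinct opportunity-creating transaction."""
    sources = {source for _, source in attributions}
    return len(attributions) / len(sources)


# Synthetic example: one entity holds 80 of 100 units of value.
values = [80.0] + [20.0 / 99] * 99
print(round(top_share(values), 2))  # 0.8

# Synthetic attribution pairs: 8 arbitrages traced to 5 distinct sources.
pairs = [("arb1", "s1"), ("arb2", "s1"), ("arb3", "s2"), ("arb4", "s2"),
         ("arb5", "s3"), ("arb6", "s3"), ("arb7", "s4"), ("arb8", "s5")]
print(arbs_per_opportunity(pairs))  # 1.6
```

The second function shows why the ratio depends on attribution: without a source assigned to each arbitrage, the denominator cannot be counted.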

Protocol distribution (February 2026 comparison dataset, 220,262 opportunity-creating transactions): 96.5% involve identifiable AMMs across 18 unique protocols. Participation frequencies sum >100% because individual transactions often interact with multiple AMMs:

Protocol | Participation frequency
Uniswap V3 | 58.0%
Algebra | 29.6%
Uniswap V4 | 28.9%
Uniswap V2 | 23.2%
DODO | 8.2%

Uniswap V3 and Algebra lead despite Uniswap V2 having higher overall trading volume. The paper notes that concentrated liquidity mechanisms, “while capital-efficient, create more frequent price disbalances exploitable by arbitrageurs.”

23% of events in the validation dataset are attributed to pre-block state: the opportunity pre-existed at block start and was not created by any in-block transaction in the search window. These account for <4% of total extracted profit.


4. Design implications

Protocol designers. Attribution identifies which protocols generate the most MEV leakage. The data shows that concentrated liquidity mechanisms, while capital-efficient, create more frequent exploitable disbalances, a concrete per-protocol signal for MEV risk.

Validators. Transaction ordering policy has been analyzed from the extractor’s perspective. Attribution adds the creator’s: a transaction creating a large attributable opportunity imposes a measurable externality on other users. The paper suggests ordering policies could account for opportunity creation to reduce aggregate MEV leakage.

MEV risk assessment. The ~1.6:1 executed arbitrage-to-opportunity ratio, measurable via attribution, provides a per-protocol observable for evaluating competitive intensity and market efficiency over time.


5. Limitations

Three primary limitations stated in the paper: (1) ground truth relies on triangulation rather than direct observation: causal relationships are not directly observable on-chain, which introduces potential bias if validation sources share systematic errors; (2) evaluation is limited to Polygon (February–March 2026), and network-specific factors (block time, searcher competition, gas pricing) may affect attribution dynamics on other chains; (3) scope is limited to atomic arbitrage, since liquidations, sandwich attacks, and top-of-block opportunities have distinct causal structures that require adapted models.

Future directions: real-time attribution, extension to other MEV categories, multi-chain comparison including Ethereum mainnet and L2s, attribution-aware ordering protocols.


Full paper: arXiv:2604.27979.

Authors: Andrei Seoev (MEV-X), Dmitry Belousov (MIPT), Anastasiia Smirnova (MEV-X), Ksenia Kurinova (MIPT), Aleksei Smirnov (MEV-X), Denis Fedyanin (HSE University), Yury Yanovich (Skoltech). Submitted to SIGCOMM’26.


This is the right question to be asking, and the counterfactual framing (which preceding state transition created the disbalance) is the correct one because EVM replay makes it exact rather than statistical. I want to push on one thing in the result, because I think it is more interesting than it first looks. The finding that 96.7% of opportunities trace to a single source transaction, where one transaction holds more than 70% of the positive Shapley value, is also telling you something about the game you are attributing over: on that 96.7% the game is effectively additive, or single player dominated. And on an additive game the Shapley value is not doing cooperative work. It collapses to the normalized marginal contribution, which is a Copeland or max impact share. So for almost the entire dataset your expensive Shapley ground truth and a cheap coefficient method are not agreeing by luck, they agree by construction, because there is no synergy for the cooperative machinery to resolve. The Shapley value earns its keep precisely on the 3.3% genuinely tied cases, and those are the events where causation is actually shared rather than localized.

I think that partition is the real science in here, not a caveat. If you split events by how concentrated the Shapley vector is, the concentrated set is where attribution is easy and additive and any coefficient method should recover it, and the spread set is where the cooperative game is load bearing and the cheaper methods should visibly degrade. A concrete prediction worth checking: the K value method’s accuracy should sit near the simulation method on the concentrated 96.7% and fall off on the tied 3.3%. If that holds, you have a principled and free way to route events, exact attribution only where the game is non additive, fast coefficient attribution everywhere else, and you can say which events needed cooperative game theory rather than running it on all of them.

The second thing is what attribution buys you downstream. A single transaction origin for 96.7% of opportunities is evidence that the opportunity is a surface of one structural relationship, the ordering discretion over that transaction’s price effect, rather than an emergent multi transaction phenomenon. That reframes mitigation. You can attribute the opportunity and police the surface, which is what most MEV mitigation does, or you can remove the precondition the surface rides on. A batch that collects orders over a window, clears them at one uniform price, and fixes intra batch order by a seed no single party controls has no per transaction price disbalance for anyone to attribute in the first place, because there is no privileged position in the sequence and no per order price to improve by moving you. Attribution measures where the value comes from with real precision. It does not by itself remove it, and the conservation intuition is that policing an attributed surface tends to relocate the value to the next surface of the same precondition rather than reduce it.

None of this is a knock on the method, the counterfactual exactness is the part the field has been missing and the single source result is a genuinely useful empirical fact. I am mostly flagging that your own data is drawing the line between where the cooperative game is necessary and where it is ceremony, and that the line is a tool.

This is an incredibly sharp read, and honestly, one of the most insightful comments we’ve ever received on our work!

Your insight about the 96.7% additivity and the collapse of the Shapley value is brilliant, and it perfectly aligns with a new direction we are actively developing in a follow-up paper we are currently finalizing. You essentially predicted the exact theoretical bottleneck we are trying to solve: if the game is additive, the expensive cooperative machinery is just ceremony, and we need a principled way to bypass it computationally.

In this new work, we train a lightweight MLP surrogate to approximate expensive counterfactual simulations using only observable on-chain signals. Your “concentrated vs. tied” partition perfectly maps to our learning theory: the 96.7% additive cases correspond to smooth, low-dimensional submanifolds where our surrogate converges almost instantly (we prove a sample complexity bound showing domain-aware features reduce the effective hypothesis dimension). The 3.3% synergistic cases are exactly where the surrogate’s uncertainty spikes. We are now exploring Contextual Multi-Armed Bandits to dynamically route compute: using the O(1) coefficient method for additive cases, and escalating to the surrogate or exact simulation only when uncertainty detects non-additive synergy. Crucially, this precise attribution directly informs protocol-level mitigation. Instead of just “policing the surface” (which merely relocates MEV), this logic can be embedded directly into Uniswap v4 afterSwap hooks. A hook can detect synergistic, toxic flow and dynamically recycle a portion of the arbitrageur’s profit back to the LP, or revert the trade if it breaches the Impermanent Gain (IG) zone — transforming attribution from a passive measurement tool into an active, on-chain coordination mechanism.

We are currently working on the next draft of this surrogate framework and will be pleased to share it with you once it’s ready. Your phrasing of “where the cooperative game is necessary vs. where it is ceremony” is going straight into the introduction — if you don’t mind! Also, since you brought up mitigation and LP incentives, we recently released another preprint on quantifying profitability zones (Impermanent Gain vs. Loss) for LPs and arbitrageurs (https://arxiv.org/pdf/2604.28014). It formalizes the exact “win-win” symbiotic zones that these hooks would be trying to protect, so it perfectly complements the mitigation side of your comment. Let’s definitely keep in touch!


Thanks seoeva, and the follow-up direction is the right one. An MLP surrogate over on-chain signals plus a contextual bandit that routes O(1) coefficient on the additive mass and escalates on detected non-additivity is exactly the learning-theoretic version of the concentrated-versus-tied partition, and afterSwap is the natural place to put it. Quote any of it freely.

Let me sharpen the half of my first comment I underplayed, because it matters for what your router escalates to. I spent that comment on where Shapley is redundant. The more useful claim is where it is irreplaceable, and how large the gap is there.

The clean way to say it is in terms of Harsanyi dividends. Any cooperative game decomposes into coalition dividends, and the Shapley value is the unique efficient, symmetric, linear rule that distributes every dividend to its members. A coefficient or pro-rata method captures only the singleton dividends, the standalone marginal values, and silently drops the higher-order dividends that live on coalitions of size two and up. On your 96.7%, where one transaction holds the dominant positive share, the higher-order dividends are near zero, so the cheap method and Shapley agree by construction rather than by luck. That is the collapse. On the genuinely tied remainder, the higher-order dividends are not a correction term, they are the entire object being measured, and only a rule that distributes them survives the four axioms at once. That is the regime where Shapley is not one option among four but the only consistent answer.
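The decomposition described above can be checked on two toy games. This is an illustrative sketch with made-up values and hypothetical function names: it computes Harsanyi dividends by inclusion-exclusion (after subtracting the empty-coalition baseline), recovers the Shapley value by splitting each dividend equally among its members, and compares that with a rule that keeps only the singleton dividends.

```python
import itertools
from typing import Callable, Dict, FrozenSet, Sequence

ValueFn = Callable[[FrozenSet[str]], float]


def harsanyi_dividends(players: Sequence[str], value: ValueFn) -> Dict[FrozenSet[str], float]:
    """d(S) = sum over T subset of S of (-1)^(|S|-|T|) * (v(T) - v(empty))."""
    base = value(frozenset())
    dividends = {}
    for r in range(1, len(players) + 1):
        for subset in itertools.combinations(players, r):
            d = 0.0
            for t_size in range(r + 1):
                for t in itertools.combinations(subset, t_size):
                    d += (-1) ** (r - t_size) * (value(frozenset(t)) - base)
            dividends[frozenset(subset)] = d
    return dividends


def shapley_from_dividends(players, dividends):
    """Each coalition's dividend is split equally among its members."""
    return {p: sum(d / len(s) for s, d in dividends.items() if p in s) for p in players}


def singleton_only(players, dividends):
    """A rule that keeps only standalone marginal values (singleton dividends)."""
    return {p: dividends[frozenset([p])] for p in players}


players = ["a", "b"]
weights = {"a": 3.0, "b": 2.0}
additive = lambda s: sum(weights[p] for p in s)        # no synergy between players
synergy = lambda s: 10.0 if len(s) == 2 else 0.0       # value exists only jointly

for game in (additive, synergy):
    d = harsanyi_dividends(players, game)
    print(shapley_from_dividends(players, d), singleton_only(players, d))
```

On the additive game the two rules coincide (3.0 and 2.0). On the synergistic game the singleton rule assigns 0.0 to both players while the Shapley value assigns 5.0 each, since the whole value sits in the size-two dividend.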

The piece I want to correct is magnitude. It is tempting to read 3.3% as a rounding error you can afford to approximate. I made that mistake in the other direction by emphasizing the collapse. In a DEX liquidity-reward mechanism I work on, the value function splits the same way yours does. There is an additive component, raw liquidity supplied, where pro-rata is exactly right and I would never reach for Shapley. Then there are non-additive components: the scarcity of whichever side was actually enabling trades to clear, and presence through volatility rather than after it. On the additive mass the two methods agree. On the non-additive tail, by construction of the value function, the Shapley allocation reweights individual rewards by up to roughly a factor of two and reorders the ranking outright. In a worked example the smallest position by capital becomes co-largest by reward, because that position was the scarce side the batch could not clear without. This is the designed allocation rather than a production measurement, but the point is structural: a minority of cases carry the divergence, and the divergence on those cases is large, not marginal. I would expect your empirical tail to behave the same way, which is the measurement I am most curious about below.

That has a concrete implication for the surrogate. Use it to route and it is excellent, because detecting non-additivity is a smooth, learnable signal. Use it to approximate the Shapley output on the non-additive tail and it will quietly fail, because an MLP minimizing average error smooths over exactly the rare high-order dividends that produce the divergence. Those cases look like outliers to the loss function, so the surrogate regresses them toward the additive prediction, which reintroduces the bias the whole exercise is meant to remove, on the cases where the bias is the entire point. Your bandit routing sidesteps this cleanly. A learned value-approximator on the tail would not, and it would underclaim divergence the same way I did.

One empirical question I would love to see you run, because it is the claim that ties this together. You report the 3.3% as a count share. Have you measured it value-weighted, by extracted profit rather than by event? My expectation, from the liquidity side, is that the synergistic tail is a small minority of events and a large share of strategically meaningful value, the same way the scarce enabling side is a minority of capital and a majority of cleared volume. If that holds in your Polygon data, the headline is not that 96.7% is cheap to attribute. It is that the expensive 3.3% is where most of the value, and all of the interesting coordination, actually lives.

There is a further reason the static gap understates the divergence, and it bears directly on your plan to push attribution into afterSwap hooks. The moment an attribution rule is embedded in the protocol it stops being a measurement and becomes a price, and participants optimize against the price you post. An additive approximation that is merely a little inaccurate as a measurement becomes a systematic misprice once it is in the loop, because it underpays exactly the synergistic opportunity-creation it cannot see, and that misprice selects against that behavior over time. The per-event reweighting I described is a one-shot snapshot. The behavioral divergence between an additive rule and a Shapley rule compounds across settlements through second-order effects on who acts, when, and in what combination, so the gap after a season of feedback is larger than the gap on any single event, in ways that are hard to bound in advance. That is the strongest argument for paying the Shapley cost on the tail rather than approximating it. On the 96.7% the approximation is harmless. On the 3.3% it is not a small error, it is a biased incentive, and biased incentives do not stay small.

Happy to share the liquidity-reward divergence breakdown if a second instance of the additive versus non-additive split is useful to the follow-up paper. The structure transfers cleanly across the two domains, which is usually a sign the abstraction is the real thing.