Do we propagate transactions up the tree, requiring shards to process transactions irrelevant to them (thus somewhat defeating the scaling gains achieved by sharding)?
The latter. Though the gains from scaling will still remain. Suppose that we use a scheme which assigns each shard to a pair shard, cycling through all possible pairs (eg. there are p shards with prime p, in block N we match shards (N mod p, 2N mod p), (3N mod p, 4N mod p), etc). During each block, executors will have to download a witness for every transaction in the corresponding shard, so it increases the amount of work required by at most a factor of 2 (if every transaction is cross-shard), plus a bit more for witness overhead. If we cycle through leader shards that can affect anything, then executors of every other shard will have to execute the transactions of the leader shard first, and executors of the leader shard will have to download witnesses for the portions of the activity that touch other shards, once again increasing the workload at most by a factor of 2 plus witness overhead.