This is a preliminary announcement; we'll make the official announcement during the next All Core Devs call.
We (the TxRx team, ConsenSys) have implemented a Fork Choice compliance test generator and have generated Fork Choice compliance test suites.
The overall F/C compliance testing methodology is described here.
In this report we briefly describe the results of the initial implementation phase (i.e. the F/C test generator and F/C test suites). A more detailed description of the work is TBD.
This work was supported by a grant from the Ethereum Foundation.
Implementation status
Test generator
The initial version of the Fork Choice test generator is implemented and currently available as a draft consensus-specs PR. We have been focusing on minimizing the effort required for client implementer teams to adopt the generated tests. The only change to the existing FC test format is a small one: the addition of a new check, which is safe to ignore initially.
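As a rough illustration of why the new check is non-breaking: a test runner that dispatches checks by name can simply skip any check it does not recognize. The step layout and check names below are simplified stand-ins for illustration, not the exact FC test format:

```python
# Simplified stand-in for a client-side check dispatcher; the real FC test
# format and check names differ.
KNOWN_CHECKS = {"head", "time", "justified_checkpoint"}  # checks this runner implements

def verify(name, expected):
    ...  # client-specific verification would go here

def apply_checks(checks):
    for name, expected in checks.items():
        if name not in KNOWN_CHECKS:
            # An unrecognized (e.g. newly added) check is skipped, so older
            # runners keep working until they implement it.
            continue
        verify(name, expected)
```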
Test suites
At the moment we have developed test generation parameters for three suites; an Extended suite is planned as well.
| Test suite | Size | Purpose | Status | Link |
|---|---|---|---|---|
| Tiny | 135 tests | Demonstration, smoke testing | Done | link |
| Small | 1472 tests | Initial adoption, smoke testing | Done | link |
| Standard | 13240 tests | Main testing | Done | link |
| Extended | about 100K tests | Extended testing | TBD | |
Note: we are able to generate the Extended test suite. However, it would take significant time (about a week), so we have postponed generating it until there is demand for it.
It should be possible to generate test suites for any fork (Altair, Capella, Deneb) and preset (mainnet or minimal); however, test generation with the mainnet preset is very slow. We have tested the minimal/altair and minimal/deneb combinations.
Test generation is currently slow (about 10-15 seconds per test on average). However, a multiprocessing mode is supported (bringing this down to about 2 seconds per test on an Apple M1). Generating the Standard test suite takes about 8 hours in multiprocessing mode, or about two days in single-process mode.
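Since test cases are generated independently, the parallelism can be as simple as a worker pool mapped over per-test parameters. The `generate_one` function below is a hypothetical stand-in, not the generator's actual entry point:

```python
# Hypothetical sketch of parallel test generation; generate_one is a
# placeholder for the real per-test generation routine.
from multiprocessing import Pool

def generate_one(params):
    ...  # instantiate (and mutate) a single test case

def generate_suite(param_list, workers=8):
    with Pool(processes=workers) as pool:
        # each test case is independent, so generation parallelizes well
        return pool.map(generate_one, param_list)
```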
The reasons for the slow performance are known and will be addressed in the future. Currently, our top priority is to simplify adoption of the new test suites.
Testing the tests
We have run the generated tests against Teku (using the Teku test runner) and against the official executable Fork Choice spec (minimal/deneb), using a simple Python test runner.
Test generation approach
The test generation approach is a mix of model-based and fuzz testing.
Principles:
- the Fork Choice spec is virtually “decomposed” into two parts: topological sorting of events and actual event processing
- tests are generated for the event processing part; the topological sorting part is addressed via event shuffling (time shift plus drop/duplication)
- models are used to describe the spec aspects that we want to cover; there are two flavors: trees of various shapes (for block trees and super-majority link trees) and predicates to be covered (`filter_block_tree`)
- for each model there can be multiple solutions; each solution can be seen as a template (e.g. SM link tree + block tree) which can be instantiated in multiple ways (by varying validator actions)
- each test case can be mutated multiple times (see the sketch after this list)
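A minimal sketch of the resulting fan-out, where one model solution yields several instances and each instance yields several mutants. All names, the event representation, and the placeholder mutation are illustrative only:

```python
# Hypothetical fan-out: solution -> instances -> mutants.
import random

def instantiate(template, rnd):
    # flip a coin at each decision point, e.g. whether a validator attests
    return [(event, rnd.random() < 0.5) for event in template]

def mutate(events, rnd):
    out = list(events)
    rnd.shuffle(out)  # placeholder; the real operators shift/drop/duplicate events
    return out

def fan_out(solutions, n_instances=2, n_mutations=3, seed=0):
    rnd = random.Random(seed)
    tests = []
    for template in solutions:
        for _ in range(n_instances):
            base = instantiate(template, rnd)
            tests.append(base)  # the unmutated instance (the "1 +" in the counts below)
            tests.extend(mutate(base, rnd) for _ in range(n_mutations))
    return tests
```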
Tests are generated in the following steps:
- Models (implemented using MiniZinc) describe the abstract aspects that we want to cover. Currently there are three models: the SM link (super-majority link) tree model, the block tree model and the block cover model.
- For each model, a set of solutions is produced. The models are parameterized, which affects the size of the generated solution set.
- SM link and block tree solutions are combined into a single block tree.
- Each solution is instantiated using one of two test instantiators (block tree or block cover). The instantiation is randomized, i.e. a coin is flipped at each decision point. This results in a complete Fork Choice test case (i.e. an anchor state plus a sequence of tick | block | attestation | attester_slashing events).
- Each test case is mutated via mutation (shuffling) operators. Currently, there are three mutation operators: time shift, drop and duplicate (with a subsequent time shift); illustrative versions are sketched below.
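The following sketch shows one way the three operators could look on a simple (time, payload) event representation; that representation is an assumption for illustration, not the generator's actual data model:

```python
# Illustrative mutation operators over (time, payload) event lists; the real
# operators act on the generator's own event representation.
def time_shift(events, i, delta):
    # move event i later in time and re-sort, which can reorder delivery
    t, payload = events[i]
    out = list(events)
    out[i] = (t + delta, payload)
    return sorted(out, key=lambda e: e[0])

def drop(events, i):
    # remove event i entirely
    return events[:i] + events[i + 1:]

def duplicate(events, i, delta):
    # duplicate event i, shifting the copy later in time
    t, payload = events[i]
    return sorted(events + [(t + delta, payload)], key=lambda e: e[0])
```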
The models are developed manually.
Solutions to the models are produced with a special generator.
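For instance, assuming the models are plain .mzn files, solutions could be enumerated with the minizinc-python bindings; the file and parameter names below are hypothetical, and the actual generator may drive MiniZinc differently:

```python
# Hypothetical enumeration of model solutions via minizinc-python.
import minizinc

model = minizinc.Model("block_tree.mzn")   # hypothetical model file
solver = minizinc.Solver.lookup("gecode")
instance = minizinc.Instance(solver, model)
instance["n_blocks"] = 8                   # hypothetical model parameter
result = instance.solve(all_solutions=True)
for sol in result.solution:                # one entry per model solution
    ...                                    # turn each solution into a test template
```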
Test instantiation and mutation are performed by the test_gen.py script.
After tests are generated, one can validate the produced test steps using the test_run.py script, which executes the steps against the pyspec and performs the prescribed checks.
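In condensed form, such a validation loop might look as follows. The pyspec handlers (on_tick, on_block, on_attestation, on_attester_slashing, get_head) are real, while the step dictionary layout is a simplification:

```python
# Condensed sketch of step validation against the pyspec; the actual
# test_run.py is more involved, and the step layout here is simplified.
def validate_steps(spec, store, steps):
    for step in steps:
        if "tick" in step:
            spec.on_tick(store, step["tick"])
        elif "block" in step:
            spec.on_block(store, step["block"])              # SignedBeaconBlock
        elif "attestation" in step:
            spec.on_attestation(store, step["attestation"])
        elif "attester_slashing" in step:
            spec.on_attester_slashing(store, step["attester_slashing"])
        elif "checks" in step:
            if "head" in step["checks"]:
                assert spec.get_head(store) == step["checks"]["head"]
```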
Test structure
| Test group | Size (Standard suite) | Parameters (solutions * variations * (1 + mutations)) | Description |
|---|---|---|---|
| Block tree | 4096 tests | 1024 * 2 * (1+1) | focus on trees of varying shapes |
| Block weight | 2048 tests | 8 * 64 * (1+3) | focus on producing block trees with varying weights |
| Shuffling | 2048 tests | 8 * 4 * (1+63) | focus on shuffling/mutation operators |
| Attester slashing | 1024 tests | 8 * 16 * (1+7) | focus on attester slashings |
| Invalid messages | 1024 tests | 8 * 32 * (1+3) | focus on invalid messages |
| Block cover | 3000 tests | 60 * 5 * (1+9) | cover various combinations of predicates from the `filter_block_tree` method |
Future steps
- improve performance. Performance is adequate right now (for the initial adoption phase), but it is the main blocker otherwise.
- more flexible test generation: more and better models, better instantiators, better mutation operators
- coverage-guided fuzzing
- a new test vector format (fuzz testing doesn't need full test cases, since outputs have to be compared against the FC spec anyway)