Lessons From Adopting Generative Testing
Over the past few months, I rewrote most of my test suite. It was a cleanup effort as well as a chance to rethink how I approach testing, something I've been neglecting for most of my programming career.
There were two main reasons for rewriting my tests. The first was that the codebase I've been working on has grown and more people are now working on it, but the existing tests weren't adequate. At the time, they were the largest source of tech debt in the project, and I'm mostly to blame for that... Most of the tests were half-broken, vibe-coded slop just there to satisfy Codecov. [1]
The other reason was that I wanted to try generative testing. I'm a big fan of TigerBeetle and Antithesis, both well known for their focus on software correctness and rigorous testing. I've read many of their blog posts and learned a lot of techniques and ideas, which I've always wanted to apply to my own projects.
The combination of tech debt pressure and unsatisfied curiosity eventually pushed me to delete my existing tests and rewrite them the way I imagined they should be. This post goes through that process and what I learned from it.
Notes on Terminology
Before going into more details, I just want to clarify some terms I'll use. There are a lot of overloaded terms in testing (and software in general), and people often use the same words to mean different things. You may disagree with these, and that's fine. This is just what I mean when I use them.
I use only two scopes for tests:
- Unit tests: Tests scoped to any component or subsystem of the full system. The size doesn't matter. It can be a single function, a module, or a group of modules. Some people will call tests that span multiple units "integration tests", but I disagree. That definition is too ambiguous. I use the term "integration" to describe a different concept.
- System tests: Tests scoped to the entire system. I also consider tests that replace production components with test versions to still be system tests, as long as the overall system is exercised as a whole.
I use the term integration tests for tests (any scope) that interact with "black boxes", e.g. third-party services, filesystems, networks, databases, etc. Or, in functional programming terms, impure tests. Tests without "integration" in their name are implicitly pure tests. Also, I use the term end-to-end (E2E) tests specifically for system-scoped integration tests.
Rethinking My Testing Approach
Before starting the rewrite, I spent some time thinking about how to approach it and what my tests should look like. As you can already tell from the title of this post, most of the rewrite involves writing generative tests.
Generative testing is basically a catch-all term for tests where you write code that generates test cases, instead of you specifying the test cases themselves. This includes randomised approaches like property-based testing or fuzzing, and non-random ones where you try every possible input.
Note, I'll use the terms "PBT" and "fuzzing" interchangeably going forward. Both involve using randomised input to test some property of the system. There are subtle differences, but in practice they're minor enough that I just consider them the same. [2]
Here are some of the main ideas that guided the rewrite:
Properties Over Examples
Property-based testing (PBT) is a testing approach where you specify properties of your code and throw randomised input at it to break those properties. I'm not going to explain it in detail here as there are already plenty of good resources online, and I don't have anything new to add. [3] Here, I'll just talk about my case for PBT.
While the system is still small, I wanted to catch most bugs early so they don't compound as the system grows. The existing example-based tests are limited in this regard, as they only cover cases I already know. Property-based tests are much more effective here, as they uncover cases and scenarios that we forgot or may never have thought of.
In other words, property-based tests find bugs; example-based tests catch regressions.
My plan was to replace most of the existing tests with property-based tests, keeping a few select example-based tests, which are still useful for:
- Smoke tests: Sanity checks of the happy path, sometimes including common error cases. I mainly use them for code that doesn't need extensive testing, e.g. thin wrappers or simple glue code.
- Documentation: Example-based tests are more readable, so I leave some to illustrate API usage or internal behaviour.
- Regression tests: Dedicated tests to ensure new changes don't reintroduce known bugs.
System-Wide Fuzzing
System tests are usually the most valuable as they validate your system as a whole. The hardest bugs usually emerge only when components are used together. PBT/fuzzing is scope-agnostic, so it makes a whole lot of sense to also apply it in system tests.
There are two approaches I went with for this: the infamous deterministic simulation testing (DST) and traditional full-system fuzzing. [4]
DST is basically just a system-scoped property-based test. Sure, it involves extra steps like replacing non-deterministic components (e.g. network, disk, clock, etc.) with deterministic counterparts, but the overall flow is the same as regular fuzzing or stateful PBT. If you'd like to learn more about DST, I recommend reading Antithesis's DST primer or Phil Eaton's blog post.
The main pitch of DST is that it finds the most outlandish bugs for you while keeping them perfectly reproducible, but as I eventually learned, it's not the silver bullet I initially thought it was. More on this later!
The other approach is an E2E test where we fuzz the system in its production configuration, i.e. without replacing the non-deterministic components. This means that the tests run in wall-clock time and may be flaky, but they provide more coverage, as the bugs they catch may not show up in DST. This follows the same principles as TigerBeetle's Vortex.
Testing with Intent
This isn't related to generative testing, but I still think it's worth including. One thing I've started to pay more attention to in my tests is the intent or thought behind them.
It's a similar concept to design. If you read design case studies or write-ups, you might be surprised at the amount of thought put into seemingly trivial details. Every decision is made on purpose. It's the same for code. I think through the program's architecture and how to structure the code, but I've always neglected doing it for tests.
You may argue that tests already have a purpose by default, e.g. to verify correctness, catch regressions, etc. I'm talking about stuff beyond that. Stuff like:
- Is this test testing intended behaviour, or is it just copying your implementation? The latter gives you false confidence, e.g. you can cover every branch and still be wrong.
- How should I test this? Not everything needs the same level of testing. Sometimes a simple smoke test is enough vs a full-blown exhaustive test.
- Do I need a test for this? Do I care enough about this to add a test? Every line of code you add is code you have to maintain.
Thinking about a test's intent also forces you to reason about the system under test (SUT). I treat it as a design exercise; others describe it like an experiment's hypothesis. I've also started adding comments on my tests to make them easier for other developers to understand, e.g. why a test exists, what behaviour/properties it checks, how it actually verifies them, etc.
The System Under Test
Here's a brief overview of the SUT to give you context for the examples later. All examples in this post are drawn from Cardinal, the game server component of the World Engine stack. It's written in Go.
To be more specific, Cardinal is a tick-based game server framework built around an archetypal entity-component-system (ECS) core. Users implement game logic as systems, which are executed each tick to advance the world state. Cardinal communicates with other services through NATS and persists state using periodic snapshots.
Every tick, Cardinal injects commands received from external services (e.g. game clients) into the ECS core and runs the registered systems. These systems process commands, update the world state, and emit events. After the tick ends, event handlers publish the emitted events to NATS for other services to consume.
Here's a diagram showing Cardinal's high-level architecture:
[Figure: Cardinal's high-level architecture]
Ok, hope that made sense. Now, on to the testing techniques.
Property-Based Testing Techniques
The hardest part of PBT is figuring out what properties to test, so I think one of the best ways to learn is to just see a lot of examples and start recognising patterns. Most of the techniques mentioned here came from TigerBeetle's blog and source code, as that's where I learned them. If you're also interested in PBT/fuzzing, definitely give them a read!
Testing Data Structures
Cardinal is mostly made up of data structures. Testing them is a bit different from pure functions often used in PBT examples. With pure functions, you simply assert on the output. With data structures, there are two ways to do it:
- Assert internal properties (invariants). An invariant is a condition that must always be true. You can test these if your tests have access to internal fields, or by placing assertions directly in the implementation.
- Assert against a model. This is called model-based testing (MBT), or stateful PBT, state machine testing, differential testing, etc. In MBT, you compare the behaviour of your implementation against a model with the same semantics. The model can be a simplified implementation or a trusted reference, e.g. a data structure from your language's standard library.
The idea is simple: run the same operations on both your implementation and the model, and assert that their states match.
I'll use Cardinal's componentManager to illustrate. Here's what it looks like:
type componentManager struct {
	nextID    componentID            // The next available component ID
	catalog   map[string]componentID // Component name -> component ID
	factories []columnFactory        // Component ID -> column factory
}

// register registers a new component type and returns its ID.
func (cm *componentManager) register(name string, factory columnFactory) (componentID, error) {}

// getID returns a component's ID given its name.
func (cm *componentManager) getID(name string) (componentID, error) {}
While the manager does more internally, the bulk of its work is component registration, which basically involves mapping component names to their IDs. Since this behaviour is essentially that of a map, I used Go's built-in map as the model because it has the same semantics. Here's a simplified version of the test:
func TestComponent_RegisterModelFuzz(t *testing.T) {
	prng := testutils.NewRand(t)
	impl := newComponentManager()
	model := make(map[string]componentID)
	operations := []string{opRegister, opGetID} // Just string constants

	for range 10_000 {
		switch testutils.RandOp(prng, operations) { // Pick a random operation (uniformly)
		case opRegister:
			name := randValidComponentName(prng)
			implID, implErr := impl.register(name, nil)
			modelID, modelExists := model[name]
			require.NoError(t, implErr)
			if modelExists {
				// Property: duplicate registrations return same ID.
				assert.Equal(t, modelID, implID)
			} else {
				model[name] = implID
			}
		case opGetID:
			name := randProbablyRegisteredComponentName(prng) // Can return non-existent keys
			implID, implErr := impl.getID(name)
			modelID, modelExists := model[name]
			// Property: getID returns same existence and value as model.
			assert.Equal(t, modelExists, implErr == nil)
			if modelExists {
				assert.Equal(t, modelID, implID)
			}
		}

		// Property: implementation matches model.
		assert.Len(t, impl.catalog, len(model))
		for name, modelID := range model {
			implID, err := impl.getID(name)
			require.NoError(t, err)
			assert.Equal(t, modelID, implID)
		}
	}
}
Now you might be thinking: wait, doesn't the implementation also use a map? The model is the same as the implementation. Your test essentially just compares two different maps here. Is this test even doing anything?
I had similar doubts too when I first wrote this. The benefits of this test aren't obvious at first (to me at least), but they become clear once you change the implementation. For example, I could replace the map with an exotic data structure to get a 100x performance boost. This test will still work (as long as the semantics are the same, of course) because it tests the behaviour, not the implementation.
Note, MBT and internal assertions aren't mutually exclusive. Ideally, you should use both. It's not shown in the example above, but I also have assertions in the implementation to guard against invalid states. They check for different things: internal assertions ensure invariants hold, MBT ensures it behaves correctly.
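As a rough illustration of what assertions inside the implementation can look like, here is a sketch on a hypothetical `registry` type. It is not Cardinal's componentManager, and the invariants are assumptions chosen for the example: the name-to-ID map and the ID-to-name slice must always describe the same set.

```go
package main

import "fmt"

// registry is a hypothetical stand-in for a manager that maps names to dense IDs.
type registry struct {
	nextID uint32
	ids    map[string]uint32 // Name -> ID
	names  []string          // ID -> name
}

func newRegistry() *registry {
	return &registry{ids: make(map[string]uint32)}
}

// register returns the existing ID for name, or assigns the next free one.
func (r *registry) register(name string) uint32 {
	if id, ok := r.ids[name]; ok {
		return id
	}
	id := r.nextID
	r.ids[name] = id
	r.names = append(r.names, name)
	r.nextID++
	r.checkInvariants() // Guard against invalid state after every mutation.
	return id
}

// checkInvariants panics if the internal state is inconsistent.
func (r *registry) checkInvariants() {
	// Invariant: all three views agree on how many names are registered.
	if int(r.nextID) != len(r.names) || len(r.ids) != len(r.names) {
		panic(fmt.Sprintf("size mismatch: nextID=%d ids=%d names=%d",
			r.nextID, len(r.ids), len(r.names)))
	}
	// Invariant: the slice and the map are inverses of each other.
	for id, name := range r.names {
		if got, ok := r.ids[name]; !ok || got != uint32(id) {
			panic(fmt.Sprintf("name %q at index %d maps to %d (present=%t)", name, id, got, ok))
		}
	}
}

func main() {
	r := newRegistry()
	fmt.Println(r.register("health"), r.register("position"), r.register("health"))
}
```

A model-based test driving `register` would then exercise these checks for free on every generated operation, without the test needing access to the internal fields.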
Swarm Testing
I complement all my model-based tests with swarm testing. In the previous example, the test picks a random operation, where every operation has the same chance of being selected (a uniform distribution). In practice, this usually doesn't lead to interesting scenarios.
The idea behind swarm testing is to randomise the test configuration. In our case, that means randomising which operations are performed. We can do this by picking a subset of all possible operations and assigning each a random weight (probability of being picked).
TigerBeetle uses Zig's metaprogramming magic to automatically derive operations from a struct's fields. You could probably do this in Go using runtime reflection, which I only realised while writing this... What I ended up doing was writing helper functions that take a hardcoded list of operations and randomise them:
type OpWeights = map[string]uint64 // Operation names -> weights

// RandOpWeights selects a random subset of operations and assigns random weights to them.
// At minimum 1 operation is enabled, at most all operations are enabled.
func RandOpWeights(r *rand.Rand, ops []string) OpWeights {
	assert.That(len(ops) > 0, "you need at least one operation to randomize")

	// Randomly select how many operations to enable (1 to len(ops)).
	numEnabled := 1 + r.IntN(len(ops))

	// Shuffle and take the first numEnabled operations.
	shuffled := slices.Clone(ops)
	r.Shuffle(len(shuffled), func(i, j int) {
		shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
	})

	weights := make(map[string]uint64, numEnabled)
	for i := range numEnabled {
		weights[shuffled[i]] = uint64(1 + r.IntN(100))
	}
	return weights
}

// RandWeightedOp returns a random weighted operation from an OpWeights.
func RandWeightedOp(r *rand.Rand, ops OpWeights) string {
	var total uint64
	for _, weight := range ops {
		total += weight
	}
	pick := r.Uint64N(total)
	// Go randomises map iteration order, so walk the names in sorted order to keep
	// the choice reproducible for a given seed (uses the maps and slices packages).
	for _, op := range slices.Sorted(maps.Keys(ops)) {
		weight := ops[op]
		if pick < weight {
			return op
		}
		pick -= weight
	}
	panic("unreachable")
}
Here's what the previous test looks like with swarm testing:
func TestComponent_RegisterModelFuzz(t *testing.T) {
	prng := testutils.NewRand(t)

	// Get a random weighted subset of your operations.
	operations := []string{opRegister, opGetID}
	weights := testutils.RandOpWeights(prng, operations)

	for range 10_000 {
		op := testutils.RandWeightedOp(prng, weights)
		switch op {
		case opRegister:
			// ...
		case opGetID:
			// ...
		}
		// ...
	}
}
Initially, I thought regular MBT was enough, until I found a bug in my code. It's the type of bug that can be found faster through swarm testing. Here's a snippet of the buggy code from Cardinal's event manager type. Can you spot the issue?
func NewManager(channelCapacity int) Manager {
	return Manager{
		channel: make(chan Event, channelCapacity),
		// ...
	}
}

// Enqueue enqueues an event into the Manager.
func (m *Manager) Enqueue(event Event) {
	m.channel <- event
}
The bug here is that Enqueue blocks when the channel capacity is reached.
The existing tests never found it, and it never happened during local development, so this bug went unnoticed for a long time. While I was writing the model tests, I experimented with increasing the number of operations. The tests eventually got stuck, revealing the bug.
This is exactly the kind of bug swarm testing helps catch. The original test runs both enqueue and dequeue operations, so the channel never fills up. If the dequeue operation is disabled, or is less likely to be picked, the channel quickly reaches capacity, and the bug becomes obvious!
After fixing the bug, I added a regression test to ensure it doesn't happen again:
func TestEvent_EnqueueChannelFull(t *testing.T) {
	synctest.Test(t, func(t *testing.T) {
		const channelCapacity = 16
		const totalEvents = channelCapacity * 3 // Well beyond channel capacity

		impl := event.NewManager(channelCapacity)

		// Enqueue more events than channel capacity.
		go func() {
			for i := range totalEvents {
				impl.Enqueue(event.Event{Kind: event.KindDefault, Payload: i})
			}
		}()

		// Wait for all goroutines to complete or durably block.
		// If enqueue blocks, synctest.Test will detect deadlock and fail.
		synctest.Wait()

		// Extra property checks...
	})
}
This test uses Go's new testing/synctest to detect deadlocks. Unlike traditional timeout-based tests, goroutines spawned inside the synctest.Test bubble are tracked, which allows the runtime to detect deadlocks/leaks quicker and deterministically.
Concerns with Model-Based Testing
A big concern I had when writing these tests was how to know whether my model is correct. I'm usually confident when the models are from Go's standard library, since they've probably been tested to death by maintainers and by users in production.
But sometimes the standard library doesn't have anything with the same behaviour as the data structure I'm testing. In these cases, I'm forced to write my own simplified model implementation. The hard part was verifying the model. How do I know if it's correct? Do I test my model? What do I test it against? Should I test my tests? It's turtles all the way down.
Turns out this is a well-known problem called the test oracle problem. A test oracle is basically what determines the correct output of a test. In example-based tests, you are the oracle because you specify the expected output. In MBT, the model is the oracle. As far as I know, there's no complete solution to this problem. [5]
My only practical suggestion is to make your models (and tests in general) "obviously correct". Prioritise simplicity and clarity above everything else. Keep the code short and easy to audit. It doesn't matter if it's slow or inelegant. Ideally, you should be able to convince yourself (and others) that the code is correct just by reading it.
Reducing the State Space
Some behaviours are surprisingly difficult to test, especially when they depend on many interacting variables. The number of possible variable combinations grows quickly, leading to a state space explosion. When this happens, uniform random sampling becomes less effective as most samples fall into "empty space" rather than edges and boundaries where bugs often hide.
Here's an example. Cardinal's ECS has the concept of archetypes. An archetype represents a unique combination of component types. It stores a list of entities belonging to that archetype as well as their component data. Here's what it looks like:
type archetype struct {
	id         archetypeID      // Corresponds to the index in the archetypes array
	components bitmap.Bitmap    // Bitmap of components contained in this archetype
	rows       sparseSet        // Maps entity ID -> row index in entities and columns
	entities   []EntityID       // List of entities of this archetype
	columns    []abstractColumn // List of columns containing component data
}
When components are added to and/or removed from an entity, it is moved to a different archetype with the required component set. This operation involves creating the entity in the destination archetype, copying its components, and removing it from the source archetype:
// moveEntity moves an entity from one archetype to another.
func (a *archetype) moveEntity(destination *archetype, eid EntityID) {
	if a == destination {
		return
	}

	row, exists := a.rows.get(eid)
	assert.That(exists, "entity is not in archetype")

	// Create a new entity in the destination archetype with the id.
	destination.newEntity(eid)
	newRow, exists := destination.rows.get(eid)
	assert.That(exists, "new entity isn't created in the destination archetype")

	// Move entity's components to the new archetype.
	for _, dst := range destination.columns {
		for _, src := range a.columns {
			if dst.name() == src.name() {
				value := src.getAbstract(row)
				dst.setAbstract(newRow, value)
			}
		}
	}

	// Remove the entity from the current archetype, which also updates the row mapping.
	a.removeEntity(eid)
}
There are a lot of variables involved in a move operation:
- The source and destination archetypes.
- The number of components added, removed, or copied.
- The number of entities in both archetypes.
- Which entity is being moved.
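If we only consider the source and destination archetypes, the input space is already huge. If there are $n$ component types, there are $2^n$ possible archetypes and therefore $2^n \times 2^n = 4^n$ source-destination pairs.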
To make testing this more effective, I intentionally limited the size of the state space. The rationale is based on the small scope hypothesis, which argues that most bugs can be found within small input sizes. [6] Larger inputs don't necessarily lead to new execution paths. Borrowing the words of Matklad:
Most of the bugs involve small, tricky examples. If a sorting routine breaks on some array with ten thousand elements it's highly likely that there's a much smaller array (a handful of elements), which exposes the same bug.
With a smaller scope, we can now feasibly test all possible variable combinations exhaustively. Some of you might already know where I'm heading here. Yes, I'm talking about Exhaustigen. It's a helper library for generating combinations of multiple variables exhaustively. I ported TigerBeetle's Zig version to Go, which you can find here.
For my archetype move test, I reduced my variables to:
- Maximum 6 component types, which results in 64 possible archetypes. This produces $64 \times 64 = 4096$ source-destination pairs. I arrived at this number after some experimentation. Higher counts lead to exponentially longer test runs.
- 1-3 entities in the source and 0-2 entities in the destination. The source must have at least one entity to move.
- The entity to be moved is randomly chosen from the source. Entities and components are stored in slices under the hood. This verifies moves work correctly regardless of whether the entity is at the beginning, middle, or end of the slice.
With that, my number of test cases becomes:
Hereās a simplified version of the test:
func TestArchetype_MoveExhaustive(t *testing.T) {
	prng := testutils.NewRand(t) // Used by the elided property checks
	const maxComponents = 6

	// Exhaustively test combinations of source * destination * entity count * moved entity row.
	gen := testutils.NewGen()
	for !gen.Done() {
		// Generate all possible subsets (archetypes) with N component types.
		// The pool is rebuilt each iteration so entities from an earlier combination don't leak into this one.
		pool := newArchetypePool(maxComponents)

		src := pool[gen.Index(len(pool))] // Enumerate source
		dst := pool[gen.Index(len(pool))] // Enumerate destination

		// Populate source with 1-3 entities, destination with 0-2 entities.
		countSrc := gen.Range(1, 3) // Enumerate source entity count
		for eid := range countSrc {
			src.newEntity(EntityID(eid))
		}
		countDst := gen.Range(0, 2) // Enumerate destination entity count
		for eid := range countDst {
			dst.newEntity(EntityID(countSrc + eid)) // Entity IDs are globally unique
		}

		// Pick which entity to move.
		row := gen.Index(len(src.entities)) // Enumerate moved entity row
		eid := src.entities[row]

		if src == dst {
			assert.NotPanics(t, func() {
				src.moveEntity(dst, eid)
			}, "self move should be a no-op and must not panic")
			continue
		}

		src.moveEntity(dst, eid) // Do the actual move.

		// Assert properties here...
	}
}
Exhaustigen's API is so nicely designed. It still amazes me that this small amount of code is enough to cover all variable combinations. It does some intricate machinery under the hood, so if you're curious about how it works, check out the original blog post!
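For a rough idea of that machinery, here is a sketch of an Exhaustigen-style generator in Go. It is not the actual testutils port; the method names `Done`, `Index`, and `Range` mirror the calls in the test above, and everything else is an assumption. The generator records each choice made during an iteration and, on the next call to `Done`, increments the recorded choices like an odometer.

```go
package main

import "fmt"

// choice is one recorded decision: its current value and inclusive upper bound.
type choice struct{ value, bound int }

// Gen enumerates every combination of the choices made between calls to Done.
type Gen struct {
	started bool
	choices []choice // One entry per Index/Range call in the current iteration
	pos     int      // Number of choices consumed so far in this iteration
}

func NewGen() *Gen { return &Gen{} }

// Done advances to the next combination and reports whether none are left.
func (g *Gen) Done() bool {
	if !g.started {
		g.started = true
		return false
	}
	// Bump the last choice that still has room, and forget everything after it
	// so that later choices restart from zero (and may get different bounds).
	for i := len(g.choices) - 1; i >= 0; i-- {
		if g.choices[i].value < g.choices[i].bound {
			g.choices[i].value++
			g.choices = g.choices[:i+1]
			g.pos = 0
			return false
		}
	}
	return true
}

// next replays the recorded value for this position, or starts a new choice at 0.
func (g *Gen) next(bound int) int {
	if g.pos == len(g.choices) {
		g.choices = append(g.choices, choice{value: 0, bound: bound})
	}
	g.choices[g.pos].bound = bound
	g.pos++
	return g.choices[g.pos-1].value
}

// Index returns a value in [0, n). n must be positive.
func (g *Gen) Index(n int) int { return g.next(n - 1) }

// Range returns a value in [lo, hi], inclusive.
func (g *Gen) Range(lo, hi int) int { return lo + g.next(hi-lo) }

// countCombinations counts iterations for dimensions shaped like the move test:
// source, destination, 1-3 source entities, 0-2 destination entities, moved row.
func countCombinations(archetypes int) int {
	n := 0
	gen := NewGen()
	for !gen.Done() {
		_ = gen.Index(archetypes) // Source
		_ = gen.Index(archetypes) // Destination
		countSrc := gen.Range(1, 3)
		_ = gen.Range(0, 2)
		_ = gen.Index(countSrc) // Later choices can depend on earlier ones
		n++
	}
	return n
}

func main() {
	fmt.Println(countCombinations(64)) // 64 * 64 * 3 * (1+2+3) = 73728
}
```

The useful property is that bounds can depend on earlier choices (the row index depends on the entity count), which a fixed set of nested loops handles less gracefully once the number of variables grows.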
On Tooling
I didn't use any third-party PBT libraries, e.g. Rapid, or Go's testing/quick and testing.F. For my use cases, Go's rand.Rand PRNG already provided most of the functionality I needed.
As I wrote the tests, I gradually extracted utilities once the patterns became clear. These helpers live in a small testutils package. The first thing I added was support for a global seed for the test PRNGs.
Since PBT relies on randomness, rerunning the same test can produce different results. To make failures reproducible, I seed all test PRNGs with the same value. If a test fails, I can reproduce it by rerunning it using the same seed:
TEST_SEED=0x0123456789abcdef go test ./...
The seed is printed if you run the tests with the -v (verbose) flag. Here's the code for it:
var Seed uint64 // Global test seed

func init() {
	Seed = uint64(time.Now().UnixNano())
	if envSeed := os.Getenv("TEST_SEED"); envSeed != "" {
		parsed, err := strconv.ParseUint(envSeed, 0, 64)
		if err == nil { // Only set using the env if it's valid
			Seed = parsed
		}
	}
	fmt.Printf("to reproduce: TEST_SEED=0x%x\n", Seed)
}

func NewRand(t *testing.T) *rand.Rand {
	t.Helper()
	// We derive a unique seed for each test by XOR-ing the global seed with a hash of the test name.
	h := fnv.New64a()
	_, err := h.Write([]byte(t.Name()))
	require.NoError(t, err)
	perTest := h.Sum64()
	return rand.New(rand.NewPCG(Seed^perTest, Seed^(perTest<<1)))
}
Tests use NewRand as their source of randomness. Each test also uses a different seed, which is derived from the same original seed. This prevents parallel tests from generating the same random values. While this isn't a problem for isolated tests, some tests share a common resource, e.g. a server or database. Using the same seed could cause them to pick the same resources, leading to collisions and tests corrupting other tests. [7]
(Almost) Deterministic Simulation Testing
After I finished with the unit tests, I moved on to implement DST. Cardinal wasn't initially designed with DST in mind, so it took quite a few refactors to get it mostly there. Almost all of the non-deterministic parts are stubbed out, except for the concurrent systems scheduler.
It's a bit difficult to replace, so I just left it there. [8] To guard against scheduler bugs that may not be 100% reproducible, I wrote extra tests to verify the scheduler's internal concurrency works correctly. This way, I can still have confidence in its behaviour without needing to fully eliminate its non-determinism.
I've always thought it would be hard to implement DST, considering my first reference was TigerBeetle's VOPR, which has a huge number of components and configurations. Then, I stumbled upon this blog post by Oskar Wickström about implementing a lightweight DST for Amp. [9] It showed me that at its core, DST isn't really a complicated concept, which made it feel much more approachable.
I had the same realisation while writing the model-based tests. They looked a whole lot like the lightweight DST from the blog post. And they are! DST is basically just a bigger model-based test, with non-deterministic components stubbed out. Since I had already written a lot of them, implementing the DST didn't turn out to be too hard. [10]
How it Works
I don't remember where I first learned this, but TigerBeetle is structured so that it moves in fixed ticks, where each tick time is injected. In production use, this uses a regular wall clock, but in simulation, it uses a simulated clock. Well, Cardinal is a tick-based game server! It was already well-prepared for DST in this regard.
Cardinal only updates its state on every call to Tick:
// NOTE: The World type is Cardinal.
func (w *World) Tick(ctx context.Context, timestamp time.Time) {
	w.currentTick.timestamp = timestamp

	// Collect the commands for this tick.
	_ = w.commands.Drain()

	// Tick ECS world.
	w.world.Tick()

	// Emit events.
	if err := w.events.Dispatch(); err != nil {
		w.tel.Logger.Warn().Err(err).Msg("errors encountered dispatching events")
	}

	// Publish snapshot every N ticks.
	if w.currentTick.height%uint64(w.options.SnapshotRate) == 0 {
		snapshotCtx, cancel := context.WithTimeout(ctx, 2*time.Second)
		w.snapshot(snapshotCtx, timestamp)
		cancel()
	}

	// Increment tick height.
	w.currentTick.height++
}
In production use, Cardinal's StartGame method handles its lifecycle. This basically runs an infinite loop, calling Tick every N seconds. For the simulation test, all we have to do is replace StartGame and have the test harness control Cardinal using Tick.
Here's a simplified version of the DST harness:
func RunDST(t *testing.T, setup func(*World)) {
	prng := testutils.NewRand(t)
	cfg := newDSTConfig(prng)
	fix := newDSTFixture(t, cfg, setup)

	// Add the world's commands as operations in the DST config.
	cfg.addCommandsAsOperations(fix.world)
	cfg.log(t) // Log the test config

	// The main test loop:
	tick := 0
	for tick < cfg.Ticks {
		op := testutils.RandWeightedOp(prng, cfg.OpWeights)
		switch {
		case op == opTick:
			timestamp := time.Unix(int64(tick), 0)
			fix.world.Tick(context.Background(), timestamp)
			// Assert structural ECS invariants after every tick.
			ecs.CheckWorld(t, fix.world.world)
			tick++
		case strings.HasPrefix(op, opCommandPrefix):
			cmdName := strings.TrimPrefix(op, opCommandPrefix)
			cmd := fix.randCommand(t, prng, cmdName)
			require.NoError(t, fix.world.commands.Enqueue(cmd))
		case op == opRestart:
			fix.world.reset()
		case op == opSnapshotRestore:
			fix.world.reset()
			require.NoError(t, fix.world.restore(context.Background()))
		}
	}

	// Final validation after all randomized operations complete.
	ecs.CheckWorld(t, fix.world.world)
}
This test does a few things:
- Creates a test configuration with random values. There are only a few at the moment, and no fancy fault injection, but it's a start.
- Initialises Cardinal with that configuration and replaces non-deterministic components (e.g. NATS handlers, snapshot storage) with deterministic versions that contain assertions. The non-deterministic parts are tested separately in integration tests.
- Registers systems using the user-provided setup function. I'll talk more about this soon.
- Extracts type information from the command types used in the systems using reflection. We'll use this to generate random commands to send to Cardinal.
- Runs for N ticks (configurable) and performs a random operation each time.
- Switches between several operations: tick the world, send a random command, reset the world, and restore from a snapshot. The tick operation is always enabled, but the rest are random (as in swarm testing).
- Validates the internal state every tick and at the end of the test.
While implementing this, there was a major design choice I had to make. Cardinal is a framework, and its input is user-provided game logic in the form of systems. I'll refer to this set of systems as just the "game". The test harness above can only verify Cardinal's internal invariants. It cannot verify whether the game is logically correct. I had to make a choice between:
- Keeping the harness specific to a single, hand-coded game designed to test Cardinal. This allows us to add more instrumentation to verify some extra invariants and also to validate the game logic, but it also means the harness must be a private type to access the required internal fields.
- Making the harness game-agnostic and public. This gives a few benefits, namely, users can use it with their games, and we also get free inputs to test Cardinal on. We can only verify a subset of all our properties, but we're more likely to get something we didn't expect.
In the end, I went with the second design, mainly because we can explore more of the state space with it. If we had stuck with a single input, the harness might be biased.
So now, we have an importable DST harness that users can use, but not to test their game logic, rather to test that Cardinal works correctly with it. Here's what its usage looks like:
import "github.com/argus-labs/world-engine/pkg/cardinal"

func YourTestDST(t *testing.T) {
	cardinal.RunDST(t, func(w *cardinal.World) {
		// Register your systems here...
		cardinal.RegisterSystem(w, system.PlayerSpawnerSystem, cardinal.WithHook(cardinal.Init))
		cardinal.RegisterSystem(w, system.CreatePlayerSystem)
		cardinal.RegisterSystem(w, system.RegenSystem)
		cardinal.RegisterSystem(w, system.AttackPlayerSystem)
	})
}
Expectations VS Reality
Initially, I tested with a simple game that I asked AI to write. I was really excited because of all the hype around DST, like how it seemingly found all the bugs in FoundationDB. The first time I ran it, the test froze. When I reran it, it froze again! My DST had found its first bug!
After some investigation, I found out that it was a bug in Cardinal's scheduler. It's a deadlock that happens only when the world is reset after an odd number of ticks. A ridiculous bug! I probably would've caught this myself in local development, but the fact that the simulator caught it this fast was amazing.
Then I ran the test again, a lot of times, each iteration with a different seed and configuration. Nothing. Sometimes I got a failure, but it turned out to be a bug in the harness rather than in Cardinal. I tried tuning the test configuration a little bit, but still nothing. At that point, I wasn't sure what was wrong. Was my harness too simple to find more bugs? Or was my code just bug-free? Was it finding bugs but I didn't have the correct assertions to catch them?
I don't know the answer. And it seems I wasn't the only one with this issue. I was watching the BugBash Podcast episode with the co-authors of DDIA (2nd edition), and Chris Riccomini talked about his experience adding DST to SlateDB. He mentioned that DST was high-investment but low-return, that it didn't explore that much state, and that it only found three known bugs. This matched perfectly with my (short) experience!
Chris also mentioned that a better approach might be to apply DST on subcomponents much earlier in the process instead of going all in with full-system DST, which would've given more value much earlier. Coincidentally, this is what I did (if you count the model-based tests as DST), and it also aligns with my experience. I had caught several bugs from those tests. It's probably why my DST didn't catch many bugs, but that's just my wishful thinking.
I'll also add this quote from Will Wilson, the CEO of Antithesis:
The biggest challenge of DST in my experience is that tuning all the random distributions, the parameters of your system, the workload, the fault injection, etc. so that it produces interesting behaviour is very challenging and very labor intensive. As with fuzzing or PBT, it's terrifyingly easy to build a DST system that appears to be doing a ton of testing, but actually never explores very much of the state space of your system. At FoundationDB, the vast majority of the work we put into the simulator was an iterative process of hunting for what wasn't being covered by our tests and then figuring out how to make the tests better. This process often resembles science more than it does engineering.
I'm still not giving up, though. My DST harness is still small, which means it can only grow from here. I already have several improvements in mind, so I'll stick with it for a bit longer.
End-to-End Fuzzing
The final tests I added were the E2E tests. They follow a similar structure to the DST: we set up Cardinal with user-provided game logic and bombard it with randomised commands. The biggest difference is that they run with the same non-deterministic configuration as production.
The actual harness code is pretty gnarly, so here's an extremely simplified version of it:
func RunE2E(t *testing.T, setup E2ESetupFunc) {
	prng := testutils.NewRand(t)
	cfg := newE2EConfig()
	fix := newE2EFixture(t, setup)

	// Add the world's commands as operations in the E2E config.
	cfg.addCommandsAsOperations(fix.world)
	cfg.log(t) // Log the test config

	// Start the world's run loop in a background goroutine.
	ctx, cancel := context.WithCancel(context.Background())
	runErr := make(chan error, 1)
	go func() {
		runErr <- fix.world.run(ctx)
	}()

	deadline := time.After(cfg.Duration)
sendLoop:
	for {
		select {
		case <-deadline:
			// Stop the world and wait for its run loop to return.
			cancel()
			err := <-runErr
			if err != nil && !errors.Is(err, context.Canceled) {
				t.Fatalf("world.run returned unexpected error: %v", err)
			}
			break sendLoop
		default:
			// Pick a weighted random command and send it to the running world.
			name := testutils.RandWeightedOp(prng, cfg.CommandWeights)
			cmd := fix.randCommand(t, prng, name)
			fix.sendCommand(t, cmd)
		}
	}

	// Final validation after the world has fully stopped.
	ecs.CheckWorld(t, fix.world.world)
}
All this does is run Cardinal in a separate goroutine and send it randomised requests. Unlike in DST, where you control the number of ticks to run, here you control the test duration. Since it uses wall clock time, it'll run much slower than other tests. This is why I made the E2E tests disabled by default. To run them, pass the -e2e.run flag to go test.
The usage API is also similar to DST, but instead of passing a function to register systems, you'd return the actual cardinal.World instead. In DST, we randomise Cardinal's config, but in the E2E test we want it to have the same config as production to test more realistically:
import (
	"testing"

	"github.com/argus-labs/world-engine/pkg/cardinal"
	"github.com/stretchr/testify/require"
)

func TestE2E(t *testing.T) {
	cardinal.RunE2E(t, func() *cardinal.World {
		world, err := cardinal.NewWorld(cardinal.WorldOptions{
			// Your world configuration...
		})
		require.NoError(t, err)
		// Register systems here...
		return world
	})
}
What's Next?
Phew… That was long. I cut out a lot of things, including a section on testing concurrent code, but I still ended up with almost 7000 words. If you read the entire thing, big props to you.
Here's the source code if you want to take a closer look. After all this, I'm even more excited about software correctness. There are so many more things I want to try, e.g. improve my DST harness, test using a deterministic hypervisor :), formal methods, etc.

[1] Only the tests were slop, I actually put a lot of thought into the code itself!
[2] I like comparing PBT and fuzzing to convergent evolution, which is when species of different lineages evolve similar features. PBT came from functional programming, fuzzing came much earlier from systems/security. The approach is similar, but their focus is different (program correctness vs security issues). Yes, I got this from Antithesis.
[3] I recommend reading Antithesis's PBT primer first, and then F# for Fun and Profit's PBT series for some more practical tips.
[4] My real reason for adopting DST is just because I think it's cool. It's what got me interested in TigerBeetle in the first place.
[5] There are some approaches that address this problem (somewhat), e.g. metamorphic testing. It solves this by using oracles derived from the tests themselves (through special properties called metamorphic relations), but it's not a universal solution.
[6] This is probably why shrinking in PBT works. While the original intent is to create a simpler representation for debugging, the fact remains that the same bugs found in larger inputs can usually be reproduced in smaller ones.
[7] I tried to keep the explanation short earlier, so here's a concrete example. Some of my integration tests share a single, in-memory NATS server. When tests used the same seed, I noticed they would sometimes fail. The reason for this is that multiple tests were accessing the same NATS subjects, even though the tests were meant to be isolated. Using a different seed for each test solves this.
Also, before the rewrite, each test would create its own NATS server. This is simpler, but it fails when there are a lot of parallel tests. The tests would use up all memory in CI (and sometimes on my local machine) and OOM.
[8] It's technically easy to replace, and I'm also considering replacing it with a simpler, sequential version, but first, I need to do some benchmarks with a lot of different system schedules to understand the implications. It'll take a while, so I decided to just leave it there so I could get the DST set up first.
[9] Which is also my coding agent of choice :)
[10] There's another way to do DST in Go. This blog post by Polar Signals explains how they implemented DST in Go by compiling their program to WASM and (slightly) modifying the Go runtime.