Three Tests I Write Before I Let an Agent Touch a Function

I wrote a post on Monday about the habits I stopped after a year of writing code with an agent in the loop. The line that drew the most replies was the one where I admitted I no longer read every line of agent-written code at the level a clean-room reviewer would. The follow-up question was the obvious one: what does "audit the boundaries" actually look like in practice.

The honest answer is that the audit step moved out of the diff review and into the test file. Before I let the agent touch a function with non-trivial behavior, I write three tests. They are not exhaustive. They catch the three specific kinds of bug agents have made me ship.

1. The shape test, with one impossible input

The first test pins down the input and output schema. Not the happy path. The impossible path.

The function in OpenClaw that ranks options spread candidates takes a list of contracts and returns a sorted subset. The shape test calls it with an empty list, with a single contract, and with one contract that has a None field where the schema expects a float. The first two should return an empty list and a single-element list. The third should raise a clear, named exception.

The reason I write this first is that agents are very good at writing code that handles the happy path and unsurprisingly bad at writing code that fails loudly on bad inputs. The default behavior is to silently coerce a None to zero, or to skip the offending row, or to return an empty result without any indication of why. Each one is a future bug that will show up at 09:32 Pacific on a Monday when the upstream feed includes a contract with a missing greek and the spread engine quietly returns "no candidates" for two hours.

The shape test makes the contract explicit: bad input is loud, not absorbed. Once the test is in place, the agent writes the function with explicit guards. Without the test, the agent writes the function without them roughly four times out of five.
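A minimal sketch of the shape test described above. The names Contract, InvalidContractError, and rank_candidates are hypothetical stand-ins, not OpenClaw's actual code, and the sort by absolute delta is only a placeholder for the real ranking logic. The three tests are the point.

```python
from dataclasses import dataclass
from typing import Optional


class InvalidContractError(ValueError):
    """Hypothetical named exception for a contract with a missing field."""


@dataclass(frozen=True)
class Contract:
    symbol: str
    # The schema expects a float; Optional only so a test can build a bad row.
    delta: Optional[float]


def rank_candidates(contracts: list[Contract]) -> list[Contract]:
    # Placeholder ranking: fail loudly on bad input, then sort by |delta|.
    for c in contracts:
        if c.delta is None:
            raise InvalidContractError(f"{c.symbol}: delta is None")
    return sorted(contracts, key=lambda c: abs(c.delta), reverse=True)


def test_empty_list_returns_empty_list():
    assert rank_candidates([]) == []


def test_single_contract_returns_single_element_list():
    only = Contract("SPY", 0.30)
    assert rank_candidates([only]) == [only]


def test_none_field_raises_named_exception():
    bad = Contract("SPY", None)
    try:
        rank_candidates([bad])
    except InvalidContractError as exc:
        # The message should say which field was missing.
        assert "delta" in str(exc)
    else:
        raise AssertionError("expected InvalidContractError, got a result")
```

The third test fails against any implementation that coerces None to zero or skips the row, which is exactly the behavior the test exists to rule out.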

2. The boundary test

The second test exercises the boundaries of every range the function reasons about. Empty, exactly one, exactly the threshold, one below the threshold, one above it. If the function has a numeric cap at ten positions, I test 9, 10, and 11. If it has a date filter for DTE between 7 and 40, I test 6, 7, 40, and 41. If it sorts, I test the case where two elements tie on the sort key.

Boundaries are where agents quietly use the wrong comparison operator. I have shipped >= where I meant >, and < where I meant <=, both written by an agent and both passed by my "this looks right" review. The boundary test is the one piece of test code I have written where I can see, in numbers, that it pays for itself. Three of the last six bugs I caught before shipping were boundary errors in agent-written code.

The cost of the test is small. A parametrized test with five values runs in single-digit milliseconds. The cost of the bug it catches, when the bug is in a position-sizing function in a trading system, is not bounded.
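A sketch of the boundary cases from this section written as table-driven tests. The helper names are hypothetical, and two readings are assumptions: the DTE filter is treated as inclusive at both ends, and "a cap at ten positions" is read as "a new position may open only while fewer than ten are held." A different contract would change the expected column, which is the reason to write it down.

```python
MIN_DTE = 7
MAX_DTE = 40
MAX_POSITIONS = 10


def dte_in_range(dte: int) -> bool:
    # Assumed inclusive on both ends: 7 and 40 pass, 6 and 41 do not.
    return MIN_DTE <= dte <= MAX_DTE


def can_open_position(open_positions: int) -> bool:
    # Assumed meaning of the cap: at most ten positions held at once.
    return open_positions < MAX_POSITIONS


DTE_CASES = [(6, False), (7, True), (40, True), (41, False)]
POSITION_CASES = [(9, True), (10, False), (11, False)]


def test_dte_boundaries():
    for dte, expected in DTE_CASES:
        assert dte_in_range(dte) is expected, f"dte={dte}"


def test_position_cap_boundaries():
    for held, expected in POSITION_CASES:
        assert can_open_position(held) is expected, f"held={held}"


def test_ties_keep_input_order():
    # Two rows tie on the sort key; Python's sort is stable, so the
    # tied rows must come back in the order they went in.
    rows = [("A", 0.30), ("B", 0.30), ("C", 0.10)]
    ranked = sorted(rows, key=lambda r: r[1], reverse=True)
    assert [name for name, _ in ranked] == ["A", "B", "C"]
```

Swapping <= for < in dte_in_range, or < for <= in can_open_position, makes exactly one row of the table fail, so the failure points directly at the wrong operator.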

3. The invariant test, for things types do not protect

The third test is the one I add last and the one that has paid the most. It checks an invariant the type system does not express.

A simple example: the function that builds a bracket order in OpenClaw returns a tuple of entry, stop, take_profit. The types say "three Decimals." What the types do not say is that for a long order, stop must be strictly less than entry, which in turn must be strictly less than take_profit. The invariant test asserts that ordering, on a few representative inputs.

The first time I wrote this test, the agent had already been writing bracket orders for me, and the function had been in production for three weeks. The test failed on the third input I tried, on a low-volatility day where the stop and the entry rounded to the same Decimal under the rounding rule the agent had chosen. Stop equal to entry is not a bracket. It is a market order with extra steps, and the broker would happily accept it.

I now write an invariant test for any function whose output is structured. It does not have to be elaborate. One assertion per invariant. The point is to write down, in executable form, the things a reader could not infer from the type signature.
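A sketch of the ordering invariant for a long bracket, including a low-volatility input where rounding collapses the stop onto the entry. Everything here is hypothetical: build_long_bracket, InvalidBracketError, the one-cent tick, and the half-up rounding rule are assumptions for illustration, and raising on a collapsed bracket is one possible policy, not necessarily the fix that went into OpenClaw.

```python
from decimal import Decimal, ROUND_HALF_UP

TICK = Decimal("0.01")  # assumed price increment


class InvalidBracketError(ValueError):
    """Hypothetical error for a bracket that violates stop < entry < target."""


def _round(price: Decimal) -> Decimal:
    return price.quantize(TICK, rounding=ROUND_HALF_UP)


def build_long_bracket(
    entry: Decimal, stop_distance: Decimal, target_distance: Decimal
) -> tuple[Decimal, Decimal, Decimal]:
    entry_r = _round(entry)
    stop = _round(entry_r - stop_distance)
    take_profit = _round(entry_r + target_distance)
    # The invariant the type signature cannot express.
    if not stop < entry_r < take_profit:
        raise InvalidBracketError(
            f"need stop < entry < take_profit, got {stop}, {entry_r}, {take_profit}"
        )
    return entry_r, stop, take_profit


def test_long_bracket_is_strictly_ordered():
    cases = [
        (Decimal("100.00"), Decimal("1.50"), Decimal("3.00")),
        (Decimal("12.345"), Decimal("0.25"), Decimal("0.50")),
    ]
    for entry, stop_distance, target_distance in cases:
        entry_r, stop, take_profit = build_long_bracket(
            entry, stop_distance, target_distance
        )
        assert stop < entry_r < take_profit


def test_stop_that_rounds_onto_entry_is_rejected():
    # 100.00 - 0.004 = 99.996, which rounds back up to 100.00.
    try:
        build_long_bracket(Decimal("100.00"), Decimal("0.004"), Decimal("0.008"))
    except InvalidBracketError:
        pass
    else:
        raise AssertionError("a stop equal to entry must not be returned")
```

The second test is the one that matters. Without the guard, the function returns three valid Decimals and nothing in the types objects.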

The pattern

These three tests are not a test plan. They are a starting kit. Once they are in place, I let the agent write or rewrite the function freely. The agent is good. The tests are the part I trust myself to write and not the part I want to delegate, because the tests are the contract, and the contract is the thing I am actually paid to design.

The other thing the three tests do is force me to know what I want before I ask. The test for the impossible input requires me to know which inputs are impossible. The boundary test requires me to know where the boundaries are. The invariant test requires me to write down the relationship between fields. By the time the tests pass, the function is almost an afterthought. Most of the work is in the contract.

A year ago I would have called this overhead. Now I call it the work. The agent writes the rest.
