Three Tests I Write Before I Let an Agent Touch a Function
Suneet Malhotra
Apr 29, 2026
I wrote a post on Monday about the habits I stopped after a year of writing code with an agent in the loop. The line that drew the most replies was the one where I admitted I no longer read every line of agent-written code at the level a clean-room reviewer would. The follow-up question was the obvious one: what does "audit the boundaries" actually look like in practice.
The honest answer is that the audit step moved out of the diff review and into the test file. Before I let the agent touch a function with non-trivial behavior, I write three tests. They are not exhaustive. They catch the three specific kinds of bug agents have made me ship.
1. The shape test, with one impossible input
The first test pins down the input and output schema. Not the happy path. The impossible path.
The function in OpenClaw that ranks options spread candidates takes a list of contracts and returns a sorted subset. The shape test calls it with an empty list, with a single contract, and with one contract that has a None field where the schema expects a float. The first two should return an empty list and a single-element list. The third should raise a clear, named exception.
The reason I write this first is that agents are very good at writing code that handles the happy path and unsurprisingly bad at writing code that fails loudly on bad inputs. The default behavior is to silently coerce a None to zero, or to skip the offending row, or to return an empty result without any indication of why. Each one is a future bug that will show up at 09:32 Pacific on a Monday when the upstream feed includes a contract with a missing greek and the spread engine quietly returns "no candidates" for two hours.
The shape test makes the contract explicit: bad input is loud, not absorbed. Once the test is in place, the agent writes the function with explicit guards. Without the test, the agent writes the function without them roughly four times out of five.
2. The boundary test
The second test exercises the boundaries of every range the function reasons about. Empty, exactly one, exactly the threshold, exactly threshold-minus-one. If the function has a numeric cap at ten positions, I test 9, 10, and 11. If it has a date filter for DTE between 7 and 40, I test 6, 7, 40, and 41. If it sorts, I test the case where two elements tie on the sort key.
Boundaries are where agents quietly use the wrong comparison operator. I have shipped >= where I meant >, and < where I meant <=, both written by an agent and both passed by my "this looks right" review. The boundary test is the one piece of test code I have written where I can see, in numbers, that it pays for itself. Three of the last six bugs I caught before shipping were boundary errors in agent-written code.
The cost of the test is small. A parametrized test with five values runs in single-digit milliseconds. The cost of the bug it catches, when the bug is in a position-sizing function in a trading system, is not bounded.
3. The invariant test, for things types do not protect
The third test is the one I add last and the one that has paid the most. It checks an invariant the type system does not express.
A simple example: the function that builds a bracket order in OpenClaw returns a tuple of entry, stop, take_profit. The types say "three Decimals." What the types do not say is that for a long order, stop must be strictly less than entry must be strictly less than take_profit. The invariant test asserts that ordering, on a few representative inputs.
The first time I wrote this test, the agent had already been writing bracket orders for me, and the function had been in production for three weeks. The test failed on the third input I tried, on a low-volatility day where the stop and the entry rounded to the same Decimal under the rounding rule the agent had chosen. Stop equal to entry is not a bracket. It is a market order with extra steps, and the broker would happily accept it.
I now write an invariant test for any function whose output is structured. It does not have to be elaborate. One assertion per invariant. The point is to write down, in executable form, the things a reader could not infer from the type signature.
The pattern
These three tests are not a test plan. They are a starting kit. Once they are in place, I let the agent write or rewrite the function freely. The agent is good. The tests are the part I trust myself to write and not the part I want to delegate, because the tests are the contract, and the contract is the thing I am actually paid to design.
The other thing the three tests do is force me to know what I want before I ask. The test for the impossible input requires me to know which inputs are impossible. The boundary test requires me to know where the boundaries are. The invariant test requires me to write down the relationship between fields. By the time the tests pass, the function is almost an afterthought. Most of the work is in the contract.
A year ago I would have called this overhead. Now I call it the work. The agent writes the rest.
Share this post
You Might Also Like
The Number My Model Is Not Allowed to Know
There is a rule I enforce across every agent I run, and it has nothing to do with how good the model is. The model writes the words. It never computes the numbers.
AI & AutomationThe Agent Edit I Almost Merged
An agent rewrote a forty-line function in my risk module. The diff was clean. The tests passed. The reason one of the tests passed is what I almost missed.
Quantitative TradingThe Ninety Minutes My Engine Sits Out
My stock engine refuses to open any new position after 2:30 PM ET. It surrenders the most active hour of the day on purpose. Here is the arithmetic behind the refusal.
Career & Best PracticesThe Numbers I Used to Ask You to Trust
My April posts reported measured numbers you had to take on faith. My recent ones derive every figure from public config. The change was not discipline. It was topology.
Latest Blog Posts
The Ninety Minutes My Engine Sits Out
My stock engine refuses to open any new position after 2:30 PM ET. It surrenders the most active hour of the day on purpose. Here is the arithmetic behind the refusal.
The Numbers I Used to Ask You to Trust
My April posts reported measured numbers you had to take on faith. My recent ones derive every figure from public config. The change was not discipline. It was topology.
Five Up, Three Down, Even Money
My bracket risks 3% to make 5%, which reads like a favorable bet. On a price with no drift it is exactly break-even, and the reason is a theorem, not a coincidence.
Related Tools & Demos
Multi-Model LLM Harness
One interface to call any AI model — capability routing, fallback chains, budgets, circuit breakers, and a quality feedback loop. A practical architecture pattern write-up.
Automated Trading System
Multi-engine trading platform with real-time risk management, regime-based strategy selection, and automated order execution.
View Source Code →Personal Health Analytics
Multi-modal health data platform integrating wearables, lab results, and lifestyle tracking with predictive habit modeling.
View Source Code →
Stay in the Loop
Get weekly insights on AI-driven QA, engineering leadership, and automation strategies.
No spam, ever. Unsubscribe anytime.