The Testing Techniques We Couldn't Afford (Until Now)
Sorry, But There's Gonna Be Some Math On This Quiz
Twenty years ago, I was an early adopter of Extreme Programming. Not all of it worked for me—pair programming requires inflicting yourself on a colleague in real-time, and no one deserves to be subjected to me in work mode. Even my wife flees the room when I’m in deep codespace. But the testing philosophy resonated deeply.
The idea that you could write BETTER code by writing MORE code seemed almost magical. Test-first development, comprehensive coverage, constant refactoring enabled by a safety net of tests. Kent Beck and the XP crowd were preaching a gospel that felt true even when it seemed impractical.
I was a very fast and not very meticulous coder back then. XP’s testing discipline forced me to slow down and think. It made me better. Tests as documentation. Tests as design tools. Tests as courage—the confidence to change anything because you’d know immediately if you broke something.
But it didn’t scale. The economics killed it.
With developer costs at $150/hour, doubling or tripling your code volume to include comprehensive tests was a luxury most teams couldn’t sustain. We’d start projects with good intentions—100% coverage, test-first discipline, all the toys. Then deadlines hit. Features needed shipping. The tests that were “nice to have” got skipped. The refactoring that tests were supposed to enable got deferred. Technical debt accumulated.
The dirty secret of XP was that it worked beautifully for small, disciplined teams on greenfield projects with enlightened management. For everyone else—which was most of us—it was an aspirational practice that slowly degraded into “write some tests when you remember to.”
Now, two decades later, the economics have inverted. Product code is nearly free. Test code is even cheaper than that. And suddenly all those XP ideas about testing discipline don’t just make sense—they’re becoming the baseline.
But we can go much further than XP ever imagined.
The Testing We Couldn’t Afford
Software developers have long known how to test code better. We’ve had the theory, we’ve had the tools, we’ve had the methodologies. What we didn’t have was the time or the budget to actually do it properly.
At $150/hour, every testing decision was a brutal economic calculus. Write unit tests? Sure, for the critical paths. Aim for 100% coverage? That’s a nice idea, but we’ve got features to ship. Property-based testing? Mutation testing? Those sound great in conference talks, but let’s be realistic about our velocity targets.
The result was that we settled for “good enough” testing, where “good enough” often ended up meaning “doesn’t write crash reports to the logs very often”. We wrote the tests we could afford, caught the bugs we thought of, and shipped with our fingers crossed. When production broke at 3am, we’d write a regression test for that specific bug and move on. Technical debt accumulated not just in our code, but in our test coverage.
Now let me show you what becomes possible when testing is nearly free.
Unit Test Coverage: Obvious Improvement Is Obvious
Let’s start with the most basic: of course you should aim for 100% coverage on your unit test suites. What possible excuse do you have not to do so?
I can already hear the objections. “100% coverage doesn’t mean the tests are good.” “You can game the metrics.” “We need to balance testing with feature velocity.” These are all true statements, and they were all reasonable arguments when writing tests consumed significant developer time.
But here’s the thing: at near-zero cost, all those tradeoffs disappear. Yes, coverage metrics can be gamed. But with AI writing your tests, gaming the metrics is actually harder than just testing the code properly. Yes, 100% coverage doesn’t guarantee correctness. But it’s a hell of a lot better than 60% coverage, and it costs you nothing.
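If you want the bar to be non-negotiable rather than aspirational, most coverage tools will enforce it for you. A sketch using Python’s coverage.py (the settings are real; the project layout is whatever yours is):

```toml
# pyproject.toml — fail the build below full coverage (coverage.py settings)
[tool.coverage.report]
fail_under = 100
show_missing = true
```

Wire `coverage report` into CI and a dip below the threshold fails the build instead of starting an argument.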
One amusing problem with aiming for 100% code coverage is that, at the time of this writing, agentic coding tools will often push back on the need for 100% coverage, because they were trained on a corpus full of software developers arguing against the need for full code coverage. These sorts of “I learned it from you!” problems are sometimes annoying and sometimes amusing, but usually you can push through them with a bit of prompting.
Future developers—the ones working in five years—are going to hear about us testing while only occasionally checking coverage and think of us as hopelessly backward. We’ll be seen the way we now see developers who didn’t use version control, or who deployed by manually copying files to servers. “You mean to say you shipped code to production that you hadn’t even executed? How did that possibly work?”
Property-Based Testing: Finding Bugs You Didn’t Think Of
Property-based testing has been around since QuickCheck was introduced in Haskell in 1999. The idea is elegant: instead of writing tests with specific inputs and expected outputs, you write properties that should always hold true, and the testing framework generates hundreds or thousands of random inputs to try to violate those properties.
For a Java parser I had written for a side project, instead of writing:
```
assert parse("public class Foo {}") == expected_ast
```
I specified properties like:
```
for any valid Java class:
- parsing should succeed
- the resulting AST should be valid
- prettyprinting and re-parsing should yield the same AST
```
Claude and I then wrote a framework that generated thousands of random Java classes and checked these properties. It found edge cases I would never have thought to test: classes with deeply nested generics, unusual combinations of annotations, tricky interactions between inner classes and inheritance.
Property-based testing is a way of converting electricity into humility. Every time generating a few hundred thousand test cases warms up your laptop and scorches your thighs, you absolutely will find a bug.
We’ve known this technique was better for 25 years. Why didn’t we use it? Because writing a good property-based test suite required deep understanding of the problem domain, careful thought about invariants, and significant time investment. Most teams couldn’t justify the cost.
Now? “Hey Claude, add property-based tests for this module using the invariants that seem appropriate.” Fifteen minutes later, you’ve got comprehensive coverage that would have taken a human developer days or weeks to write, covering more invariants than you would have thought of yourself.
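To make the shape of the technique concrete, here is a minimal hand-rolled property check in Python. A real suite would lean on a framework like Hypothesis; the JSON round-trip property here stands in for whatever invariants your own module has:

```python
import json
import random
import string

def random_json(depth=0):
    """Generate a random JSON-representable value (floats excluded to
    sidestep NaN and rounding noise)."""
    kinds = ["null", "bool", "int", "str"]
    if depth < 3:
        kinds += ["list", "dict"]
    kind = random.choice(kinds)
    if kind == "null":
        return None
    if kind == "bool":
        return random.choice([True, False])
    if kind == "int":
        return random.randint(-10**6, 10**6)
    if kind == "str":
        return "".join(random.choices(string.ascii_letters, k=random.randint(0, 8)))
    if kind == "list":
        return [random_json(depth + 1) for _ in range(random.randint(0, 4))]
    return {
        "".join(random.choices(string.ascii_lowercase, k=4)): random_json(depth + 1)
        for _ in range(random.randint(0, 4))
    }

# The property: serializing and re-parsing any value reproduces it exactly.
for _ in range(10_000):
    value = random_json()
    assert json.loads(json.dumps(value)) == value
print("10,000 round-trips held")
```

The framework’s job—shrinking failing inputs, replaying known-bad seeds, tuning the generator distribution—is exactly the part you no longer have to hand-roll.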
Mutation Testing: Checking That Your Tests Actually Work
Here’s a dirty secret about test suites: sometimes they pass even when they shouldn’t. You write a test, it goes green, you move on. But does that test actually validate anything meaningful, or is it just checking that the code runs without crashing?
Mutation testing solves this by introducing small bugs into your code—“mutations”—and checking whether your tests catch them. If you change > to >= and all your tests still pass, or if you add .tail() in the middle of a stream of list operations and everything’s still green, you’ve found a gap in your coverage. Your tests aren’t actually verifying the boundary conditions they should be.
Mutation testing has been around since the 1970s. It’s one of the most effective ways to measure test suite quality. The problem? It’s computationally expensive and the results require careful analysis to be useful. A typical mutation testing run might take hours and generate hundreds of potential issues that need human review.
This made mutation testing a luxury that only the most critical codebases could afford. Until now.
With AI assistance, you can not only run mutation testing regularly, but have an agent analyze the results (or, better, build specialized scripts to repeatably analyze your results), prioritize the actual gaps in coverage, and even propose new tests to catch the missed mutations. What used to be a quarterly exercise for high-value code becomes a routine part of your development workflow.
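The mechanism itself is simple enough to sketch in a few lines. This toy version uses a hypothetical free_shipping function; real tools like mutmut (Python) or PIT (Java) generate mutations far more systematically, but the kill-or-survive logic is the same:

```python
# Mutate the source, re-exec each mutant, count how many the suite "kills".
SRC = "def free_shipping(total):\n    return total > 50\n"

MUTATIONS = [(" > ", " >= "), (" > ", " < "), ("> 50", "> 51")]

def suite_passes(src, tests):
    ns = {}
    exec(src, ns)  # compile the (possibly mutated) function
    try:
        tests(ns["free_shipping"])
        return True        # suite passed → mutant survived
    except AssertionError:
        return False       # suite failed → mutant killed

def kill_count(tests):
    return sum(not suite_passes(SRC.replace(old, new, 1), tests)
               for old, new in MUTATIONS)

def weak_tests(f):         # never probes the boundary at 50
    assert f(100) is True
    assert f(0) is False

def strong_tests(f):       # exercises exactly the boundary
    assert f(51) is True
    assert f(50) is False

print("weak suite kills", kill_count(weak_tests), "of", len(MUTATIONS))
print("strong suite kills", kill_count(strong_tests), "of", len(MUTATIONS))
```

The weak suite lets two of the three mutants survive; adding the boundary checks kills all three. That surviving-mutant list is precisely the “gap report” you want an agent to triage.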
Parametric End-to-End Testing: The Quadratically Awesome Test Suite
Here’s a testing pattern I’ve come to love: parametric end-to-end tests for transformation pipelines. A large amount of software consists of taking some input from the user, applying a bunch of transformations to it, and returning an output. These can be large operations like compilers or query translation pipelines, or small stuff like serialization. Anything that can be structured as a series of mostly-standalone computational steps.
The idea of parametric end-to-end testing is straightforward. At its simplest, you have a directory full of input files and a parallel directory of expected output files. Your test runner processes each input, compares the output to the expected result, and reports any mismatches. When you get a bug report, you add a new test case: drop the input file in the inputs directory, run your code (once it’s fixed), and save the output as the new expected result. For extra bonus nachos, every new buggy input gives you a test case which you can spin up a debugger on, so the end-to-end test has value in issue analysis, not just in preventing regression.
Over time, you accumulate hundreds or thousands of test cases. And here’s where it gets interesting: as your test runner gets smarter—better error messages, better diff visualization, better reporting, more output conditions checked—every test case gets better. And as you add more test cases, the value of each improvement to the test runner multiplies across all of them.
The test suite becomes quadratically awesome: N test cases × M features in your test runner = N×M value at N+M cost.
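A minimal sketch of such a runner in Python; the directory layout and the transform() under test are placeholders for your real pipeline:

```python
from pathlib import Path
import tempfile

def transform(text: str) -> str:
    # Stand-in for the real transformation pipeline under test.
    return text.strip().upper()

def run_suite(inputs_dir: Path, expected_dir: Path) -> list[str]:
    """Compare transform(input) against the golden file of the same name."""
    failures = []
    for input_path in sorted(inputs_dir.glob("*.txt")):
        expected = (expected_dir / input_path.name).read_text()
        actual = transform(input_path.read_text())
        if actual != expected:
            # Every improvement to this reporting (diffs, context, triage
            # hints) multiplies across every accumulated test case.
            failures.append(f"{input_path.name}: expected {expected!r}, got {actual!r}")
    return failures

# Demo fixture: one passing case and one deliberately stale golden file.
root = Path(tempfile.mkdtemp())
(root / "inputs").mkdir()
(root / "expected").mkdir()
(root / "inputs" / "a.txt").write_text("hello")
(root / "expected" / "a.txt").write_text("HELLO")
(root / "inputs" / "b.txt").write_text("world")
(root / "expected" / "b.txt").write_text("wrold")  # stale expectation
failures = run_suite(root / "inputs", root / "expected")
print(failures)  # the b.txt mismatch is the only failure
```

Adding a regression test is now just dropping two files into two directories, which is why these suites actually grow.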
With AI assistance, you can:
- Generate comprehensive test cases covering edge cases you’d never think of
- Build a sophisticated test runner with excellent diagnostics
- Automatically expand your test suite whenever you fix a bug
- Continuously improve the runner without guilt about the time investment
The parametric test suite goes from “nice idea we never quite implemented” to “powerful regression prevention that gets better every week.”
Metamorphic Testing: The Technique Nobody’s Heard Of
Metamorphic testing is perhaps the most underutilized testing technique in software engineering. This is in spite of the fact that probably hundreds of developers (myself included) have reinvented it over the years. The core insight: even when you don’t know the correct output for a given input, you often know relationships between different inputs and outputs.
For example, with a sorting function:
- Sorting an already-sorted list should return the same list
- Sorting a list should not change its length, nor a histogram of its contents
- Sorting a reversed list should give the same result as sorting the original
You can verify these properties without ever checking that the sort is actually correct. If any of these metamorphic relations are violated, you’ve found a bug.
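Those three relations translate almost directly into code. A sketch in Python, using the built-in sorted as the system under test:

```python
import random
from collections import Counter

def check_sort_metamorphic(sort, trials=500):
    """Check relations between runs of `sort` without ever consulting a
    known-correct output."""
    rng = random.Random(0)
    for _ in range(trials):
        xs = [rng.randint(-100, 100) for _ in range(rng.randint(0, 30))]
        out = sort(xs)
        # Idempotence: sorting an already-sorted list changes nothing.
        assert sort(out) == out
        # Permutation: length and element histogram are preserved.
        assert Counter(out) == Counter(xs)
        # Reversal: sorting the reversed input gives the same result.
        assert sort(list(reversed(xs))) == out

check_sort_metamorphic(sorted)  # the real thing satisfies all three
print("all metamorphic relations held")
```

Swap in a buggy “sort” (say, one that returns its input unchanged) and the reversal relation fails almost immediately, with no oracle for correct output ever needed.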
Metamorphic testing is incredibly powerful for complex systems where correctness is hard to verify directly. Machine learning models, numerical simulations, compilers, graphics renderers—these all have metamorphic properties you can test even when you can’t easily verify the output directly.
But writing metamorphic tests requires creative thinking about invariants and relationships. It’s been relegated to academic papers and a few specialized domains. With AI assistance, you can ask “what metamorphic properties should this system satisfy?” and get a thoughtful analysis that identifies non-obvious relationships worth testing.
Architectural Testing: Making Your Intentions Executable
Here’s a testing technique that’s emerged more recently but addresses a problem as old as software itself: how do you ensure your architecture doesn’t rot over time?
You start a project with clean layering. Controllers shouldn’t call the database directly. Domain logic shouldn’t depend on infrastructure. Utilities shouldn’t depend on business logic. All web endpoints should have authentication and rate-limiting annotations. You document these rules. You discuss them in code reviews. You nag people about violations.
And slowly, inevitably, the architecture degrades. Someone needs a quick fix and adds a direct database call from a controller. Another developer imports a business domain class into a utility. Each violation is small, each has a justification, and collectively they turn your architecture into spaghetti.
In a perfect world, all of this would be specifiable in your type system and enforced by your compiler, with in-IDE error highlighting for architectural violations and quick-fixes to invert or abstract dependencies as necessary. The manifest lack of Scarlett Johansson in my lap at this time proves that this is not a perfect world.
Tools like ArchUnit (for Java) and similar frameworks in other languages let you write tests for architectural rules:
```
classes().that().resideInAPackage("..controller..")
    .should().onlyAccessClassesThat().resideInAnyPackage("..service..", "..dto..")
```
If a controller tries to access the database layer directly, the test fails. Your architecture rules become executable. Violations get caught in CI before they reach production.
With AI assistance, you can describe your architecture in plain language and have it generate comprehensive architectural tests. “Our architecture has three layers: controllers, services, and repositories. Controllers can only call services. Services can call repositories and other services. Nothing should depend on implementation details in the ‘internal’ packages.”
Fifteen minutes later, you have a complete architectural test suite that enforces these rules and catches violations automatically. This is tests-as-specifications at the highest level of abstraction. You’re not just testing that functions work correctly—you’re testing that your entire system structure maintains the properties you intended. And you’re doing it with the same ease that you’d write a unit test.
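ArchUnit is Java-specific, but the underlying check is just an import scan. A hand-rolled Python sketch of the same idea (layer names are illustrative; for real Python projects a tool like import-linter does this properly):

```python
import ast
import tempfile
from pathlib import Path

# Which layers each layer may import from (self-imports always allowed).
ALLOWED = {
    "controllers": {"controllers", "services"},
    "services": {"services", "repositories"},
    "repositories": {"repositories"},
}

def violations(src_root: Path) -> list[str]:
    """Scan each layer's modules and flag imports of forbidden layers."""
    found = []
    for layer, allowed in ALLOWED.items():
        for path in sorted((src_root / layer).rglob("*.py")):
            for node in ast.walk(ast.parse(path.read_text())):
                if isinstance(node, ast.Import):
                    modules = [alias.name for alias in node.names]
                elif isinstance(node, ast.ImportFrom) and node.module:
                    modules = [node.module]
                else:
                    continue
                for module in modules:
                    top = module.split(".")[0]
                    if top in ALLOWED and top not in allowed:
                        found.append(f"{path.name}: {layer} may not import {top}")
    return found

# Demo: a controller that reaches past the service layer into a repository.
root = Path(tempfile.mkdtemp())
for layer in ALLOWED:
    (root / layer).mkdir()
(root / "controllers" / "users.py").write_text(
    "from services import user_service\nfrom repositories import user_repo\n"
)
(root / "services" / "user_service.py").write_text("import repositories\n")
print(violations(root))
```

Run it in CI and the quick-fix database call from a controller becomes a red build instead of a code-review nag.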
The Economic Shift
At $150/hour, the testing calculus was brutal. Every hour spent writing tests was an expensive hour not spent building features. Test coverage was a luxury good—you bought as much as you could afford, which was never quite enough.
The unspoken rule was: test the happy path, test the edge cases you can think of, write regression tests when bugs happen in production, and ship it.
This wasn’t because developers were lazy or didn’t care about quality. It was basic economics. Given finite time and finite budget, you optimized for shipping features that created business value. Comprehensive testing was the thing you’d do in your copious free time, which was scheduled for “never”.
At near-zero cost, this entire calculus inverts.
Writing tests is no longer trading feature velocity for quality. Tests become essentially free, which means the question shifts from “can we afford to write these tests?” to “why wouldn’t we write these tests?”
Property-based testing finds bugs you didn’t think of. Mutation testing verifies your tests actually work. Parametric E2E testing creates a regression suite that grows smarter over time. Metamorphic testing validates complex systems you couldn’t easily test otherwise.
None of this is new theory. We’ve known about these techniques for decades. What’s new is that we can finally afford to use them.
Tests as Documentation (And the Positive Feedback Loop)
Here’s a bonus insight that makes all of this even more powerful: comprehensive tests are specifications, and agentic coding tools thrive on specifications.
When you have property-based tests that enumerate the invariants your system should maintain, you’re not just catching bugs—you’re documenting what the code is supposed to do in a machine-readable, always-up-to-date format.
This creates a positive feedback loop with AI-assisted development:
1. You write comprehensive tests using AI assistance
2. Those tests serve as specifications for what the code should do
3. When AI generates new code, it can read those tests to understand the requirements
4. The AI-generated code is better because it has better specs
5. You can confidently write even more sophisticated tests because the code is better
6. Loop back to step 2
Compare this to the traditional development cycle:
1. Write minimal tests because time is limited
2. Specifications exist mostly in people’s heads or outdated docs
3. New code is written based on incomplete understanding
4. Tests catch some issues but miss others
5. Documentation gets more out of date
6. Loop back to step 2 (but worse)
The AI-enabled version is a virtuous cycle. The traditional version is a vicious one.
But Can AI-Written Tests Catch AI-Written Bugs?
The skeptical question I always get: “If AI is writing both the code and the tests, won’t it make the same mistakes in both places?”
With simple unit tests, that is absolutely a risk, particularly since the AI can fall into the same trap as humans of writing the test based on the implementation. For these more advanced methodologies, the answer is no, because these testing techniques are orthogonal to implementation details.
Property-based testing doesn’t care how you implement sorting, it just checks that the result has the properties a sorted list should have. Mutation testing validates that your tests catch changes to the code, regardless of who wrote that code. Metamorphic testing verifies relationships between inputs and outputs without checking correctness directly.
These approaches catch bugs precisely because they don’t make assumptions about how the code works. They check properties, invariants, and relationships. An AI that writes a subtle off-by-one error in the code is unlikely to make the same error in a property-based test that generates thousands of random inputs, precisely because the test isn’t trying to implement the same logic.
Is it perfect? No. Can you still ship bugs? Yes. But you’re dramatically more likely to catch issues before they hit production.
How to Sleep at Night at AI Speeds
Here’s what worries me about AI-assisted development: the velocity is intoxicating. You can ship features faster than ever before. You can prototype in hours what used to take weeks. You can experiment with architectural changes that would have been too expensive to try.
But velocity without quality is just moving fast toward catastrophe. We have to figure out how to trade this incredible new coding velocity for quality, or the doomsayers will be right.
The testing techniques I’ve described here—techniques we’ve known about for decades but couldn’t afford to use—are how you maintain quality while moving at AI speeds. They’re how you catch the bugs that inevitably slip through when you’re generating thousands of lines of code per day. They’re how you sleep at night knowing that your codebase is growing faster than you can personally review every line.
Some of this is tooling I’ve built into Bad Dave’s Robot Army. The test-automator subagent and /test-review command analyze your test suite and suggest improvements. The testing-focused agents can generate property-based tests, mutation tests, and parametric test suites. They can analyze test results and prioritize issues.
But the larger point stands independent of any particular tool: we’re entering an era where the testing techniques we always knew were better finally become practical. Where 100% coverage isn’t an aspirational goal but table stakes. Where property-based testing is standard practice, not an advanced technique mentioned in conference talks.
Future developers will look back at our era—the brief window when we could generate code at incredible speeds but were still testing like it was 2020—and wonder what we were thinking.
I’d rather not be on the wrong side of that judgment.
The /test-review command and testing-focused agents are part of Bad Dave’s Robot Army. If you’re building testing infrastructure for AI-assisted development—or just trying to sleep at night while shipping at AI speeds—I’d love to hear what’s working for you. This post as always was constructed with the able assistance of Claude Sonnet 4.5. “Write drunk, edit sober” works better than expected when the drunk one is made of math.

