Lighthouses for the Agentic Coding Era
The reliability toolkit you already have, and why it matters more now
A fascinating post came over the transom this week from Rafael at NPC Inc., titled “Lost in Between Speed and Scale.” The core argument: organizations are at risk of failing under AI acceleration because “irreversible commitments propagate faster than shared understanding can update and intervene”. In terms of software development, this means that we are now capable of shipping changes at tempos faster than we can understand and adapt to those changes. This isn’t a problem unique to AI, or even a failing of AI per se, but simply a problem of mismatch between the systems evolved for a slower pace and the blistering pace at which AI-enabled systems can produce. He calls this “disorientation risk”—the gap between how fast you’re moving and how well you know where you are.
It’s a good frame. It’s better than “technical debt,” which he contrasts it with, because it doesn’t require anything in the code to be substandard or erroneous. You can ship perfect code, but if you’re shipping it faster than your organization can learn it, you are still at risk of disorientation. Nor is this a problem of “moving too fast,” which locates the problem in speed itself. The problem isn’t speed. The problem is speed without understanding. Thanks to agentic coding tools, developers can now ship software far faster than understanding of that software can propagate through their organizations.
Rafael draws on maritime history—steam propulsion created ships that could travel faster and farther between position fixes, accumulating risk invisibly until they hit something. The institutional response wasn’t to make captains more careful. If your response to a safety problem is “be more careful” or “apply more diligence” then you are admitting that you don’t have an answer before you even start. The response instead was to build navigation infrastructure: lighthouses, traffic separation schemes, standardized charts. Mechanisms that constrained movement under uncertainty and provided touchpoints for orientation.
He argues organizations need equivalent infrastructure for AI-speed knowledge production. I agree. I also think we already have much of it, hiding in plain sight in your site reliability engineering (SRE) playbooks. Boring stuff, but it’s about to become much more important.
The Toolkit You Already Have
If you’ve run production systems at scale, none of these site reliability patterns will surprise you. Netflix built them. Google wrote books about them. What’s changed is the urgency. These aren’t just reliability patterns anymore. They’re disorientation countermeasures.
Rate Limiters (Inbound and Outbound)
Rate limiters are the most direct implementation of “constrain how far commitment can propagate.” They’re a governor on the engine.
Inbound rate limiting is familiar: don’t let external traffic overwhelm your system. But outbound rate limiting is the interesting one for AI-era disorientation. When your system starts generating new sorts of effects—API calls, database writes, downstream requests—faster than you can reason about them, outbound rate limits keep you from crashing your dependencies.
An AI-built system that can make 1000 different API calls per minute can also make 1000 mistakes per minute. It probably won’t, but “probably” probably isn’t good enough. The rate limiter doesn’t make the agent smarter. It makes the agent’s mistakes survivable.
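The classic implementation is a token bucket: calls spend tokens, tokens refill at a fixed rate, and when the bucket is empty the call waits or fails. A minimal sketch (the class and the numbers are illustrative, not any particular library’s API):

```python
import time

class TokenBucket:
    """Outbound rate limiter sketch: allow at most `rate` calls per
    second, with bursts up to `capacity`. Illustrative, not production."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens for elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False    # over budget: mistake propagation stops here
```

Wrap every outbound call site in a check like `if bucket.allow(): send()`, and a runaway agent burns through its budget instead of your dependencies.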
Circuit Breakers (The Hystrix Pattern)
The circuit breaker pattern—popularized by Netflix’s Hystrix library—automatically halts requests to a failing dependency when error rates cross a threshold. The circuit “opens,” requests fail fast, and the system gets time to recover or alert.
In disorientation terms: the circuit breaker prevents cascade. It stops commitment from propagating through a system that’s already showing signs of being lost. You don’t need to understand why the downstream service is failing. You just need to stop throwing traffic at it until someone figures it out.
The key insight is that circuit breakers are automatic. They don’t require human judgment in the moment. The judgment happened earlier, when someone identified the fallible dependency and set the thresholds. This is what Rafael means by “navigation machines”—orientation embedded in infrastructure rather than dependent on real-time human attention.
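Hystrix itself is a Java library; the state machine underneath is small enough to sketch in a few lines of Python. The thresholds here are illustrative, and real implementations add a half-open state with more nuance:

```python
import time

class CircuitBreaker:
    """Circuit breaker sketch: open after `max_failures` consecutive
    failures, fail fast while open, retry after `reset_timeout` seconds."""

    def __init__(self, max_failures: int = 5, reset_timeout: float = 30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None   # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None   # half-open: allow one trial request
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()   # trip the breaker
            raise
        self.failures = 0   # success closes the loop
        return result
```

Note where the judgment lives: `max_failures` and `reset_timeout` were set by a human, once, ahead of time. The breaker itself never needs one.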
Exquisite Alerting
I use “exquisite” deliberately. Not just alerting. Exquisite alerting.
The difference: alert fatigue is itself a form of disorientation. When everything is alerting, nothing is alerting. You’re traveling at full speed with alarm bells ringing constantly, which means the alarm bells have stopped being useful navigation signals.
Exquisite alerting means the signal-to-noise ratio is high enough that an alert actually changes behavior. It means thresholds tuned tightly enough to catch real anomalies, with enough context to enable action. It means the on-call engineer can look at an alert and know what to do next, rather than beginning a diagnostic odyssey.
AI-speed production raises the stakes here. When code ships faster, when configurations change faster, when systems evolve faster, the time between “something is wrong” and “something is very wrong” can be very short indeed. Alerting that would have been adequate at human-speed deployment becomes inadequate at AI-speed deployment.
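One mechanical ingredient of a high signal-to-noise alert is requiring the bad condition to be sustained rather than momentary, so a single noisy sample never pages anyone. A toy sketch of that idea (the threshold and window values are illustrative):

```python
from collections import deque

class SustainedAlert:
    """Alerting sketch: fire only when the error rate stays above
    `threshold` for `window` consecutive samples. The point is noise
    suppression — one bad scrape should never page a human."""

    def __init__(self, threshold: float, window: int):
        self.threshold = threshold
        self.samples = deque(maxlen=window)   # sliding window of rates

    def observe(self, error_rate: float) -> bool:
        self.samples.append(error_rate)
        full = len(self.samples) == self.samples.maxlen
        # Fire only when every sample in a full window is over threshold.
        return full and all(s > self.threshold for s in self.samples)
```

Real systems layer on more (severity tiers, deduplication, runbook links in the alert payload), but the core discipline is the same: an alert should fire rarely and mean something when it does.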
Near-Instantaneous Rollback
This is the “make commitment reversible” play. If you can undo a deployment in seconds, the cost of a bad deployment drops dramatically. You can move faster because mistakes are cheaper.
The deeper point: rollback is a form of navigation. It’s the ability to say “we’ve gone off course, return to last known good position.” The faster you can execute that return, the less distance you travel while lost.
Blue-green deployments, feature flags, canary releases—these are all variations on the same theme. They make commitment staged and reversible rather than atomic and permanent. Each one is a navigation machine.
The AI-era pressure: when AI is generating code, the rate of potential bad or conflicting releases increases. Your rollback capability needs to match your deployment velocity. If you can ship ten times faster but can only roll back at the old speed, you’ve created a disorientation trap.
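A feature flag is the smallest version of this machinery: keep the old code path deployed, gate the new one behind a flag, and rollback becomes flipping a boolean rather than cutting a new release. A minimal sketch (the interface is illustrative, not any particular flag service’s API):

```python
class FeatureFlag:
    """Feature-flag sketch: the new code path is staged behind a flag
    that can be flipped off in seconds — rollback without a redeploy."""

    def __init__(self, enabled: bool = False):
        self.enabled = enabled

    def run(self, new_path, old_path):
        # Staged, reversible commitment: the old path stays deployed
        # as the known-good position to fall back to.
        return new_path() if self.enabled else old_path()
```

Canary releases are the same idea with a percentage instead of a boolean; blue-green deployment is the same idea at the level of whole environments.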
Recursive Component Reboot
This one’s less commonly implemented, but you know the pattern. When a component enters a bad state, automatically restart it. Don’t try to diagnose, don’t try to repair—just reboot and see if the problem resolves. “Did you try turning it off and on again?”, but codified as a bona fide engineering best practice.
It sounds crude because it is crude. It’s also remarkably effective. Many transient failures clear on restart. The component comes back up, resynchronizes with its dependencies, and proceeds normally. If there’s still a problem, try rebooting again, this time with a larger subassembly. It’s somewhat ridiculous, but that doesn’t prevent it from actually working. That said, if you need an academic paper to justify this, you can find a very good one here.
In navigation terms: recursive reboot is a way of saying “I don’t know where I am or how I got here, but I know how to get back to a known starting point.” It’s not orientation. It’s reorientation. The system doesn’t need to understand its state; it just needs to reset to a state it does understand.
The pattern extends beyond individual components. Chaos engineering—deliberately injecting failures to verify recovery—is recursive reboot as design philosophy. You assume disorientation will happen and engineer the system to recover without human intervention.
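The escalation loop itself is tiny. A sketch, assuming a hypothetical component interface in which `restart()` returns whether the component came back healthy and `parent` points to the enclosing subassembly:

```python
def recursive_restart(component, max_attempts: int = 3) -> bool:
    """Recursive-reboot sketch: restart a component; if it won't come
    back healthy, escalate to restarting its parent subassembly.
    Assumes a hypothetical interface: .restart() -> bool, .parent."""
    node = component
    while node is not None:
        for _ in range(max_attempts):
            if node.restart():    # back to a known-good starting state?
                return True
        node = node.parent        # escalate: reboot a larger subassembly
    return False                  # out of parents: time to page a human
```

No diagnosis anywhere in the loop—just successively larger resets toward a state the system does understand, with a human as the final escalation.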
Why These Matter More Now
Here’s the thing: these patterns are not new. They predate the current AI moment by a couple of decades. So why revisit them?
Because the failure mode they address is going to become much more common.
When humans write code, the rate of commitment is naturally throttled. Humans are slow. Code review adds delay. Deployment pipelines add delay. Testing adds delay. Documentation adds delay. Each delay is an opportunity for someone to notice “wait, this doesn’t seem right.”
AI removes many of those delays. Code appears faster. Deployment becomes more continuous. The natural pauses that once allowed orientation to catch up with commitment are being optimized away. Your release cycle risks becoming a denial-of-service attack on your infrastructure.
These SRE patterns are what remains after you’ve optimized away the human-speed delays. They’re the infrastructure-level backstops. Rate limiters, circuit breakers, alerting, rollback, recursive recovery—each one is a way of constraining commitment propagation without requiring real-time human judgment.
The organizations that already have this infrastructure, battle-tested and tuned, are better positioned for AI-speed development. They’ve already built the navigation machines. They just didn’t know that’s what they were called.
The organizations that are still relying on “careful developers” or “thorough code review” or “extensive manual testing” as their primary disorientation countermeasures are about to discover that those mechanisms don’t even remotely scale. When the rate of commitment exceeds the rate of human attention, you need infrastructure that doesn’t require human attention to function.
The Catch
Here’s what the SRE playbook doesn’t solve: these tools assume someone understood the system well enough to instrument it correctly.
Rate limiters require someone to know what rates are safe. Circuit breakers require someone to set appropriate thresholds. Alerting requires someone to know what conditions matter. Rollback requires someone to have built the deployment pipeline correctly in the first place.
In other words, these are navigation machines for systems whose builders were oriented. They encode past understanding into ongoing constraint. They don’t help when the builders themselves were lost.
This is where AI-generated code creates a new problem. If AI is writing code that humans don’t fully understand, who instruments the observability? Who sets the thresholds? Who decides what “normal” looks like?
The system-level navigation machines remain necessary. But they’re not sufficient. There’s another layer of disorientation risk—at the individual developer level—that infrastructure alone can’t address.
That’s the next post.
This post was inspired by Rafael’s “Lost in Between Speed and Scale” at NPC Inc., which is worth reading in full if organizational epistemology is your jam.
Constructed as always with the assistance of Claude Opus 4.5, who processes tokens at rates that would make a circuit breaker nervous.

