Late last year I was part of an incident review at a fintech I was advising. They had four AI agents running in production. When I asked the engineering lead to walk me through what each one could access, there was a long pause. He knew roughly what the agents did. He did not know which database credentials were in scope, who owned the rollback procedure, or whether any of the agents had outbound network access that nobody had explicitly intended to grant.
Nothing had gone wrong yet. But they were operating on luck.
That conversation is not unusual. Monte Carlo's "Agents in Production" report for 2026 found that 64% of enterprise teams deployed AI agents before they felt operationally ready. Most had the same gaps: no inventory, no rollback procedure, no cost attribution, no way to answer a pointed security question if one ever got asked. The Cloudflare restructuring earlier this year, which followed a 600% surge in API usage over the prior 18 months, is the clearest signal of what the downstream effects look like when an organization scales AI adoption faster than it scales the discipline around it.
This post is not a cautionary tale. The agents are already running. The question is what to do about it over the next 90 days.
Table of Contents
- The problem has a number now
- Step 0: Build the inventory
- Days 1-30: Security triage
- Days 31-60: Guardrails as code, not policy
- Days 61-90: Observability and cost attribution
- The 90-day sequence summarized
- What the 36% got right
- FAQ
The Problem Has a Number Now
For a while, the concern about AI agent discipline felt theoretical. Security researchers flagged prompt injection. Architects warned about scope creep. Production teams were shipping fast and things were mostly fine, so the warnings felt academic.
They are less academic now. 64% of enterprise teams deployed before they felt ready. That is not a survey of edge cases. That is the majority of teams doing serious enterprise AI work in 2026, a full year into meaningful production adoption. The gaps those teams reported are specific: no centralized inventory of what agents exist and what they can access, no defined rollback procedure when an agent does something unexpected, no attribution of token costs to specific features or teams, no governance on what data an agent is permitted to read or write.
What "not ready" looked like varied by company, but the pattern was consistent. A team builds an agent to automate something. It works. They ship it. Then the agent grows in scope informally. Someone gives it access to another database table. Someone adds a new tool. The agent starts handling cases it was not originally designed for, and nobody has a clear picture of the full blast radius if something goes wrong.
The Cloudflare data point deserves a moment of attention. A 600% increase in API usage over 18 months, followed by a restructuring that removed over 1,100 roles. That is not a company that moved slowly. The restructuring is the lagging indicator of decisions that were made well before the consequences were visible. The decisions happening now at most enterprises will have a similar lag before they surface.
First movers in enterprise AI are now at the hardening stage. Companies that deployed in 2024 and early 2025 have the most agents running, the longest tail of undocumented behavior, and in many cases the most exposure. This post is for engineering leads and platform architects at those companies.
Step 0: Build the Inventory
Before any security work, before any guardrails, before any observability tooling, you need to know what you have. Most teams cannot answer three basic questions: how many agents are running in production right now, what data can each one touch, and who approved each one.
I have asked these questions at several companies over the past year. The answers were almost always wrong in the same direction. Someone would say "I think seven or eight" and the real count, once someone looked, would be closer to fifteen. Agents do not announce themselves. They get built during a sprint, deployed because they work, and then run quietly until something breaks.
The agent registry does not need to be sophisticated. A spreadsheet is fine to start. The act of building it is the point. For each agent, capture the following:
| Field | What to Document |
|---|---|
| Agent name | What you call it internally |
| Trigger mechanism | Scheduled cron, webhook, user-initiated, event-driven, or chained |
| Data access scope | Every data source, with read vs. write distinction |
| Credentials held | Specific keys, service accounts, or tokens in scope |
| Owner | A specific person, not a team |
| Last reviewed date | When was the access scope last confirmed as intentional |
| Rollback procedure | How to stop it and undo its last N actions |
That last field is the most revealing. If you cannot write down a rollback procedure, you do not have one. And if you do not have one, you have an agent in production that you cannot safely undo.
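If the spreadsheet later moves into code, the registry fields above can be sketched as a small record type with a gap check. This is a minimal sketch, not a prescribed schema: the names `AgentRecord` and `registry_gaps`, and the 90-day review window, are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class AgentRecord:
    """One row of the agent registry; fields mirror the registry table."""
    name: str
    trigger: str                      # "cron", "webhook", "user", "event", or "chained"
    data_access: dict[str, str]       # data source -> "read" or "write"
    credentials: list[str]            # names of keys or service accounts, never the secrets
    owner: str = ""                   # a specific person, not a team
    last_reviewed: Optional[date] = None
    rollback_procedure: str = ""


def registry_gaps(record: AgentRecord, today: date, max_age_days: int = 90) -> list[str]:
    """Return human-readable gaps for one agent; an empty list means none found."""
    gaps = []
    if not record.owner.strip():
        gaps.append("no named owner")
    if not record.rollback_procedure.strip():
        gaps.append("no rollback procedure")
    if record.last_reviewed is None:
        gaps.append("access scope never reviewed")
    elif today - record.last_reviewed > timedelta(days=max_age_days):
        gaps.append("access scope review is stale")
    return gaps
```

Running `registry_gaps` over every record gives an honest list of what is still unknown, which is the output the inventory step is meant to produce.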
Finding undocumented agents takes real effort. Start with CI/CD configs and look for jobs that initialize an LLM SDK or call an inference endpoint. Search the codebase for framework initialization patterns specific to whatever SDKs your team uses. Audit API key usage logs from your LLM providers and look for keys generating calls you cannot account for. Review scheduled jobs and background workers. Agents hide in places that were never meant to be agent infrastructure.
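The codebase search can be roughed out with a short script. This is a sketch under stated assumptions: the SDK names and endpoint hosts in `SDK_PATTERNS` are placeholders to swap for whatever your teams actually use, and a match only marks a file as worth a human look, not as a confirmed agent.

```python
import re
from pathlib import Path

# Hypothetical patterns; replace with the imports and endpoints your teams use.
SDK_PATTERNS = [
    re.compile(r"^\s*(?:from|import)\s+(?:anthropic|openai|langgraph|langchain)\b", re.MULTILINE),
    re.compile(r"api\.(?:anthropic|openai)\.com"),
]


def find_candidate_agent_files(
    root: str, suffixes: tuple[str, ...] = (".py", ".yml", ".yaml")
) -> list[str]:
    """Return sorted relative paths under root whose text matches any pattern."""
    hits = []
    root_path = Path(root)
    for path in root_path.rglob("*"):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        try:
            text = path.read_text(encoding="utf-8", errors="ignore")
        except OSError:
            continue  # unreadable files are skipped rather than failing the scan
        if any(pattern.search(text) for pattern in SDK_PATTERNS):
            hits.append(path.relative_to(root_path).as_posix())
    return sorted(hits)
```

Each hit becomes a candidate row in the registry until someone confirms what it is and who owns it.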
The goal at this stage is visibility, not control. You cannot harden what you cannot see. The registry does not need to be perfect on day one. It needs to exist and be honest about the gaps.
Days 1-30: Security Triage
Security failures from production agent deployments cluster into two distinct categories: CVE-based vulnerabilities and design-based vulnerabilities. They require different fixes, but both are tractable.
CVE-based failures are the ones that show up in security bulletins. CVE-2026-39861, a sandbox escape in one of the more widely used agent execution environments, is a recent example. The fix for these is procedural but requires actually doing it: patch affected versions, verify sandbox configuration against the vendor's hardening guide, and explicitly restrict process-level filesystem access for every agent runtime.
Design-based failures are subtler and more common. These happen when an agent was given broader access than it needed because it was convenient during development and nobody scoped it down for production. A typical case is the agent that has write access to the production customer table because the developer used their own environment config during the hackathon. That is not malicious. It is just how fast-moving development works. The fix is minimum viable permissions applied per agent, not per environment, not per team. Per agent. Every agent gets exactly the credentials it needs for its specific job and nothing more.
The irreversibility flag pattern is worth implementing from the start of this phase. Any action an agent can take that cannot be reversed within five minutes requires explicit human review before execution. Not just an approval prompt that can be bypassed, but a genuine interrupt where a human sees the proposed action and confirms it. Deleting records, writing to production data stores, sending external communications, and exporting data all qualify. This is not an onerous operational requirement, but it dramatically reduces the consequence surface of an agent behaving unexpectedly.
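One way to picture the irreversibility flag is a gate that every action passes through before execution. This is a minimal sketch: the action kinds, `ProposedAction`, and the `ask_human` callback are hypothetical names, and a real reviewer step would be an asynchronous approval flow rather than a function call.

```python
from dataclasses import dataclass
from typing import Callable

# Action kinds treated as irreversible here; extend for your own systems.
IRREVERSIBLE_KINDS = {
    "delete_records",
    "production_write",
    "external_communication",
    "data_export",
}


@dataclass(frozen=True)
class ProposedAction:
    kind: str
    description: str


class ReviewRequired(Exception):
    """Raised when an irreversible action was not approved by a human."""


def execute_with_gate(
    action: ProposedAction,
    run: Callable[[ProposedAction], str],
    ask_human: Callable[[ProposedAction], bool],
) -> str:
    """Run reversible actions directly; route irreversible ones through ask_human first."""
    if action.kind in IRREVERSIBLE_KINDS and not ask_human(action):
        raise ReviewRequired(f"Not approved: {action.description}")
    return run(action)
```

The important property is that the gate sits in the execution path, so there is no code route to an irreversible action that skips the reviewer.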
One area that gets overlooked consistently: every file your agent reads is now a potential attack surface for prompt injection. If an agent reads a customer support ticket, a document, an email, or any content created by someone outside your organization, that content could contain instructions designed to redirect the agent's behavior. This is the class of attack sometimes called "Comment and Control." Content scanning for injected instructions before those inputs reach the agent's context is necessary, particularly for agents that can trigger irreversible actions.
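A first-pass content scan can be as simple as a pattern check in front of the agent's context. The sketch below is deliberately naive and the phrases in `INJECTION_PATTERNS` are illustrative assumptions, not a vetted ruleset. Pattern matching will miss paraphrased or encoded attacks, so treat it as one layer alongside the permission and review controls, not a replacement for them.

```python
import re

# Illustrative heuristics only; these phrases are assumptions, not a vetted ruleset.
INJECTION_PATTERNS = [
    re.compile(r"ignore\s+(?:all\s+)?(?:previous|prior|above)\s+instructions", re.IGNORECASE),
    re.compile(r"disregard\s+(?:the\s+)?(?:system|previous)\s+prompt", re.IGNORECASE),
    re.compile(r"you\s+are\s+now\s+(?:a|an|in)\b", re.IGNORECASE),
    re.compile(r"<\s*/?\s*system\s*>", re.IGNORECASE),
]


def scan_untrusted_text(text: str) -> list[str]:
    """Return the matched snippets that look like embedded instructions."""
    findings = []
    for pattern in INJECTION_PATTERNS:
        findings.extend(match.group(0) for match in pattern.finditer(text))
    return findings


def safe_for_context(text: str) -> bool:
    """True when no heuristic fired; a False result should quarantine the input."""
    return not scan_untrusted_text(text)
```

Quarantined inputs are worth logging with the agent and session identifiers, since a cluster of them is an early signal that someone is probing the agent.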
48-Hour Security Checklist
Run through this in the first 48 hours. It is not comprehensive, but it addresses the highest-impact gaps.
- Identify every agent that has shell or subprocess execution capability. In most production environments this should be zero unless you have an explicit, documented reason. If it is not zero, fix it now.
- Identify every agent that has outbound network access to arbitrary hosts. This should also be zero. Agents should be able to reach specific, enumerated endpoints, not the open internet.
- Confirm all database credentials used by agents are scoped to the specific operations that agent performs. Environment-level credentials create a blast radius problem.
- Replace any credential shared between an agent and other systems with a credential scoped only to the agent.
- Enable execution logging for every agent: file writes, shell commands, external API calls, and LLM calls, each tagged with agent ID and session context.
- Verify that LLM provider API keys used by agents are distinct from those used by your application layer. Shared keys mean shared blast radius.
- For each agent: list every action it can take that cannot be reversed within five minutes, and confirm a human review step exists for each such action.
- Audit which agents have write access to any production data store and confirm that write access was explicitly intended and is still needed.
The point of the 48-hour checklist is not to complete the security work. It is to close the worst gaps before you work through fuller remediation. Think of it as stopping the bleeding.
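For the outbound network item on the checklist, "specific, enumerated endpoints" can be sketched as a per-agent allowlist checked before any request leaves the process. The agent name and hostnames below are placeholders. An application-level check like this complements network policy at the infrastructure layer and does not replace it.

```python
from urllib.parse import urlsplit

# Hypothetical per-agent allowlists; the hostnames are placeholders.
EGRESS_ALLOWLIST: dict[str, set[str]] = {
    "support-triage-agent": {"api.anthropic.com", "tickets.internal.example"},
}


def egress_allowed(agent_id: str, url: str) -> bool:
    """Allow only https requests to hosts enumerated for this agent."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if parts.scheme != "https" or not host:
        return False
    # Exact host match: "api.anthropic.com.evil.example" must not pass.
    return host in EGRESS_ALLOWLIST.get(agent_id, set())
```

An agent with no entry in the allowlist gets no outbound access at all, which matches the default the checklist asks for.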
If you want to think more carefully about how the architecture of agent workflows shapes the security surface in the first place, the post on agentic LLM workflow patterns covers how structure-level decisions create or limit exposure at the design stage.
Days 31-60: Guardrails as Code, Not Policy
The governance instinct after a security review is to write a policy document. Something that says "agents must not access production databases directly" or "all agent-initiated external communications require prior approval." Policy documents are not useless. They are just insufficient for agents.
Agents do not read policy documents. Humans do. Humans are not in the loop on every agent action, which is precisely why you built the agent. A policy that relies on humans enforcing it at execution time is a statement of intent, not a control.
The work of days 31 to 60 is converting those statements of intent into code that runs on every execution.
Input validation is the first layer. Before an agent processes a task, the task input should be validated against a schema that defines what the agent is allowed to receive. If the input does not conform, the agent does not run. This sounds obvious but most teams skip it because the input is coming from "trusted" internal systems. Trust is not the issue here. Schema validation catches malformed inputs, edge cases, and the occasional injection attempt before they ever reach the agent's reasoning loop.
```python
from pydantic import BaseModel, field_validator
from typing import Literal


class SupportAgentTask(BaseModel):
    task_type: Literal["summarize", "classify", "draft_response"]
    ticket_id: str
    max_response_tokens: int = 500
    dry_run: bool = False

    @field_validator("ticket_id")
    @classmethod
    def validate_ticket_id(cls, v: str) -> str:
        # Reject anything that does not look like a legitimate ticket reference
        if not v.startswith("TKT-"):
            raise ValueError("ticket_id must follow TKT-* format")
        return v
```
Output classifiers are the second layer. For high-risk action categories, the agent's proposed action passes through a classifier before execution. High-risk categories are: any delete operation, any write to a production data store, any external communication (email, webhook, third-party API call), and any data export. The classifier does not need to be a large model. A faster, cheaper model with a focused prompt asking "does this proposed action fall within the approved scope for this agent type?" is sufficient. The goal is a second opinion before consequence.
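The shape of that second-opinion step might look like the sketch below. The `classify` callable stands in for the cheaper model call and is an assumption, as are the category strings. The one design choice worth copying is that the gate fails closed: if the classifier errors or times out, the high-risk action is blocked rather than waved through.

```python
from typing import Callable

HIGH_RISK_CATEGORIES = {"delete", "production_write", "external_communication", "data_export"}


def gate_action(
    agent_type: str,
    action: dict,
    classify: Callable[[str, dict], bool],
) -> str:
    """Return "execute", or "blocked" when a high-risk action fails classification.

    `classify(agent_type, action)` stands in for the cheaper model call and
    should return True only when the action is within the approved scope.
    """
    if action.get("category") not in HIGH_RISK_CATEGORIES:
        return "execute"
    try:
        in_scope = classify(agent_type, action)
    except Exception:
        # Fail closed: a classifier outage must not approve a high-risk action.
        return "blocked"
    return "execute" if in_scope else "blocked"
```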
Human-in-the-loop interrupts handle cases where automated classification is not enough. LangGraph 1.0 introduced typed interrupts as a first-class pattern for this. You declare checkpoint nodes in the agent's execution graph at points where irreversible actions could occur. Execution pauses, the proposed action is surfaced to a reviewer, and the agent resumes only on explicit approval. The interrupt should carry enough context for the reviewer to make a real decision: what the agent is about to do, why it decided to do it, and what will happen if they approve.
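```python
# LangGraph 1.0 typed interrupt at an irreversible action point
from langgraph.types import interrupt

# AgentState, apply_action, and reject_action are assumed to be defined
# elsewhere in the agent's codebase; they are not part of LangGraph.


def handle_production_write(state: AgentState) -> AgentState:
    proposed = state["pending_action"]
    # interrupt() pauses the graph (a checkpointer must be configured) and
    # surfaces this payload to the reviewer. It returns the value supplied
    # when the run is resumed, for example via Command(resume={...}).
    decision = interrupt({
        "description": "Agent is requesting a production write",
        "action": proposed,
        "reasoning": state["last_reasoning_step"],
        "reversible_within_5_min": False,
        "affected_record_count": proposed.get("record_count", "unknown"),
    })
    if decision["approved"]:
        return apply_action(state, proposed)
    return reject_action(state, decision.get("reason", "Rejected by reviewer"))
```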
Rollback procedures deserve their own focused effort. For every agent in your inventory, the question is: if I need to undo the last thing this agent did, what is the specific procedure and how long does it take? Not "can we recover with some effort" but a documented, rehearsed answer. If an agent cannot be given a credible rollback procedure, it should not be in production. An agent making permanent decisions with no recourse is a liability.
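For agents whose actions can be inverted in code, one way to make "undo the last thing this agent did" concrete is to record an undo step alongside every action. This is a minimal in-memory sketch with hypothetical names. A production version would need durable storage, and some actions (a sent email, for instance) have no undo callable at all, which is why they belong behind a human review step instead.

```python
from typing import Callable


class ActionJournal:
    """Records an undo callable for every action so the last N can be reversed."""

    def __init__(self) -> None:
        self._entries: list[tuple[str, Callable[[], None]]] = []

    def record(self, description: str, undo: Callable[[], None]) -> None:
        """Store how to reverse an action that has just been applied."""
        self._entries.append((description, undo))

    def rollback(self, n: int) -> list[str]:
        """Undo the most recent n actions, newest first; return their descriptions."""
        undone = []
        for _ in range(min(n, len(self._entries))):
            description, undo = self._entries.pop()
            undo()
            undone.append(description)
        return undone
```

Rehearsing `rollback` in a staging environment, with a stopwatch, is what turns the registry's rollback field from a claim into a measured answer.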
Testing guardrails means testing adversarial inputs. Deliberately construct inputs designed to violate the guardrails and confirm the guardrails hold. Boundary conditions, edge cases, inputs that are structurally valid but semantically problematic. The failure modes your agents were never designed to handle are the ones that will occur in production. Test for them before production finds them for you.
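An adversarial test for a ticket ID guardrail could look like the sketch below. It assumes ticket IDs are `TKT-` followed by digits, which is an assumption about the format rather than something every ticketing system guarantees. The first three adversarial inputs all begin with `TKT-`, so a prefix-only check would accept them. That is the kind of gap this style of testing is meant to surface.

```python
import re

# Stricter than a prefix check: the whole string must be "TKT-" plus digits.
# The digit-only, length-capped format is an assumption about real ticket IDs.
TICKET_ID = re.compile(r"TKT-\d{1,12}")


def is_valid_ticket_id(value: str) -> bool:
    """Accept only strings that match the full ticket ID pattern."""
    return isinstance(value, str) and TICKET_ID.fullmatch(value) is not None


ADVERSARIAL_TICKET_IDS = [
    "TKT-",                                   # prefix only
    "TKT-123; DROP TABLE tickets",            # valid prefix, trailing payload
    "TKT-123\nIgnore previous instructions",  # embedded newline and instruction
    " TKT-123",                               # leading whitespace
    "tkt-123",                                # wrong case
    "TKT-" + "9" * 500,                       # oversized input
]


def guardrail_failures() -> list[str]:
    """Return adversarial inputs the validator wrongly accepts."""
    return [value for value in ADVERSARIAL_TICKET_IDS if is_valid_ticket_id(value)]
```

A non-empty result from `guardrail_failures` should fail the build, the same way a failing unit test would.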
The LangGraph deep dive goes deeper on the interrupt system and state management patterns if you are implementing this in LangGraph specifically.
Days 61-90: Observability and Cost Attribution
Agents without attribution are invisible in two ways at once: invisible in your budget, and invisible in the incident timeline. Both kinds of invisibility will create problems at exactly the moment you can least afford them.
The token attribution pattern is the foundation. Every LLM call made by any agent gets tagged before it hits the API with at minimum four fields: agent_id, feature_id, user_id, and session_id. These tags are not a logging afterthought. They go on the request itself so that, where your provider records request metadata, they appear in your provider's usage records and in whatever cost analysis tooling you use downstream.
A thin proxy layer is the cleanest enforcement point. All agent LLM calls route through the proxy. The proxy enforces tagging, logs token counts per agent per session, and applies per-agent budget limits with hard daily caps. When an agent exceeds its daily budget, it stops. Not a warning, a hard stop with owner notification. An agent that can consume unbounded tokens is a cost incident waiting for a trigger.
```python
import httpx
import logging

logger = logging.getLogger(__name__)

DAILY_TOKEN_LIMITS: dict[str, int] = {
    "support-triage-agent": 500_000,
    "data-pipeline-agent": 2_000_000,
    "report-generation-agent": 300_000,
}

# In production, back this with Redis and a midnight TTL reset
_daily_usage: dict[str, int] = {}


def call_llm(
    *,
    agent_id: str,
    feature_id: str,
    user_id: str,
    session_id: str,
    payload: dict,
    api_key: str,
    base_url: str = "https://api.anthropic.com/v1/messages",
) -> dict:
    # Hard stop: refuse the call once the agent has used its daily budget.
    limit = DAILY_TOKEN_LIMITS.get(agent_id)
    if limit is not None:
        current = _daily_usage.get(agent_id, 0)
        if current >= limit:
            logger.error(
                "agent=%s daily_limit_exceeded current=%d limit=%d",
                agent_id, current, limit,
            )
            # Owner notification is not implemented in this sketch; trigger it here.
            raise RuntimeError(
                f"Agent {agent_id} has reached its daily token limit. "
                "Owner has been notified."
            )

    # Attach attribution tags to the request itself.
    # Check which metadata keys your provider accepts: some APIs validate this
    # object strictly (Anthropic's Messages API documents only `user_id`), in
    # which case keep the remaining tags in the proxy's own logs.
    tagged_payload = {
        **payload,
        "metadata": {
            **(payload.get("metadata") or {}),
            "agent_id": agent_id,
            "feature_id": feature_id,
            "user_id": user_id,
            "session_id": session_id,
        },
    }

    response = httpx.post(
        base_url,
        json=tagged_payload,
        headers={
            "x-api-key": api_key,
            "anthropic-version": "2023-06-01",
            "content-type": "application/json",
        },
        timeout=120.0,
    )
    response.raise_for_status()
    result = response.json()

    # Record usage after the call, so the cap is enforced on the next request.
    usage = result.get("usage", {})
    total = usage.get("input_tokens", 0) + usage.get("output_tokens", 0)
    _daily_usage[agent_id] = _daily_usage.get(agent_id, 0) + total
    logger.info(
        "agent=%s feature=%s session=%s input=%d output=%d",
        agent_id, feature_id, session_id,
        usage.get("input_tokens", 0),
        usage.get("output_tokens", 0),
    )
    return result
```
Token attribution is necessary but not sufficient. The full observability picture requires tracing every tool call, every file read, every external API call, with timing and outcome. You want to be able to reconstruct the complete sequence of decisions and actions for any agent session. OpenTelemetry spans work well for this if you tag every significant operation with the same identifiers used for token attribution, so you can correlate LLM usage with tool usage and data access within a single session trace.
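To show the shape of that correlation without pulling in a dependency, here is a standard-library stand-in. It is not the OpenTelemetry API: `traced`, `TRACE`, and `session_trace` are hypothetical names, and in a real deployment each entry would be an OpenTelemetry span with the same identifiers set as span attributes.

```python
import time
from contextlib import contextmanager

TRACE: list[dict] = []  # stand-in for a span exporter


@contextmanager
def traced(operation: str, *, agent_id: str, session_id: str, **attrs):
    """Record one span-like entry with timing and outcome for an operation."""
    entry = {"operation": operation, "agent_id": agent_id, "session_id": session_id, **attrs}
    start = time.perf_counter()
    try:
        yield entry
        entry["outcome"] = "ok"
    except Exception as exc:
        entry["outcome"] = f"error: {type(exc).__name__}"
        raise  # tracing must never swallow the failure it records
    finally:
        entry["duration_ms"] = round((time.perf_counter() - start) * 1000, 3)
        TRACE.append(entry)


def session_trace(session_id: str) -> list[dict]:
    """Reconstruct the ordered sequence of operations for one session."""
    return [entry for entry in TRACE if entry["session_id"] == session_id]
```

Wrapping every tool call, file read, and external request in `traced(...)` with the same `agent_id` and `session_id` used for token attribution is what lets a single query return the whole session.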
For regulated industries, "we have logs" is genuinely not sufficient. The requirement is structured, queryable, diffable records of what the agent decided, in order, with the reasoning that produced each decision attached. Logs tell you what happened. An audit trail tells you why each thing happened in a form a third party can verify. LangSmith handles this reasonably well for LangGraph-based agents. OpenTelemetry with custom instrumentation provides a vendor-neutral infrastructure-level view. Re_gent is a newer option purpose-built for agent audit trails. The choice depends on your existing stack and compliance requirements. What matters is that the choice is made and implemented.
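One common way to make records verifiable by a third party is to chain them with hashes, so that a later edit is detectable. The sketch below illustrates that general technique and is not how any of the tools named above work internally. The field names are assumptions.

```python
import hashlib
import json


def _digest(body: dict) -> str:
    """Stable SHA-256 over a record body (keys sorted for determinism)."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_record(trail: list[dict], agent_id: str, decision: str, reasoning: str) -> dict:
    """Append a record whose hash covers its content and the previous record's hash."""
    body = {
        "seq": len(trail),
        "agent_id": agent_id,
        "decision": decision,
        "reasoning": reasoning,
        "prev_hash": trail[-1]["hash"] if trail else "",
    }
    record = {**body, "hash": _digest(body)}
    trail.append(record)
    return record


def verify_trail(trail: list[dict]) -> bool:
    """Recompute every hash; an edited, reordered, or missing interior record fails.

    Truncation at the end of the trail is not detected by this check alone.
    """
    prev = ""
    for seq, record in enumerate(trail):
        body = {key: value for key, value in record.items() if key != "hash"}
        if record["seq"] != seq or record["prev_hash"] != prev or record["hash"] != _digest(body):
            return False
        prev = record["hash"]
    return True
```

Because each record is plain JSON with an explicit sequence number, the trail stays queryable and diffable while the hash chain covers the verification requirement.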
Once you have per-agent token data, you can calculate cost per productive unit: cost per document processed, cost per support ticket resolved, cost per successful extraction. This is the data that makes honest conversations about agent economics possible. Without attribution, those conversations are guesses.
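The arithmetic is simple once the token counts exist. In this sketch the per-million-token prices are parameters, not real price points. Take them from your provider's current price sheet.

```python
def cost_per_unit(
    input_tokens: int,
    output_tokens: int,
    units_completed: int,
    usd_per_million_input: float,
    usd_per_million_output: float,
) -> float:
    """Total LLM spend divided by productive units (tickets resolved, documents processed)."""
    if units_completed <= 0:
        raise ValueError("units_completed must be positive")
    spend = (
        input_tokens * usd_per_million_input
        + output_tokens * usd_per_million_output
    ) / 1_000_000
    return spend / units_completed
```

As a worked example with placeholder prices of 3 and 15 dollars per million input and output tokens: 4,000,000 input tokens and 1,000,000 output tokens cost 12 + 15 = 27 dollars, and across 1,000 resolved tickets that is 0.027 dollars per ticket. Counting only successfully completed units in the denominator keeps failed runs from flattering the number.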
The post on agentic app build and run costs covers the economic modeling for agents in more depth, including the token cost structures that make per-agent attribution so important at scale.
The 90-Day Sequence Summarized
The three phases build on each other. The dependencies run in one direction.
You cannot do security triage without the inventory because you do not know which agents to harden. You cannot write meaningful guardrails without knowing the security gaps because the guardrails need to address the actual risk profile of each agent. You cannot build useful observability without the guardrails in place because the observability system needs to know what normal looks like in order to flag deviations.
Days 1-30 are about knowing what you have and closing the worst security gaps. Build the inventory. Run the 48-hour checklist. Fix shell access, network access, and credential scope. Enable execution logging. By day 30, you should be able to state, for every agent in production, what it can access and what credentials it holds.
Days 31-60 are about replacing policy documents with guardrail code. Input validation schemas, output classifiers on high-risk action categories, human-in-the-loop interrupts on irreversible operations, and documented rollback procedures for every agent. The output of this phase is a codebase where the constraints are enforced by the system rather than trusted to humans who are not watching every execution.
Days 61-90 are about instrumentation. Token attribution on every LLM call with hard daily limits. Full trace coverage across tool calls and external API calls. Structured audit trails appropriate to your compliance requirements.
What "done" looks like is a specific question, not a feeling. At the end of day 90, you should be able to answer the CISO question without scrambling: for each agent running in production, what can it access, what is it explicitly prevented from doing, who owns it, and how would you know within five minutes if something went wrong? If you can answer that for every agent in your inventory, the 90 days were successful.
That is not a high bar in absolute terms. It is standard SRE practice applied to a new class of system. The reason it takes 90 days for most teams is not that the work is complex. It is that it has to be done carefully, in sequence, while the agents continue to run.
What the 36% Got Right
The teams that deployed with operational discipline did not move slower. They moved with different defaults.
The pattern that distinguishes them is treating agents like microservices from day one. Separate deployment pipelines per agent. Access controls defined as code and checked into the same repository as the agent logic, not added later. Runbooks written before the first deploy, covering both normal operation and the failure modes the team had thought through in advance.
This sounds obvious in retrospect. In late 2024, when the pressure to ship agents was intense and the operational frameworks for doing it carefully were still being worked out, it was not obvious at all. The teams that got it right were the ones that had already internalized SRE discipline from running microservices and applied it without being asked. The discipline was not retrofitted. It was part of the initial build.
The lesson is not to move slower. Pace of AI adoption is a real competitive factor and there is no version of this advice that says "you should have waited." The lesson is to bring your existing operational discipline with you. The agent deployment problem is not a new problem. It is a microservice deployment problem with a few new dimensions around non-determinism and unbounded action scope.
The tooling available now is meaningfully better than it was 12 months ago. LangGraph 1.0 has typed interrupts and first-class support for human-in-the-loop patterns. Mem0 provides persistent memory management without custom infrastructure. Re_gent addresses the audit trail problem directly. Google's Agent Development Kit provides sensible defaults for access scoping. The disciplined path is easier to take now than it was when most of the rushed deployments happened.
The 36% are not standing still. They are building on a reliability foundation while other teams spend engineering cycles on remediation. An agent infrastructure with good observability, well-defined guardrails, and a clean security posture can be extended faster and more safely than one without those properties. The gap is widening, not closing, because the 36% are still compounding while the 64% are paying down technical debt.
The 90-day sequence in this post is the shortest credible path to closing that gap for teams that already have agents in production. It is not the ideal path. The ideal path was taking this discipline seriously before the first deploy. But it is the realistic path from where most teams are right now.
FAQ
How do I prioritize which agents to harden first?
Use a two-dimensional risk score: sensitivity of data access multiplied by the gap in current monitoring coverage, so that a sensitive agent nobody is watching scores highest. The highest priority agents are the ones with production data access that are currently unmonitored. After that, prioritize by irreversibility of actions: agents that can delete or send external communications rank above agents that only read and summarize. Then by volume: agents running thousands of times per day rank above agents running a handful of times per week. The 48-hour checklist in the Days 1-30 section applies to every agent regardless of tier, but the order in which you do deeper remediation should follow this logic.
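That ordering can be sketched as a sort key. The sketch assumes the score is data sensitivity multiplied by the monitoring gap (one minus coverage), so that unmonitored agents with sensitive access rank first. The 1 to 5 sensitivity scale and the field names are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentRisk:
    name: str
    data_sensitivity: int       # 1 (public data) to 5 (regulated production data)
    monitoring_coverage: float  # 0.0 (unmonitored) to 1.0 (fully traced)
    irreversible_actions: bool  # can delete, write to production, or send externally
    runs_per_day: int


def hardening_order(agents: list[AgentRisk]) -> list[str]:
    """Sort by sensitivity x monitoring gap, then irreversibility, then volume."""
    def key(agent: AgentRisk):
        score = agent.data_sensitivity * (1.0 - agent.monitoring_coverage)
        # Negate so higher score and higher volume sort first; False sorts before True.
        return (-score, not agent.irreversible_actions, -agent.runs_per_day)
    return [agent.name for agent in sorted(agents, key=key)]
```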
What if my agents were built by a vendor, not internally?
Two tracks in parallel. Contractually, your vendor agreement should require that vendor-built agents meet the same security and observability controls you apply to internal agents. If your current contract does not include this, add it at the next renewal or renegotiation. Request their security documentation, confirm they have audit logging, and get written commitment on access scoping. Technically, regardless of what the vendor claims about their implementation, add a proxy layer you control between your infrastructure and the vendor's agent. You own the data. The proxy gives you enforcement capability and logging visibility independent of what the vendor's system does or does not record.
Is this overkill for a small team with only a few agents?
The sequence scales down significantly. For a team with three agents, the Days 1-30 phase is a half-day exercise, not a month. The 48-hour checklist takes an hour when you only have three agents to audit. The guardrail work in days 31-60 is proportionally simpler when the action surface is small. The observability work in days 61-90 is straightforward when you are tagging three agents instead of thirty. The value of the sequence is in the order of operations, not the calendar time. Skipping the inventory to jump straight to observability means building dashboards for agents you have not fully cataloged yet. That is a common mistake at every scale.
What happens if an agent fails during the hardening process?
That is the point. Failures during hardening are recoverable. You have the agent's scope limited, logging enabled, and you have not scaled the agent's usage under the assumption that it is fully safe. A failure mode discovered while deliberately probing the system during hardening is exactly what the hardening process exists to find. A failure mode discovered after you have scaled to 10x volume with the agent running unmonitored for months is a different category of problem entirely. Treat failures during the 90-day process as information. They are evidence the process is working, not evidence the work was a mistake.
How do I get executive buy-in for this work?
The Cloudflare data point is the most credible framing for this conversation. Their restructuring is the downstream consequence of operational decisions made roughly 18 months before the consequence became visible. The question to put to leadership is not "should we spend 90 days on this." It is "what decisions are we making right now that we will be responding to 18 months from now?" The hardening work is technical debt prevention taken at the point when prevention is still cheaper than remediation. Frame it that way. Debt prevention is an investment. Debt repayment is a cost. The 90-day sequence is the investment version, taken while the cost is still manageable.