La tua strategia SecOps è pronta per il 2026?

Scarica l’Offensive Security Benchmark Report 2026 per scoprire quali minacce saranno al centro dell’attenzione quest’anno.

No items found.
Trend delle minacce
-
10
mins read
-
March 18, 2026

The authority layer: why AI agent security becomes a systems problem

-
- -
The authority layer: why AI agent security becomes a systems problem

When automation starts acting, the boundary moves

Most debates about AI start in the wrong place. They start with correctness. Can the model answer the question? Can it summarise the document? Can it reason through a task?

That matters, but only up to a point.

The real shift happens when the system stops generating text and starts doing things. It:

opens a browser → Calls an API → Writes to a file → Creates a ticket → Pushes a commit.

At that point, the security problem changes shape. You are no longer dealing with a model that might be wrong. You are dealing with a system that can act while wrong.

That shift is now visible in the public record. In January 2026, NIST and its Center for AI Standards and Innovation put out a request for information on securing AI agent systems, explicitly focusing on the distinct risks that appear when model outputs are combined with the functionality of software systems. That is the right frame. Once an AI system carries authority, the interesting question is no longer only what it says. The interesting question is what it can do, what it can touch, and what happens when it misfires.

A simple analogy helps: a chatbot can hallucinate and still leave production untouched. A junior employee can also make mistakes, but if you give them credentials, tools, and access to workflows, those mistakes have side effects. AI agents look much closer to the junior employee case, except they operate at machine speed, across many systems, and increasingly through multiple cooperating agents. That is why "secure the model" is too narrow a goal. The unit that needs securing is the agent system.

That is the thread running through a cluster of recent papers on agent security, multi agent coordination, offensive evaluation, and autonomous cyber defence. They approach the problem from different directions, but they keep landing on the same point. The real boundary is not the prompt. It is the authority layer around tools, identities, memory, and shared state.

The attack surface moves with authority

Once agents are given the ability to act across systems, the attack surface stops being defined only by exposed services. It starts to follow authority. Wherever the agent can read, write, trigger a workflow, or invoke a tool, the security boundary expands with it.

A useful place to start is Li et al. (2026), Security Considerations for Artificial Intelligence Agents. The paper does something many discussions skip. Instead of focusing on model behaviour alone, it begins with consequences.

Its framing through the CIA triad is helpful because it brings agent security back onto familiar ground for defenders. Confidentiality failures are no longer limited to databases and APIs. They can occur through tool outputs, shared workspaces, memory entries, browser sessions, or webhook responses. Integrity failures are no longer only unauthorized writes by humans or malware. They can emerge when an agent selects the wrong tool, applies the wrong parameter, or acts on poisoned context. Availability failures shift as well. Long running workflows, retries, and chained dependencies introduce new ways for systems to stall or behave unpredictably under pressure.

The paper also makes a point that security teams need to hear clearly. Many of our controls were built for a pre-agent world. They assumed software behavior was relatively narrow, predictable, and easy to map back to explicit logic. Agent systems break that comfort. They act on behalf of users, often with broad privileges, and they do so in ways that are partly planned at run time rather than hard coded in advance. Least privilege and fine grained access control still matter. They just need to be rethought for systems that delegate, chain actions, and change course mid-flow.

The adversary model widens too. In the paper's framing, attackers are not only the obvious external intruder hammering an exposed service. They also include external content providers, component providers, insiders, client side adversaries, and network attackers. That matters because it reframes exposure. In an agent system, any untrusted content channel that can influence the model may become part of the attack surface. A web page. An email. A ticket. A shared document. A pull request description. If the agent can ingest it, reason over it, and let it shape subsequent actions, it belongs in scope.

That is why indirect prompt injection matters so much. The problem is not exotic, it is embarrassingly ordinary. The model sees untrusted instructions inside content retrieved during normal work and cannot reliably separate those instructions from legitimate data. Once you accept that limitation, the defensive posture changes. You stop hoping for a magic prompt that keeps the model pure. You start designing boundaries that separate instruction channels from data channels and limit how far untrusted content can steer tool use, parameters, and downstream actions.

One detail in the paper is especially grounding. It points to CVE 2026 25253 in OpenClaw, a one-click remote code execution issue tied to token exfiltration through a trusted workflow. The key lesson is not that this specific bug involved an LLM making a bad decision. It is that agent security can fail in the surrounding system even without model driven behavior at the centre of the exploit. In other words, the old bugs did not disappear. Agents simply added more routes by which they can matter.

Multi-agent systems inherit distributed systems failures

Split authority across several agents and you get a different class of problem. One agent can misfire. Several agents, each with partial context and some ability to act, can drift, race, and step on each other. Once they start passing messages, working in parallel, and touching shared state, the security question starts to look a lot like distributed systems security.

Mieczkowski et al. (2026), Language Model Teams as Distributed Systems, is useful here because it cuts through a lot of the hand-waving. The paper argues that distributed systems theory gives us a better way to reason about teams of language models, and the results bear that out. In decentralized, self-coordinating teams, the authors observe the same problems distributed systems engineers have been dealing with for years: communication overhead, consistency conflicts, architectural tradeoffs, and stragglers.

That matters for security. If agents have partial observability, they will act on stale state. If they coordinate through message passing, they will misread one another. If they operate concurrently on shared state, they will overwrite, race, and collide. And, if one agent gets something wrong, that error does not necessarily stay contained. It can propagate through the rest of the workflow. The paper's observations of concurrent writes, rewrites, and temporal consistency violations are exactly the kinds of behaviors that become integrity failures once agents are allowed to edit code, alter configuration, or execute operational tasks.

In distributed systems, the slow worker often determines the pace of the whole job. Agent systems get stragglers for their own reasons: long reasoning paths, slow tools, messy context, brittle dependencies. Then the workflow starts waiting, retrying, or moving ahead with only part of the picture. That is when security problems creep in. Actions get replayed. State becomes ambiguous. One agent keeps working from assumptions another agent has already invalidated. Availability and integrity failures stop looking like edge cases. They start to look like normal coordination failures in a system that was never as clean as its demo.

Put plainly, the moment you move from one agent to many, you are no longer designing a clever prompt chain. You are designing a distributed system that happens to speak English.

Offensive research shows where agents break

Distributed systems theory helps explain what breaks when multiple agents have to coordinate. Offensive research gets at a different failure. What happens when an agent keeps going without knowing whether the path it is on is going anywhere.

Deng et al. (2026), What Makes a Good LLM Agent for Real-world Penetration Testing?, matters here because penetration testing is unforgiving. An attack chain either progresses or it dies. The authors review 28 LLM based penetration testing systems and test five representative implementations across three benchmarks of increasing complexity. The split they draw is the part worth keeping. Type A failures come from missing capabilities, weak tooling, or poor tool knowledge. Type B failures are tougher. They remain even after the tooling improves, because the real weakness sits in planning and state management.

The paper argues that Type B failures share a root cause: agents do not estimate task difficulty in real time. They commit too early to bad branches, fail to recognize when reconnaissance is enough to move forward, and burn through context while chasing dead ends. The authors report that adding task difficulty assessment cuts Type B failures without changing Type A. So the old instinct of "just give the agent better tools" only gets you part of the way there.

For defenders, the security lesson sits one layer above the benchmark. Planning quality is also an authority control. An agent that cannot tell whether a path is tractable will keep trying. In a real workflow, "keep trying" means more tool calls, more writes to state, and more chances to leak data, corrupt systems, or trigger an action that never should have been taken. That last step is my inference from the paper rather than the paper’s own wording, but it is the obvious one once the agent has permission to act.

The proposed architecture makes the broader point clear. Serious agent behavior needs typed tool interfaces, some way to judge difficulty before committing effort, and a memory layer that sits outside the model’s conversational context. In other words, reliable agents need explicit state and control around the model. The model is only one part of the system.

ZeroDayBench and the end of magical thinking

By now the pattern should feel familiar. Agents struggle when they have to coordinate. They struggle again when they cannot tell whether a path is worth pursuing. ZeroDayBench takes those weaknesses and puts them in a setting where there is nowhere to hide: novel vulnerability discovery and patching in real code.

That is what makes the paper so useful. ZeroDayBench asks agents to find and patch 22 novel critical vulnerabilities in open source codebases, and the result is sobering. The benchmark reveals that frontier agents still do not solve these tasks reliably on their own. The authors take real CVEs and transplant them into different but functionally similar codebases, which makes the task about transfer and reasoning rather than recall. Then they evaluate patches the way a security team would: does the live exploit still work after remediation, or not.

Too much of this space still slips into abstract talk about capability. This benchmark keeps pulling the question back to something harder to spin. Did the system actually block the exploit. The paper also tests agents across five levels of information visibility, from something close to a true zero-day setting up to full information. The results make the near-term picture fairly clear. With sparse information, the agents do badly. Give them more context and they can improve a lot.

That matters because it argues for a very different deployment posture than the hype cycle wants. In the near term, these systems make more sense as assisted defenders than autonomous ones. They can help narrow the search space, localize faults, draft patches, and support remediation. But it is not the same as handing over broad authority and expecting reliable autonomous hardening across unfamiliar codebases. We are not there.

Set against the previous sections, the point sharpens. Coordination failures show up once agents work together. Planning failures show up once they have to choose a path under uncertainty. ZeroDayBench shows what happens when both of those limits collide with a hard security task that does not forgive bluffing. It is one of the clearest recent arguments against magical thinking in agent security.

Securing the authority boundary

By this point, the shape of the problem is pretty clear. Agents do not fail in one neat way. They fail like software systems fail, except now the failure sits closer to authority. The job is not to make the model pure. The job is to build a boundary that holds when the model is wrong, when coordination breaks down, and when untrusted input gets further into the workflow than it should.

That shifts the engineering goal. You are not chasing perfect reasoning. You are limiting blast radius, narrowing what can be touched, and making it obvious when the system is drifting outside safe bounds. Across the papers, the same answer keeps showing up in different forms: treat agent security as boundary engineering.

If you want a workable blueprint, start with six moves.

First, map the system as an authority graph. Do not start with the prompt. Start with the things that can actually happen: tools, connectors, runtimes, shared workspaces, memory stores, identities, and permissions. Which component can read what. Which one can write. Which one can trigger downstream action. If you cannot draw the authority graph, you probably do not understand the risk surface yet.

Second, separate instructions from data at the boundary. This is the uncomfortable lesson behind indirect prompt injection. The model cannot reliably tell the difference between trusted instructions and untrusted content. So the system has to do that work instead. Treat external text as data unless there is an explicit reason not to. Keep retrieved content away from decision rights wherever you can. Do not let an email, ticket, document, or web page quietly become part of the control plane.

Third, make tool interfaces narrow and typed. A lot of agent systems still expose tools as if flexibility were the main goal. Usually it is the opposite. Broad, ambiguous interfaces make it easier for the model to misuse a tool and harder for the rest of the system to constrain what happened. Narrow interfaces force cleaner contracts. Typed inputs make validation possible. Both reduce authority in practice, not just in theory.

Fourth, treat agent identity and authorization as first-class controls. This part is easy to underbuild because it feels like plumbing. It is not. If an agent can act across systems, identity is part of the security model, not just the deployment model. What identity is the agent using. How far does that identity travel. What can it delegate. What can it approve. Where does human authority end and agent authority begin. Those are design questions now, not afterthoughts.

Fifth, design for coordination failure. Once multiple agents share work, consistency stops being a nice property and starts becoming a security concern. Shared state needs conflict handling. Parallel actions need ordering rules. Message passing needs clear semantics. There has to be some answer to the question of what happens when two agents disagree about the world. If your design assumes a single clean execution path, it will not survive concurrency for long.

Sixth, prove the boundary with exploit-grounded testing. This may be the most important one. A lot of agent security still gets discussed as if design intent were enough. It is not. You need tests that look like the thing you are trying to prevent. Can untrusted content steer tool choice. Can an agent exceed the authority you meant to grant it. Can two agents corrupt shared state under pressure. Can a bad patch still leave an exploit path open. The boundary is only real if you can make it fail in a controlled setting and show that the controls hold.

That is the common thread running through all the previous sections in this article. The answer is not better prompting in isolation. It is narrower authority, clearer state, stronger interfaces, and testing that reflects how systems actually break. Put more bluntly, agent security is not a model contest. It is a controls problem.

External exposure is no longer just the internet

One last boundary shift matters, especially if you think about exposure the way security teams usually do. In agent systems, "external" no longer stops at open ports, public endpoints, and internet-facing services. It also includes outside content that can reach the model and shape what happens next.

That sounds abstract until you make it concrete. A web page. An email. A ticket. A shared document. A pull request description. If an agent can ingest that content, reason over it, and let it influence a tool call, then that surface belongs in scope. The central paper makes this point clearly by treating external content providers as real adversaries, not background noise. That is the right call. Once untrusted content can steer internal authority, exposure stops being just a network question.

That is why continuous validation stops being optional. If authority is the real boundary, then you need repeatable tests at the point where external input meets internal action. Can a crafted page change tool selection. Can a poisoned ticket nudge the agent toward the wrong parameter. Can untrusted text make it further into the workflow than you intended. Those are the questions that matter. Design intent is not enough. You need proof that the boundary holds when the inputs are hostile and the workflow is messy.

This is also why the topic feels familiar to application security people. OWASP's Top 10 for Agentic Applications treats agentic systems as systems that plan, act, and make decisions across workflows. Good. That is a much better frame than pretending the whole problem lives inside the model. The same goes for the NIST work on agent security and agent identity. The direction of travel is pretty clear now. The security community is slowly starting to treat agents like systems with authority, which is what they are.

And that is why this topic sits naturally with Hadrian. Hadrian's whole posture is built around discovering what is exposed and continuously validating what an attacker can actually do with it. Agent security needs the same posture. The only difference is that the exposure path may now be a malicious document or poisoned workflow input, not just an open service on the internet.

That is the practical takeaway I keep coming back to: prove the boundary, then automate inside it.

Five questions are worth asking before you trust an agent with real authority:

  1. Where does authority enter the system today: credentials, tool calls, memory writes, or approval steps?
  2. Which external content channels can reach the agent right now: web pages, tickets, email, docs, chat logs, or pull request text?
  3. If two agents disagree about shared state, which one wins, and how would you know?
  4. If the agent commits to the wrong branch, what limits how many actions it can take before a human steps in?
  5. What is one exploit-grounded test you could run repeatedly to prove the boundary holds, instead of just assuming it does?

If you are deploying agentic workflows and want to move from theoretical safety to measurable assurance, connect with me on LinkedIn. I work with security teams to turn autonomy into something testable, using adversarial validation, structured evaluation, and CTEM thinking to continuously pressure test high privilege tool use, context propagation, and real attack paths, not just assume the boundary holds.

{{related-article}}

The authority layer: why AI agent security becomes a systems problem

{{quote-1}}

,

{{quote-2}}

,

Related articles.

All resources

Trend delle minacce

AI turns simple FortiGate gap into breaches

AI turns simple FortiGate gap into breaches

Related articles.

All resources

Trend delle minacce

Open-source AI is accelerating the weaponization cycle

Open-source AI is accelerating the weaponization cycle

Trend delle minacce

Client-Side Template Injection in GitBlit

Client-Side Template Injection in GitBlit

Trend delle minacce

AI turns simple FortiGate gap into breaches

AI turns simple FortiGate gap into breaches

get a 15 min demo

Start your journey today

Hadrian’s end-to-end offensive security platform sets up in minutes, operates autonomously, and provides easy-to-action insights.

What you will learn

  • Monitor assets and config changes

  • Understand asset context

  • Identify risks, reduce false positives

  • Prioritize high-impact risks

  • Streamline remediation

The Hadrian platform displayed on a tablet.
No items found.