Five Failure Modes in LLM Agents That Single-Turn Alignment Evaluations Don't Catch
I ran 384 adversarial trials across 8 attack categories, 6 agent configurations, and 4 interaction-length conditions. The main finding: mean violation rates rise 68.5% in relative terms from single-turn to 7-turn interactions (38.4% to 64.7%).
Motivation
Standard alignment evaluation methodology tests models on individual turns: present a prompt, observe the response, record pass or fail. This is clean, reproducible, and — I think — misleading for the specific case of agentic deployment.
The question I wanted to answer empirically: do safety constraints that hold under single-turn evaluation also hold when adversarial pressure accumulates across a multi-turn interaction? The short answer is no, and the gap is larger than I expected going in.
This is a small-scale empirical study — 384 trials, one researcher, no peer review yet. The findings are directional evidence, not settled results. I'm sharing them here because I think the methodology is worth scrutiny and the failure modes seem undercharacterized in the literature.
Experimental Setup
The evaluation framework (agent-redteam) tests LLM-based agents across a structured taxonomy of adversarial interaction classes, each run under four interaction-length conditions: 1-turn, 3-turn, 5-turn, and 7-turn. Attack categories included direct and indirect prompt injection, identity manipulation, privilege escalation, scope ambiguity exploitation, helpfulness override, compositional multi-step attacks, and trust propagation. Severity was scored on a 0–10 CVSS-like scale across 6 agent configurations varying tool access level and system prompt specificity.
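As a rough sketch of how the trial grid described above could be enumerated: the category names follow the list in the setup paragraph, but the configuration labels, the record layout, and the function name are hypothetical, not agent-redteam's actual API. The figure of two repetitions per cell is an inference from 384 / (8 x 6 x 4) and assumes a full factorial design, which the write-up does not state.

```python
from itertools import product

# Category names follow the list in the setup paragraph.
ATTACK_CATEGORIES = [
    "direct_prompt_injection",
    "indirect_prompt_injection",
    "identity_manipulation",
    "privilege_escalation",
    "scope_ambiguity_exploitation",
    "helpfulness_override",
    "compositional_multi_step",
    "trust_propagation",
]

# Hypothetical labels: the study says only that the 6 configurations vary
# tool access level and system prompt specificity.
AGENT_CONFIGS = [f"config_{i}" for i in range(1, 7)]

TURN_CONDITIONS = [1, 3, 5, 7]

# Assumption: 384 trials / (8 * 6 * 4) cells = 2 repetitions per cell.
REPETITIONS = 2


def build_trial_grid():
    """Enumerate one record per trial in a full factorial design."""
    return [
        {"category": c, "config": a, "turns": t, "rep": r}
        for c, a, t, r in product(
            ATTACK_CATEGORIES, AGENT_CONFIGS, TURN_CONDITIONS, range(REPETITIONS)
        )
    ]


trials = build_trial_grid()
print(len(trials))  # 384
```

If the real design was not fully crossed, the cell counts would differ even though the total still comes to 384.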
Results: Multi-Turn vs. Single-Turn Violation Rates
Interaction length | Mean violation rate | Relative change vs. single-turn
1-turn (baseline)  | 38.4%               | -
3-turn             | 51.2%               | +33.3%
5-turn             | 58.9%               | +53.4%
7-turn             | 64.7%               | +68.5%
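The comparison column is a relative change against the 1-turn baseline, not a difference in percentage points. A minimal sketch that recomputes it from the mean violation rates in the table (the function and variable names are illustrative):

```python
def relative_increase(rate, baseline):
    """Relative change of `rate` over `baseline`, in percent."""
    return (rate - baseline) / baseline * 100


# Mean violation rates (%) by interaction length, taken from the table.
mean_rates = {1: 38.4, 3: 51.2, 5: 58.9, 7: 64.7}
baseline = mean_rates[1]

for turns, rate in mean_rates.items():
    if turns == 1:
        continue
    rel = round(relative_increase(rate, baseline), 1)
    points = round(rate - baseline, 1)
    print(f"{turns}-turn: +{rel}% relative, +{points} percentage points")
# 3-turn: +33.3% relative, +12.8 percentage points
# 5-turn: +53.4% relative, +20.5 percentage points
# 7-turn: +68.5% relative, +26.3 percentage points
```

Reporting both forms avoids ambiguity: the 7-turn condition is 26.3 percentage points above baseline, which is a 68.5% relative increase.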
Full tool access raises violation rates by 109% relative to the no-tool baseline, which means rates slightly more than double. The mechanism appears to be that tool outputs create additional vectors for trust propagation and indirect injection.
The Five Failure Modes
Across the trials, violations clustered into five structurally distinct patterns.
1. Instruction override — Direct displacement of system-level constraints by user-turn instructions. More common at longer interaction lengths, likely because accumulated context reduces the effective salience of the original system prompt.
2. Trust propagation — Early turns establish a false trust signal. The model treats a claimed identity or authority as verified, and that trust persists into later turns where it enables violations that would have been blocked in turn 1.
3. Context drift — The implicit frame of the interaction shifts gradually across turns without any single turn crossing a refusal threshold. By turn 5 or 7, the model is operating under a substantially different implicit frame than the one established at the start. This failure mode is not detectable by single-turn evaluation at any individual point in the trajectory.
4. Scope ambiguity exploitation — Underspecified task boundaries in the system prompt leave room for adversarial reinterpretation, with the model defaulting to permissive interpretations under adversarial framing.
5. Helpfulness override — The model produces a constraint-violating output because it has been framed as the helpful or cooperative response. More common in configurations with detailed, helpfulness-emphasising system prompts.
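Context drift (failure mode 3) is the easiest of these to illustrate numerically. The sketch below is a toy model, not the study's scoring method: it assumes each turn can be given a hypothetical "frame shift" score and that a refusal triggers when a score reaches a threshold. Checking each turn in isolation never fires, while tracking the accumulated shift across the trajectory does.

```python
def per_turn_flags(shifts, threshold):
    """Single-turn view: a turn is flagged only if its own shift reaches the threshold."""
    return [s >= threshold for s in shifts]


def first_drift_violation(shifts, threshold):
    """Trajectory view: 1-based turn where the accumulated shift reaches the threshold."""
    total = 0
    for turn, shift in enumerate(shifts, start=1):
        total += shift
        if total >= threshold:
            return turn
    return None


# Hypothetical per-turn frame-shift scores for a 7-turn interaction (0-100 scale).
shifts = [10, 20, 20, 20, 20, 20, 20]
THRESHOLD = 100  # hypothetical refusal threshold

print(any(per_turn_flags(shifts, THRESHOLD)))    # False: no single turn crosses it
print(first_drift_violation(shifts, THRESHOLD))  # 6: cumulative shift reaches 110
```

The additive model is a simplification; the point is only that an evaluator needs state that persists across turns to see this pattern at all.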
Implications
The core implication is that alignment is a dynamic property across interaction sequences, not a static property of individual responses. Evaluations that sample single turns from an adversarial distribution will miss the highest-risk moments in multi-turn trajectories. This is directly relevant to how we evaluate safety in agentic deployment — autonomous agents, long-running assistants, and systems with delegated tool access are exactly the contexts where multi-turn interactions are the norm.
Security Is a Shared Responsibility, and the Browser Has to Own Its Part
Manufacturers, vendors, and platform builders have a responsibility to ship products that are secure by default. The enterprise browser is one of the few layers where that responsibility can be meaningfully exercised at scale.
The Problem With "Security as an Add-On"
The software industry has a long-standing habit of treating security as a feature to be layered on after the fact: a checkbox, an upsell, a professional services engagement. Products ship with permissive defaults. Hardening guides are published as afterthoughts. Security tooling is sold separately to the same organisations that just bought the product that created the exposure.
This model puts the burden entirely on the buyer. IT teams are expected to configure, harden, monitor, and respond, with limited staffing, competing priorities, and vendor documentation that assumes expertise they may not have. The outcome is predictable: misconfigured deployments, defaults that never get changed, and a security posture that reflects what was feasible rather than what was necessary.
The principle of secure by design and secure by default challenges this model directly. It says that the burden of security should sit closer to the people who build the product, not just the people who deploy it. Products should be designed from the ground up with adversarial conditions in mind. Default configurations should represent the safest reasonable state. Features that introduce risk should require deliberate enablement, not deliberate disablement.
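One way to picture "deliberate enablement, not deliberate disablement" is a policy object whose zero-configuration state is already the safe one. This is a minimal sketch with hypothetical field and function names; it does not describe any real product's configuration schema.

```python
from dataclasses import dataclass, replace

# Hypothetical set of capabilities that introduce risk when switched on.
RISKY_FEATURES = {"allow_unsigned_extensions", "allow_clipboard_export"}


@dataclass(frozen=True)
class BrowserPolicy:
    # Protective controls are active unless someone turns them off.
    phishing_detection: bool = True
    dlp_enabled: bool = True
    # Risky capabilities are inactive unless someone turns them on.
    allow_unsigned_extensions: bool = False
    allow_clipboard_export: bool = False


def enable_risky_feature(policy, feature, justification):
    """Return a copy of `policy` with one risky feature enabled; requires a stated reason."""
    if feature not in RISKY_FEATURES:
        raise ValueError(f"unknown risky feature: {feature}")
    if not justification.strip():
        raise ValueError("enabling a risky feature requires a justification")
    return replace(policy, **{feature: True})


default = BrowserPolicy()  # safest reasonable state with no configuration at all
relaxed = enable_risky_feature(default, "allow_clipboard_export", "approved by security team")
```

The design choice is that doing nothing leaves every protection on, and loosening anything takes an explicit, attributable step.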
What This Looks Like in Practice
Secure by design is not just a development philosophy; it's an architectural commitment. For a browser, it means threat detection is not a plugin, not an extension, and not a proxy that can be bypassed. It means the security model is part of the rendering engine's decision loop, not something listening at the network edge hoping to catch what slips through.
Secure by default means that when Spirex is deployed, the protective capabilities are active out of the box. DLP controls don't require a consultant to configure. Phishing detection doesn't need a signature subscription. Page scoring runs immediately, because waiting for configuration means the first day of use is the most dangerous.
This matters most at the point of deployment. Most breaches involving browser-layer attacks happen to organisations that had security tools deployed but not properly configured. The attack didn't defeat the tool; it found the gap the tool left open because someone hadn't finished a configuration task.
The Browser as a Multi-Faceted Security Layer
What makes the browser unusual as a security control point is that it sits at the intersection of identity, application context, user behaviour, and external content, all at the same time. No other component in an enterprise stack sees all four at once.
A network firewall sees traffic but not identity. An identity provider sees authentication but not what happens after. A DLP tool sees data movement but often not the page context that made it risky. An endpoint agent sees process activity but not the live content a user is looking at.
The browser sees the authenticated user, the specific application they're in, the exact content that has loaded, and the action the user is about to take, all in the same moment. That's a uniquely powerful position from which to enforce policy. But only if the browser is built to use it.
A consumer browser repurposed for enterprise use can't take advantage of this position because it wasn't built with enterprise policy enforcement in mind. An enterprise browser built on secure-by-design principles can enforce controls at exactly the right moment: not before the page loads, not after the data has moved, but at the point of interaction, when there is still time to intervene.
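To make the "all four signals in the same moment" argument concrete, here is a small sketch of a policy decision taken at the point of interaction. The context fields, app names, action names, and rules are all hypothetical and chosen only for illustration; real policy engines would be far richer.

```python
from dataclasses import dataclass

# Hypothetical policy inputs; none of these names come from a real product.
SENSITIVE_APPS = {"hr-portal", "finance-dashboard"}
EXFIL_ACTIONS = {"download", "screenshot", "paste_external"}


@dataclass(frozen=True)
class InteractionContext:
    user_role: str      # identity: who is authenticated
    app: str            # application context: which app the user is in
    page_trusted: bool  # external content: verdict on what has loaded
    action: str         # user behaviour: what the user is about to do


def decide(ctx):
    """Return "allow" or "block" for a single interaction."""
    # Credential entry on untrusted content is blocked regardless of identity.
    if not ctx.page_trusted and ctx.action == "submit_credentials":
        return "block"
    # Data leaving a sensitive app is allowed only for employees.
    if ctx.app in SENSITIVE_APPS and ctx.action in EXFIL_ACTIONS:
        return "allow" if ctx.user_role == "employee" else "block"
    return "allow"


print(decide(InteractionContext("contractor", "hr-portal", True, "download")))  # block
print(decide(InteractionContext("employee", "hr-portal", True, "download")))    # allow
```

Neither rule could be evaluated by a component that sees only one of the four signals, which is the point the passage above makes.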
The Responsibility That Vendors Need to Accept
Governments and standards bodies are beginning to formalise what secure by design and secure by default mean in practice. CISA's guidance, the EU Cyber Resilience Act, and the UK NCSC's product security frameworks all point in the same direction: vendors are accountable for the security properties of what they ship.
This is a reasonable position. A browser vendor knows far more about the attack surface of their product than any individual deploying it. They have the engineering resources to harden it, the telemetry to know where attacks land, and the distribution mechanism to push protective defaults to every installation simultaneously. The buyer has none of these advantages.
For the enterprise security market to mature, vendors need to stop selling security as an optional layer and start treating it as the baseline. That means shipping products where the secure configuration is the default configuration, where threat coverage is built into the architecture rather than bolted onto it, and where customers aren't left to figure out hardening on their own.
The enterprise browser is one of the few places where this commitment can be made meaningfully, and where the effect of getting it right is felt directly by every user in an organisation every day. That's a significant responsibility, and one worth taking seriously.
Zero Trust · Mobile Workforce · 2025
The Browser as the Final Frontier of Zero Trust
A perspective on how enterprise browsers like Spirex complement zero trust adoption, the mobile workforce shift, and the evolving reality of user behaviour at the edge.
Zero Trust Was Never Finished at the Perimeter
Zero trust architecture has reshaped how enterprises think about access. The model is clear in principle: verify identity, enforce least privilege, assume breach. But in practice, most zero trust deployments focus on the gate: who gets in, from which device, through which network path. What happens after authentication is largely invisible to those controls.
A user who has passed every identity check can still open a browser tab, land on a phishing page that mirrors their corporate login portal, and hand over credentials to an attacker. A contractor with a valid session can take a screenshot of a sensitive document, paste it into a personal email draft, and walk out the door. None of this generates an alert in a network-layer zero trust tool because the tool stopped watching the moment access was granted.
The browser is where authenticated sessions meet live web content. That intersection, user intent meeting arbitrary external code, is where most modern attacks actually land. Zero trust frameworks that don't extend into the browser session are architecturally incomplete.
Spirex fills that gap by applying the same principles inside the session that zero trust applies at the boundary: verify the context of every page, enforce least privilege on what users can do within apps, and treat every browsing event as potentially hostile until signals prove otherwise. It doesn't replace identity providers, ZTNA, or SSE; it adds a control point that activates where those tools stop.
The Mobile Workforce Changed the Equation
Enterprise security was built around a predictable model: managed devices, fixed locations, a defined network perimeter. The mobile workforce dismantled all three simultaneously. Users work from home networks, airport lounges, hotel WiFi, and personal devices. The managed device is increasingly the exception, not the rule.
Network-layer controls struggle with this reality. A VPN assumes a managed endpoint. A corporate proxy assumes traffic flows through a known path. Neither assumption holds when your workforce is distributed across unmanaged environments on uncontrolled networks.
The browser, however, is everywhere. It runs on managed and unmanaged devices alike. It's the interface through which most enterprise work actually happens: SaaS applications, cloud storage, internal tools, communication platforms. If you can put security controls inside the browser itself rather than around the network it runs on, you have enforcement that travels with the user regardless of where they are or what device they're using.
For a mobile workforce, a secure enterprise browser isn't a complementary control; it's the primary one. The network perimeter is gone. The browser is what remains.
User Behaviour Is the Attack Surface
Technical controls often assume adversarial external actors. But a significant share of security incidents involve ordinary user behaviour: clicking a convincing phishing link, pasting sensitive data into an AI tool, downloading a file to a personal machine, or reusing credentials across accounts. These aren't failures of policy; they're failures of enforcement at the moment of action.
URL reputation filtering and email gateways address a subset of these scenarios, but only upstream of the browser. Once a page has loaded, those controls are no longer in the loop. The browser has to make its own judgment about what the user is looking at and what actions are about to happen.
Spirex's page scoring engine evaluates live content, not just the URL, at load time and continuously as the DOM evolves. Brand-to-domain mismatches, off-domain credential capture forms, client-side injection patterns, hidden fields, and suspicious redirects are all evaluated in context. DLP controls (screenshot prevention, download blocking, upload watermarking) engage at the moment of interaction, based on what the user is doing, not just where they are.
This is what it means to build security around user behaviour rather than around network topology. The enforcement point moves from the infrastructure to the interaction.
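To give a feel for what content-based page scoring involves, here is a deliberately simplified sketch covering three of the signals named above: brand-to-domain mismatch, off-domain credential capture, and hidden fields. The function names, weights, thresholds, and input shapes are all assumptions made for illustration and do not reflect how Spirex's engine works. It scores a single snapshot of a page, whereas the engine described above re-evaluates as the DOM changes.

```python
from urllib.parse import urlparse


def host_matches(host, domain):
    """True if `host` is `domain` itself or one of its subdomains."""
    return host == domain or host.endswith("." + domain)


def score_page(page_url, brand_domains, forms, hidden_field_count=0):
    """Return (risk_score, reasons) for one page snapshot.

    brand_domains: legitimate domains of the brand the page appears to imitate.
    forms: list of dicts with "action" (URL or path) and "has_password" (bool).
    Weights are hypothetical.
    """
    host = urlparse(page_url).hostname or ""
    score, reasons = 0, []

    # Page presents a brand but is not served from that brand's domains.
    if brand_domains and not any(host_matches(host, d) for d in brand_domains):
        score += 40
        reasons.append("brand_domain_mismatch")

    # A password form that posts to a different host than the page itself.
    for form in forms:
        action_host = urlparse(form["action"]).hostname or host
        if form.get("has_password") and action_host != host:
            score += 40
            reasons.append("off_domain_credential_form")
            break

    # Unusually many hidden inputs (hypothetical cut-off).
    if hidden_field_count > 5:
        score += 10
        reasons.append("many_hidden_fields")

    return score, reasons


suspicious = score_page(
    "https://login.examp1e-corp.test/signin",
    ["example-corp.test"],
    [{"action": "https://collector.test/post", "has_password": True}],
)
print(suspicious)  # (80, ['brand_domain_mismatch', 'off_domain_credential_form'])
```

A legitimate login page on the brand's own domain with a relative form action scores 0 under these rules.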