Administrator
Published on 2026-04-21 / 0 Visits
0
0

"The Enterprise AI Security Landscape in 2026: From Guardrails to Trusted Access Programs"

In February 2026, an autonomous AI agent built by CodeWall spent two hours inside McKinsey's Lilli platform and walked out with 46.5 million chat messages, 728,000 files, and data belonging to 57,000 accounts. The attack surface was not a zero-day in the traditional sense. It was 22 unauthenticated API endpoints and a SQL injection vulnerability hiding in a system that had passed conventional penetration testing. What made this breach unprecedented was not the volume of data, but the attacker: an AI agent acting with autonomous intent, exploiting gaps that no human pentester would have prioritized. This is the incident that forces us to stop pretending that the old security playbooks still apply. For more on how one company approached AI-era configuration security at scale, see our deep dive on Meta's approach to trust-but-canary configuration safety.

The Paradigm Shift: Guardrails Were Never Enough

Between 2023 and 2024, the enterprise answer to AI risk was guardrails. Input filters, output classifiers, content moderation APIs, prompt injection detection at the interface layer. These tools had merit, but they shared a fundamental design assumption: the threat was a human typing malicious input. Guardrails are reactive by nature. They work by recognizing bad behavior after it happens or by blocking known-bad patterns before they execute. Neither capability scales to agentic AI, where a system can take dozens of actions across multiple tools in minutes without pausing for human approval.

The security community saw this gap forming as early as late 2024. OWASP's LLM Top 10 for 2025, published in November 2025, tells the story through its additions: "Excessive Agency" entered the list at position six, describing AI systems taking unauthorized actions because no one told them they lacked the authority. "Vector and Embedding Security" appeared at position eight, a direct acknowledgment that retrieval-augmented generation had created a new attack surface that traditional input filtering completely ignores. "System Prompt Leakage" at position seven reflected the growing realization that a model's system prompt is a proprietary asset, not a configuration file.

Guardrails solved the problem they could see. Agentic AI introduced a class of problems that required rethinking the entire access model.

Four Provider Frameworks: A Comparative Analysis

Each major frontier AI provider has published a safety framework, and the differences between them reveal fundamentally different philosophies about where trust should live.

Provider Framework Version Core Mechanism Threat Model Focus
Anthropic Responsible Scaling Policy (RSP) v3.0 (Feb 2026) ASL levels, CEO approval, board notification CBRN capabilities, insider threats
Google DeepMind Frontier Safety Framework (FSF) v3.1 (Apr 2026) CCL thresholds, harmful manipulation tracking Belief/behavior manipulation, misalignment
OpenAI Preparedness Framework Apr 2025 Capability categories, risk tiers Autonomous replication, cyberoffense
Microsoft Frontier Governance Framework Feb 2025 Capability assessment, 3-tier mitigation Cross-tenant exfiltration, model manipulation

Anthropic's RSP v3.0, released in February 2026, represents the most governance-heavy approach. ASL-3 deployment requires sign-off from the CEO, the Responsible Scaling Officer, and board notification. The framework introduced explicit "trusted user" criteria for organizations that want de-safeguarded model versions, a concept that acknowledges access decisions must be relationship-based, not just technical. Notably, RSP v3.0 excludes "complex insider" and "state-compromised insider" scenarios from its threat model, a limitation the company acknowledges but has not resolved.

Google DeepMind's FSF v3.1, published in April 2026, expanded into "harmful manipulation": AI systems that change human beliefs or behaviors in targeted ways. This domain did not exist in earlier versions. FSF v3.1 also added Tracking Capability Levels (TCL), a mechanism for monitoring capability uplift over time rather than just assessing at launch. The framework's emphasis on misalignment risk applies both to external deployments and internal use, a broadening that reflects Google's growing concern about internal AI governance.

OpenAI's Preparedness Framework, last updated April 2025, organizes risk around capability categories: autonomous replication, cyberoffense, CBRN, and persuasion. Each category has defined thresholds and risk tiers. The framework is notable for its table stakes approach: systems that cross certain capability thresholds are presumed to require additional mitigations regardless of intended use. The challenge is that capability evaluation remains partially subjective, and OpenAI has not published the full evaluation methodology.

Microsoft's Frontier Governance Framework, from February 2025, takes a capability assessment approach centered on a three-tier mitigation structure. The tiers map to capability levels, with increasingly stringent requirements as capability increases. Microsoft's framework is the most explicit about cross-tenant data leakage as a primary risk concern, a position vindicated by the EchoLeak incident that would occur later that year.

None of these frameworks is complete. Each reflects the provider's specific threat model, customer base, and risk tolerance. The practical implication for enterprise security teams is that you cannot simply adopt one framework as your own. You must extract the principles that apply to your context and build a composite governance model.

OWASP LLM Top 10 2025: What Changed and Why It Matters

The 2025 update to OWASP's LLM Top 10 introduced changes that reflect a security community beginning to understand agentic AI's attack surface.

Prompt injection at position one is not new, but the 35% figure from Adversa AI's 2025 report makes it concrete: prompt injection accounts for more than a third of real-world AI security incidents. The attack is deceptively simple. An adversary injects instructions through a poisoned external data source, and the model executes those instructions as if they came from a legitimate system prompt. The consequences scale with the system's autonomy.

Excessive agency at position six is the defining addition. This vulnerability describes a system that has been given capabilities it should not have, or that lacks a mechanism to verify that its actions are authorized before executing them. In the McKinsey breach, CodeWall's agent had agency that no human attacker would have been granted: it could initiate API calls across multiple systems without per-request authentication. The lesson is not that the agent was malicious by design. The lesson is that an agent with broad access and no authorization checkpoint is a structural vulnerability, regardless of intent.

Vector and embedding security at position eight addresses a class of attacks that input filtering cannot touch. In a RAG architecture, the knowledge base is the product. When an attacker poisons the document store with malicious chunks, every subsequent query retrieves corrupted context. The attack persists through the retrieval pipeline and influences outputs in ways that input filtering never sees. This vulnerability is particularly dangerous in enterprise contexts where RAG systems are built on proprietary knowledge bases that attackers would value highly.

System prompt leakage at position seven codifies what many enterprises discovered through painful experience: the system prompt is not a configuration file. It contains proprietary reasoning patterns, security instructions, and often credentials that should never reach the user. When a model reveals its system prompt, it is not having a conversation. It is handing over the architecture document for the system that runs it.

NIST AI RMF: The Implementation Order Matters

The NIST AI Risk Management Framework, last updated with the Generative AI Profile (NIST-AI-600-1) in July 2024, provides a four-function structure: Govern, Map, Measure, and Manage. The framework presents these as a cycle, but the practical implementation order for most enterprises is different.

The recommended order is Map first, then Govern, then Measure, then Manage. You must understand what AI systems you have, how they connect to each other, and where sensitive data flows before you can govern them effectively. Governing AI without a map produces policies that miss entire systems. Measuring AI without governance produces metrics that nobody has the authority to act on. Managing AI without measurement produces interventions whose effectiveness you cannot verify. If your organization is still in the early phases of enterprise AI adoption, our guide on moving AI from pilot to production scale covers the maturity progression that typically precedes serious security investment.

The Colorado AI Act (SB 205), which references NIST AI RMF as its safe harbor benchmark with penalties of $20,000 per violation, makes the governance imperative concrete. It is no longer acceptable to say you are "evaluating AI risk." You need a documented, implemented framework. The NIST AI RMF provides the structure. Your implementation must fit your context.

Real Incidents: What the CVEs Actually Taught Us

The incidents of 2025 and early 2026 are not just data points. Each one exposed a specific category of failure that existing security practices were not designed to catch.

CVE-2025-53773 (GitHub Copilot), with a CVSS score of 9.6, allowed remote code execution on more than 100,000 developer machines. The attack vector was an extension marketplace vulnerability that let a malicious extension execute code with the same privileges as Copilot itself. The lesson is not that Copilot was poorly built. The lesson is that AI coding assistants introduce a new trust boundary: the extension ecosystem. Every extension you install runs with your identity. AI tools that can install or recommend extensions are expanding that attack surface in ways that traditional code review misses.

CVE-2025-32711 (Microsoft 365 Copilot EchoLeak) enabled cross-tenant data leakage. The vulnerability exploited the way Copilot accessed documents across different tenant contexts, allowing one organization's data to appear in another organization's query results. This was not a prompt injection attack. It was a bug in the authorization model for a system that had been granted broad document access across an entire organization. The lesson is that AI systems with organization-wide document access create blast radius that no human assistant would have. When you give an AI system access to everything, a single bug affects everyone.

CVE-2025-xxxx (Salesforce AgentForce ForcedLeak), discovered July 28, 2025, and fixed September 8, 2025, with disclosure on September 25, had a CVSS score of 9.4. The attack used indirect prompt injection to manipulate the CRM data exfiltration path. An attacker could embed malicious instructions in a data field that AgentForce would retrieve and execute during normal operations. The indirect injection is harder to detect than direct prompt injection because the malicious content enters the system through legitimate data channels, not through user input. The lesson is that your data is now a potential attack surface. Every field in every database record that an AI system can read is a potential injection vector.

These three incidents share a common thread: the vulnerability was not in the AI model's output. It was in the system surrounding the model: the extension marketplace, the authorization model, the data pipeline. Securing the model is necessary but not sufficient. You must secure everything the model touches.

Security Benchmarks: What the Numbers Mean

Evaluating AI security requires benchmarks, and the ecosystem has matured enough to have several worth knowing.

HarmBench, developed by the Center for AI Safety, evaluates models against a fixed set of harmful behaviors. It is the closest thing to a standardized red team for capability assessment. A model that performs poorly on HarmBench is not necessarily unsafe in deployment, but it has demonstrated capability in areas that most safety teams care about. For context on how frontier models like Claude Sonnet 4.6 perform on structured benchmarks, see our Claude Sonnet 4.6 deep dive for hands-on evaluation data.

AdvBench focuses on adversarial instruction following: how often a model obeys instructions to do harmful things when those instructions are embedded in longer, more sophisticated prompts. Prompt injection is a form of adversarial instruction following, and AdvBench provides a baseline for evaluating defenses.

BOLD (Bias and Open-endedness Language Dataset) evaluates responses to prompts about potentially sensitive social topics. It is relevant for content safety but does not directly address security in the sense of unauthorized access or data exfiltration.

SALAD-Bench is a more recent entrant that specifically targets agentic AI vulnerabilities, including excessive agency scenarios. It is the most relevant benchmark for enterprises deploying AI agents in production environments.

No benchmark is definitive. A model that scores well on all of them can still be exploited through attack surfaces that the benchmarks do not cover. Think of benchmarks as one input to your evaluation process, not the whole answer.

The Trusted Access Model: Moving Forward

The shift from guardrails to trusted access reflects a basic recognition: you cannot filter your way to security when the attacker is an AI agent making autonomous decisions at machine speed. The trusted access model starts from a different premise. Instead of asking "how do we block bad inputs," it asks "which systems and users have earned the trust to act on behalf of the organization."

Implementation follows a staged approach that fits organizations at different scales.

Startup level (under 50 employees): Deploy AI systems with minimal data access by default. Use API-based AI services rather than self-hosted models where possible, because the provider's trust infrastructure is more mature than what a small team can build. Implement basic input/output logging and keep it reviewed, even if reviews are manual.

Growth level (50 to 500 employees): Introduce formal capability assessment before deploying any AI system to production. Map your AI inventory: what data does each system access, what actions can it take, who can invoke it. Begin integrating AI security into your existing security review processes, not as a separate track but as a required checkpoint. Evaluate providers on their published safety frameworks, not just capability and price.

Enterprise level (over 500 employees): Adopt a composite governance model that draws from multiple provider frameworks. Conduct regular red team exercises specifically targeting AI attack surfaces, including prompt injection, excessive agency, and data pipeline attacks. Implement continuous monitoring for AI-specific anomalies: unusual data access patterns, unexpected API calls, anomalous retrieval events. Your security team needs at least one person with deep AI security expertise. This is not a luxury at enterprise scale. It is a structural requirement.

The common thread across all three levels is that security must keep pace with capability. A system that was safe to deploy in 2024 may not be safe to deploy in 2026 when it has been given agentic capabilities, access to more data, and integration with more external tools. Review your AI posture on the same cadence that you review your software deployments.

Practical Recommendations

Before you finish reading this, if your organization has deployed AI systems in the past 18 months, run one query: how many of those systems can initiate API calls without per-request authentication? If the answer is "I don't know" or "more than a few," you have a trusted access problem that guardrails will not fix.

Map your AI inventory before you write your next security policy. You cannot govern systems you have not enumerated.

Evaluate your providers' safety frameworks as seriously as you evaluate their uptime SLAs. The McKinsey breach happened through an AI provider's platform, and the lessons apply to any organization that treats AI vendor security as someone else's problem.

Add AI-specific attack scenarios to your red team schedule. Traditional penetration testing misses the gaps that agentic AI exploits.

Frequently Asked Questions

Q: How are AI security incidents different from traditional software security incidents?

AI security incidents tend to involve multi-step chains rather than single exploitation points. A traditional SQL injection might give an attacker a database foothold. A prompt injection combined with excessive agency can give an attacker a foothold, a document store, and the ability to exfiltrate data across multiple systems without any single step looking obviously malicious. The incident chain is also harder to reconstruct because AI model behavior is non-deterministic in ways that traditional software is not.

Q: Are guardrails completely useless now?

Guardrails are not useless. They are insufficient as your primary defense. Input filtering and content moderation remain valuable for handling known-bad patterns at the interface layer. But they must be combined with access governance, monitoring, and agent-specific controls. Think of guardrails as one layer in a defense-in-depth strategy, not the strategy itself.

Q: How do I evaluate whether my AI provider takes security seriously enough?

Ask for their published safety framework. Read it. Evaluate whether it addresses the risk categories that matter to your use case. Ask about their red team process, their incident response time, and whether they have had third-party security audits. A provider that cannot articulate their safety methodology is not necessarily unsafe, but one that can explain it clearly and has evidence of external review is more likely to have taken security seriously.

Q: What is the single highest-leverage action for improving AI security at a mid-sized company?

Map your AI inventory. Most organizations deploying AI in 2024 and 2025 did so with minimal tracking. They know they use Copilot and ChatGPT, but they have not enumerated the AI embedded in their CRM, their support platform, their code generation tools, and the Shadow AI that employees have connected without IT involvement. The first step to governing AI is knowing it exists. Once you have a map, the policy work becomes possible.

Q: How do trusted access programs relate to existing Zero Trust architecture?

Trusted access for AI is a specific application of Zero Trust principles to AI systems. Zero Trust says no system or user is trusted by default; trust must be earned and verified continuously. Applied to AI, this means your AI systems should not have persistent authorization to act across your organization. They should have scoped access granted for specific purposes, with explicit checks before high-risk actions. The difference from traditional Zero Trust is that AI systems can take chains of actions that individually look low-risk but collectively have high impact. Your Zero Trust implementation must account for this action-chain risk.


Comment