Security solutions | 9 mins read | April 14, 2026

The AI offensive security boom: Seventy tools in eighteen months


Hadrian's research team has cataloged 70 open-source AI penetration testing tools as of March 2026. Fewer than five existed before GPT-4's release in April 2023. The remaining 65-plus have launched since, the bulk of them within the past 18 months, spanning autonomous end-to-end agents, vulnerability discovery and exploit generation, AI-assisted binary reverse engineering, recon and attack planning platforms, guardrail-free language models trained on offensive security data, LLM red-teaming frameworks, and CTF agents that serve as proving grounds for autonomous exploit capability.

The scale of this ecosystem matters, but it is not the most important thing about it. What matters is how these tools operate: not sequentially, the way a human pentester works, but in parallel across an entire attack surface at once.

A penetration tester, even a very good one, works in series. They scan a target, wait for results, interpret the output, decide what to try next, execute, and repeat. Each step depends on the one before it. AI removes that dependency. It runs reconnaissance across every subdomain, port, and service simultaneously. It tests every known exploit against every discovered endpoint concurrently. It does not context-switch, does not lose track of what it found three hours ago, and does not deprioritize a target because it got distracted by a more interesting one.
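The series-versus-parallel contrast can be sketched in a few lines of Python. This is an illustrative toy, not any real tool's implementation: the hosts and ports are placeholders, and a real agent would be orchestrating scanners and exploits rather than bare TCP probes. The structural point is that the parallel version's wall-clock time is bounded by the slowest single probe, not the sum of all of them.

```python
import concurrent.futures
import socket

def probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def probe_sequential(targets):
    # Human-style workflow: one target at a time, each step waits
    # for the previous one to finish before starting.
    return [probe(host, port) for host, port in targets]

def probe_parallel(targets, workers: int = 32):
    # Agent-style workflow: every target is checked concurrently,
    # so total time tracks the slowest probe, not the sum of probes.
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda t: probe(*t), targets))
```

Both functions return the same results; only the time cost differs, and the gap widens linearly with the number of targets.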

That difference, sequential versus parallel execution, is the reason the economics of offensive security have changed so dramatically. The cost figures are not just about cheaper compute. They reflect a fundamentally different operational model.

The cost of attacking has moved by orders of magnitude

In February 2026, a research team published benchmark results for Excalibur, an LLM-based penetration testing agent built on PentestGPT V2. The task was a realistic Active Directory engagement: five hosts, multiple domains, real lateral movement required. The agent compromised four of the five hosts. The total cost in LLM API fees was $28.50. The agent was not working through those hosts one at a time. It was running multiple exploitation paths across the environment concurrently.

A manual penetration test of equivalent scope, conducted by a credentialed firm, runs somewhere between $15,000 and $50,000 depending on complexity and engagement length. The agent did most of the same work for the price of a lunch.

The Excalibur figure is not an outlier. RapidPen, published in February 2025, achieves IP-to-shell access in an average of 200 to 400 seconds at a cost of $0.30 to $0.60 per run. CAI, the open-source cybersecurity agent framework from Alias Robotics, ran a structured comparison against expert human testers and found a 156-times cost reduction, with a representative engagement costing $109 versus $17,218 for the human-led equivalent, while running 3,600 times faster. AutoPentester, benchmarked in October 2025, outperformed PentestGPT by 27 percentage points on subtask completion with 39.5% more vulnerability coverage.

Each of these systems uses a different architecture and targets different parts of the kill chain. What they share is that the marginal cost of executing a known attack chain against a known target is trending toward zero, and the reason is parallelization. These systems do not replicate a human pentester's workflow at lower cost. They execute offensive operations across every reachable surface simultaneously, collapsing what would take a human team days or weeks into minutes.

The implication is that a competent attacker can probe your external perimeter for a few dollars. The volume of offensive activity that was previously constrained by labor is now constrained almost entirely by infrastructure costs.

The autonomy gap is real, but parallelization changes the math

A paper published in June 2025 by Víctor Mayoral-Vilches established a six-level taxonomy of AI pentesting autonomy, running from simple scripted automation at the lowest levels to fully unsupervised, goal-directed autonomous operation at Level 5. The paper's assessment of where current tools actually sit: Level 3 to 4. That means AI systems today can plan and sequence known attack techniques, adapt within a defined scope, and complete multi-step tasks without human guidance at each step. What they cannot do reliably is discover genuinely novel vulnerabilities, execute deceptive or adaptive campaigns against a live defender, or operate across unfamiliar environments without significant failure.

CAI's benchmark numbers led Mayoral-Vilches six months later to describe the trajectory toward cybersecurity superintelligence as "no longer speculative." Other researchers have taken a more measured approach. The Excalibur paper states directly that "fully autonomous penetration testing remains distant." A systematic literature review published in December 2025, covering 58 peer-reviewed studies on AI in penetration testing, found exactly one real-world deployment at scale: ESA's PenBox, used in space mission ground systems. The academic literature is extensive. The transition to production is not.

Both positions miss the more operationally significant point. You do not need Level 5 autonomy to present a serious threat if you are running Level 3 capabilities across hundreds of targets at once. A single AI agent executing known attack chains is a manageable problem. A thousand instances of that same agent, running in parallel against every exposed service in your environment simultaneously, is a categorically different one. The threat is not one brilliant AI conducting a sophisticated zero-day campaign against your organization. It is a mediocre attacker that never sleeps and probes everything at once.

The relevant threat model is not about what AI can do against one target. It is about what AI can do against all of your targets, simultaneously, around the clock.

Where AI performs across the attack chain, and where it does not

The benchmark literature from 2024 through early 2026 produces a consistent finding: AI offensive capability is not uniformly distributed across the attack chain. It is concentrated at the front end, degrades through the middle, and is effectively absent at the back.

Reconnaissance is the strongest phase by a significant margin. PentestAgent reports 100% task completion across all LLM backbones for vulnerability analysis and intelligence gathering. Every major framework tested in the literature, including AutoPentester, RapidPen, Excalibur, and xOffense, achieves near-ceiling performance on scanning and enumeration. Dedicated recon tools have pushed further into specific sub-tasks: Subwiz, released in November 2024, applies fine-tuned language models specifically to subdomain enumeration and has demonstrated meaningfully higher discovery rates than traditional wordlist-based tools on the same targets.

This is where parallelization has its most dramatic effect. Reconnaissance is largely pattern-matching and data aggregation, tasks that play to language model strengths. But the advantage is not just accuracy. It is doing it across your entire external surface in minutes rather than the days or weeks a human team would need to achieve the same coverage. Every subdomain enumerated, every CVE matched, every service fingerprinted, all running concurrently.

Initial access and exploitation present a sharper split. Against known vulnerabilities with CVE descriptions, GPT-4 achieves an 87% exploitation rate in controlled conditions. Excalibur reaches 85% on 104 real-world web vulnerability challenges. The numbers should be taken in context: in 2025, CVE-Bench found state-of-the-art agents exploit only 13% of critical-severity real-world web application CVEs in production environments. AutoPenBench pegs fully autonomous success at 21%, rising to 64% with human assistance. The gap between a known vulnerability with a description and a real-world CVE in a production context is still wide.

But parallelization narrows it operationally. A 13% success rate against one target is low. A 13% success rate run simultaneously against every known CVE across every exposed service in your environment will find something. Volume compensates for precision when execution is effectively free.
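The arithmetic behind "volume compensates for precision" is the standard independent-trials calculation. This back-of-envelope sketch assumes attempts are independent, which real scans against a shared environment are not, so treat the numbers as directional rather than exact:

```python
def hit_probability(p: float, n: int) -> float:
    """Probability that at least one of n independent attempts
    succeeds, given per-attempt success probability p."""
    return 1 - (1 - p) ** n

# With a 13% per-target exploitation rate, the chance of at least
# one success climbs steeply as the number of exposed targets grows:
# n=1 gives 13%, n=20 gives roughly 94%, n=50 is effectively certain.
```

Under this simplification, an attacker facing fifty CVE-matched services needs no individual attempt to be reliable; the fleet does the work.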

Privilege escalation has been demonstrated but is inconsistent without human steering. PenTest2.0 shows multi-turn adaptive privilege escalation on Linux targets using retrieval-augmented generation, but reports sensitivity to prompt structure and semantic drift across turns. HackingBuddyGPT benchmarks GPT-4-Turbo at 33% success on Linux privilege escalation without guidance, rising to 83% with high-level human guidance.

Lateral movement is where the most interesting current work is happening. Excalibur's Active Directory testing, run across five Windows Server VMs in a two-forest, three-domain environment, produced four of five hosts compromised at $28.50 per engagement. The Cochise reproduction study confirmed that every experimental run compromised at least one AD account, at operational costs ranging from $0.10 to $11.64 per run depending on model. The fifth host in Excalibur's test, the forest root domain controller, remained unsolved because of token constraints on a complex multi-step chain.

Defense evasion, persistence, command-and-control, and exfiltration have essentially no demonstrated AI capability. PACEbench found no model could autonomously bypass realistic cyber defense layers. The December 2025 systematic literature review of 58 peer-reviewed studies confirmed these phases receive almost no research attention. This is the most operationally significant gap: the capabilities that distinguish a pentest from a real intrusion remain firmly human-dependent.

Two consistent failure modes appear across systems. Type A failures stem from capability gaps, either missing tools or incorrect syntax, and are addressable through engineering. Type B failures, weaknesses in planning and state management, result in agents committing too early to unproductive branches, failing to estimate task difficulty in real time, and exhausting context pursuing dead ends. These are harder to resolve. Blind SQL injection, which demands exactly this kind of sustained multi-step state tracking, records 0% success in MAPTA's testing despite the system's 76.9% overall performance on that benchmark.

{{cta-maturity-model}}

The ecosystem: MCP and a weaponization clock running on weeks

The most significant infrastructure development in this period is not a new model or framework. It is a protocol. Model Context Protocol, released by Anthropic in late 2024 and adopted by Amazon, Microsoft, Google, and OpenAI by mid-2025, has become the standardized interface connecting LLM reasoning to real-world offensive tools. Kali Linux now ships an official MCP server package, bridging AI agents to nmap, Metasploit, SQLMap, Gobuster, and Hydra. HexStrike AI connects LLMs to over 150 cybersecurity tools via MCP.

The consequence is composability at scale. Any MCP-compatible model can orchestrate any MCP-compatible security tool without custom integration work. Contributors building new capabilities need only expose them as MCP servers. And critically, MCP is the infrastructure layer that makes parallelization operationally trivial. An agent does not need to be built with parallel execution in mind. It simply orchestrates multiple tool calls concurrently through a standardized interface.
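The orchestration pattern is easy to see in miniature. The sketch below is a hypothetical stand-in for an MCP-style setup, not the actual MCP client API: the tool names are real offensive tools, but `call_tool` is a mock that only simulates latency. What it illustrates is the structural point that once every tool answers through one uniform call shape, fanning out concurrently is a one-line `gather` rather than per-tool integration work:

```python
import asyncio

async def call_tool(name: str, args: dict) -> dict:
    """Mock of a uniform tool interface: every tool, regardless of
    what it does, is invoked through the same call shape."""
    await asyncio.sleep(0.01)  # stands in for real tool latency
    return {"tool": name, "args": args, "status": "done"}

async def orchestrate(tasks):
    # Parallel fan-out: all tool calls run concurrently, and the
    # orchestrator needs no knowledge of any individual tool.
    return await asyncio.gather(*(call_tool(n, a) for n, a in tasks))

results = asyncio.run(orchestrate([
    ("nmap",     {"target": "10.0.0.0/24"}),
    ("gobuster", {"url": "https://example.com"}),
    ("sqlmap",   {"url": "https://example.com/login"}),
]))
```

Swapping the mock for a real client is integration work done once, at the protocol layer; the fan-out logic never changes as tools are added.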

CyberStrikeAI illustrates what this composability looks like when it reaches production. Published on GitHub on November 8, 2025, by a developer with documented ties to Knownsec 404, a Chinese security firm with confirmed affiliations to China's Ministry of State Security, the tool integrates over 100 offensive tools with AI orchestration, YAML-defined attack recipes, and native MCP support across HTTP, stdio, and SSE transport modes. The developer's Git history shows a December 19, 2025 submission of CyberStrikeAI to Knownsec 404's Starlink Project, accepted via GitHub issue. On January 5, 2026, the developer added a CNNVD 2024 Vulnerability Reward Program Level 2 Contribution Award to their GitHub profile, then removed it. Git records preserved both actions. CNNVD is overseen by the 13th Bureau of China's MSS.

Within 37 days of first observed deployment, Team Cymru identified 21 unique IP addresses running CyberStrikeAI infrastructure across China, Singapore, Hong Kong, the United States, Japan, and Switzerland. By January and February 2026, the tool was confirmed in active attacks against Fortinet FortiGate appliances across 55 countries, with over 600 devices compromised.

The attack workflow was not novel: nmap and masscan for discovery, Metasploit modules for exploitation. What was novel was parallelized, autonomous orchestration of that workflow at scale. Not one operator working through targets sequentially. Automated agents running the same playbook against every reachable device simultaneously, with no human in the loop for each step. Six hundred devices across 55 countries is not a targeted campaign. It is what happens when a standard offensive workflow runs in parallel against everything it can reach.

Metasploit, first released in 2003, took years to become standard attacker infrastructure. Cobalt Strike, released in 2012, followed a similar curve. CyberStrikeAI moved from public repository to confirmed attack campaign in under two months. That compression reflects the maturity of the underlying toolchain, the low barrier to deployment via MCP, and an ecosystem in which contributors can build offensive capabilities faster than defenders can catalog them.

The skill floor has moved, and so has the population of potential attackers

In the autumn of 2025, a research team at a European university enrolled an undergraduate student with no background in cybersecurity or capture-the-flag competitions. The student was given access to CAI, the open-source cybersecurity AI framework, and entered Austria's national CTF competition. By the end of the competition, the student's AI-assisted performance fell within the range of intermediate competitors who had spent months preparing. The paper documenting the experiment, published in February 2026, describes the student completing challenges in binary exploitation, reverse engineering, and web vulnerabilities with no prior exposure to any of those domains.

This result is worth examining carefully, because it is not primarily a story about one student or one competition. It is a data point on a structural shift in the barrier to entry for offensive security capability.

The student did not outperform senior participants, and AI certainly did not elevate them to expert-level. What changed is the apprenticeship period, the months or years of accumulated domain knowledge that previously separated a motivated beginner from someone capable of executing non-trivial attack chains. AI compressed that gap significantly in a controlled setting. The research identifies a specific mechanism: AI handles tool selection, command syntax, and output interpretation, freeing the human operator to focus on high-level task direction rather than technical execution.

The same class of AI tooling that enabled an Austrian undergraduate to develop offensive security skills is now broadly accessible. WhiteRabbitNeo, since rebranded as Deep Hat, has released a second major version; the project has drawn over 140,000 downloads across versions. The new version was trained on 1.7 million offensive and defensive cybersecurity samples, up from 100,000 in its original release. The November 2025 Knownsec data leak documented a target database maintained by state-aligned contractors covering 379 million IP addresses, 3.5 million domains, and 24,000 organizations.

The cybersecurity industry has spent years articulating a skills gap measured in millions of unfilled positions globally. AI as a force multiplier for defenders is a legitimate response to that gap. The Austrian CTF study was framed exactly that way: AI could help close the talent shortage by compressing training timelines and making professional-grade analysis accessible to less experienced analysts. The complication is that the skills gap cuts in both directions: lowering the barrier to defensive capability and lowering the barrier to offensive capability are the same intervention.

Nation-states, open-source tools, and the deniability architecture

The CyberStrikeAI case is not an isolated incident. It represents an emerging pattern in how nation-state actors relate to the open-source AI offensive ecosystem, and that pattern has strategic implications that go beyond any single tool.

In November 2025, Anthropic disrupted what it described as the first publicly confirmed AI-orchestrated cyber espionage campaign. The operation, attributed to a Chinese state-sponsored group designated GTG-1002 by threat researchers, targeted approximately 30 organizations across technology, finance, chemicals, and government sectors globally. The AI component handled reconnaissance, vulnerability discovery, exploitation, lateral movement, credential harvesting, and data exfiltration. Human operators served in a supervisory role. Anthropic's own disclosure noted that the AI in use frequently overstated findings and occasionally fabricated data, claiming credentials that did not work. Even with those limitations, the operation demonstrated that AI-orchestrated intrusion at scale is no longer a theoretical concern. Google's Threat Intelligence Group separately confirmed APT31's use of HexStrike AI with Gemini for automated vulnerability discovery in February 2026.

Russia's posture has evolved differently. In mid-2025, CERT-UA observed APT28 deploying malware that queried an AI model in real time to determine next actions once inside compromised Ukrainian networks, effectively outsourcing command logic to an LLM rather than executing pre-coded instructions. This is a qualitatively different integration than using AI for pre-attack preparation. CrowdStrike's 2026 Global Threat Report documented FANCY BEAR's LAMEHUG, described as LLM-enabled malware automating reconnaissance and document collection. Ukraine's SSSCIP recorded 3,018 cyber incidents in the first half of 2025, up from 2,575 in the second half of 2024, with AI-generated malware showing what incident responders described as clear signs of AI creation.

What makes the CyberStrikeAI case analytically interesting is the architecture of the relationship between the developer and the state. The tool is open-source. Attribution to state purposes is inferential from Git history and organizational affiliations, not direct. The developer submitted the tool to Knownsec 404's Starlink Project through a public GitHub issue. The CNNVD contribution award, briefly visible then removed, suggests a legitimate vulnerability research relationship with the state apparatus. None of this constitutes evidence of a deliberate seeding program, and it would be wrong to assert that it does. But it describes a model in which state-aligned individuals publish dual-use offensive tools publicly, those tools diffuse globally through normal open-source channels, and the state maintains plausible distance from the resulting proliferation.

Kaspersky's Global Research and Analysis Team confirmed in 2025 that APT actors are increasingly incorporating open-source tooling into operations, in some cases abandoning custom and private toolsets. The incentive is clear: open-source tools carry attribution ambiguity, receive community development contributions that improve capability, and impose no cost. For defenders, the implication is that the threat model can no longer distinguish meaningfully between tools developed for the open-source community and tools that happen to be available to nation-state operators.

What the next 18 months likely look like

Projecting AI offensive capability requires distinguishing between the trajectory of benchmark performance, which is unambiguous, and the structural constraints that have resisted scaling, which are real.

The benchmark trend is steep and shows no plateau. A March 2026 paper measuring AI agent progress on multi-step cyber attack scenarios found that performance scales log-linearly with inference-time compute: moving from 10 million to 100 million tokens yields gains of up to 59% on complex tasks. On corporate network scenarios, average steps completed rose from 1.7 for GPT-4o in August 2024 to 9.8 for Opus 4.6 in February 2026. Cybench performance went from 17.5% to 93% across approximately 18 months. ARTEMIS, a system from Stanford, CMU, and Gray Swan AI evaluated in December 2025, placed second overall against human penetration testers on a live enterprise network of approximately 8,000 hosts, finding 9 valid vulnerabilities at 82% precision for $18 per hour versus $60 per hour for the human professionals it outperformed.

The zero-day economics are the most consequential shift on the horizon. In January 2026, an AI agent system found all 12 zero-day vulnerabilities in a new OpenSSL release. A separate AI agent swarm identified over 100 exploitable kernel vulnerabilities across AMD, Intel, NVIDIA, Dell, Lenovo, and IBM drivers over 30 days at a total cost of $600, roughly $4 per bug. DARPA's AI Cyber Challenge teams collectively found 54 new vulnerabilities in 4 hours. Google's Threat Intelligence Group tracked 90 zero-days exploited in the wild in 2025, up from 78 in 2024, with enterprise software targets at all-time highs. The time-to-exploit has compressed from 756 days in 2018 to 4 hours in 2024. The current trajectory suggests sub-day exploitation of novel vulnerabilities within the 12 to 24-month window.

When AI tools move from chaining known vulnerabilities to discovering novel ones, the nature of the defensive problem changes. Patch-based programs assume that a CVE will be published before exploitation occurs at scale. That assumption already fails for roughly 32% of exploited vulnerabilities, which show evidence of active exploitation before CVE publication. If AI-assisted zero-day discovery becomes sufficiently cheap that motivated actors can routinely find and exploit novel vulnerabilities before patches exist, the CVE-centric model of vulnerability management stops functioning as a primary control.

AI-versus-AI dynamics are beginning to emerge in controlled research settings, and the structural picture is not comfortable for defenders. Mayoral-Vilches and colleagues ran autonomous agents against each other in 23 attack-and-defense CTF battlegrounds. Under unconstrained conditions, defensive agents patched vulnerabilities successfully 54.3% of the time versus 28.3% for offensive initial access, a result that appears to favor defense. Under operational constraints, maintaining service availability and preventing all intrusions simultaneously, that defensive advantage disappeared entirely.

The reason maps directly to the parallelization problem. Offense has cheaper verification than defense. An exploit either works or it does not. Run a thousand of them simultaneously across every exposed surface and you only need one to succeed. Defense has to hold everywhere, all the time, while maintaining availability. That asymmetry does not improve with better models. It gets worse. As AI offensive tools become faster and more parallel, the defender's task scales linearly with the attack surface while the attacker's cost stays flat.
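The "defense has to hold everywhere" asymmetry also has a simple quantitative form. The sketch below assumes, simplistically, that per-asset defensive outcomes are independent, which is generous to the defender since real environments share credentials, software, and misconfigurations:

```python
def defense_holds_everywhere(d: float, n: int) -> float:
    """Probability that every one of n exposed assets resists attack,
    given a per-asset defensive success rate d and assuming
    (simplistically) independent per-asset outcomes."""
    return d ** n

# Even a control that works 99% of the time per asset fails somewhere
# across a large surface: at n=1000 the probability that defense holds
# everywhere collapses to a small fraction of a percent.
```

The attacker only needs the complement of this number to be nonzero; the defender needs the number itself to stay near one, and it decays exponentially with the size of the exposed surface.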

The direction of travel is clear. AI pentesting capabilities are improving quickly, and the constraints that remain are not strong enough to offset the rate of progress. Security programs built around known vulnerabilities, periodic testing, and delayed response cycles assume a world where the attacker works sequentially and slowly. That world is ending. The attacker now works in parallel, continuously, across everything you expose.

Programs that depend on time and visibility gaps between discovery and exploitation will become obsolete. Security programs must be grounded in continuous evidence under real conditions rather than inferred from coverage or policy.

The full catalog of 70 AI offensive security tools, the CyberStrikeAI weaponization timeline, and the benchmark data behind this analysis are available in Hadrian's open-source AI attack toolkit factsheet and briefing. No gate, no form.

{{cta-demo}}
