The Oathe Threat Model: 10 Ways AI Agent Skills Go Wrong

We classified the attack vectors we see most often across agent skills. Here are all 10, organized by risk dimension.

Every time we audit an AI agent skill, we see the same failure modes. Different codebases, different authors, different ecosystems — but the same patterns of risky behavior showing up again and again.

So we decided to write them down.

Introducing the Open Threat Classification

The Open Threat Classification (OTC) is our public, citable taxonomy of the threats we observe across AI agent skills — MCP servers, plugins, tool-use integrations, all of it. It’s licensed CC-BY-4.0, published on GitHub, and designed to give the community a shared vocabulary for talking about agent-level risk.

We built the OTC because the security conversation around AI agents is still too vague. People say “prompt injection” and “data exfiltration” but mean different things. Threat taxonomies exist for traditional software (MITRE ATT&CK, OWASP Top 10), but nothing maps cleanly to the behavioral risks that emerge when autonomous agents run third-party skills at runtime. Agent security is, at its core, a behavioral problem, and the OTC is our attempt to give that problem precise names.

The OTC fixes that. Ten threat codes. Six scoring dimensions. One framework that anyone can reference.

The 6 Scoring Dimensions

When Oathe audits a skill, we score it across six behavioral dimensions. Each dimension is weighted to reflect its relative severity in practice:

  • Prompt Injection — Attempts to override agent instructions. A compromised instruction set undermines everything else.
  • Data Exfiltration — Unauthorized data transmission. If a skill is phoning home with your data, nothing else matters much.
  • Code Execution — Unauthorized process execution, filesystem access, and resource abuse. The classic “it’s running things it shouldn’t be running” category.
  • Clone Behavior — Modifying the host system beyond the skill’s own directory. Skills should stay in their lane.
  • Canary Integrity — Tamper detection via planted canary tokens. If a token gets read, modified, or transmitted, something touched data it had no business touching.
  • Behavioral Reasoning — Holistic AI judgment of overall behavioral patterns.

These dimensions roll up into a composite trust score. VendingBench results show behaviors like deception, data manipulation, and reward hacking emerging in frontier models when given autonomy — exactly the kind of risks these dimensions are designed to catch. Our audit of 1,620 agent skills validated the framework at scale: 88 skills flagged, with findings spanning all six dimensions, from credential harvesting to identity hijack.
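To make the rollup concrete, here is a minimal sketch of how weighted dimension scores could combine into one composite trust score. The weights and the `composite_trust_score` function are invented for illustration; Oathe's actual weighting is not published.

```python
# Illustrative only: these weights are assumptions for the sketch,
# not Oathe's actual (unpublished) weighting.
WEIGHTS = {
    "prompt_injection": 0.25,
    "data_exfiltration": 0.25,
    "code_execution": 0.20,
    "clone_behavior": 0.10,
    "canary_integrity": 0.10,
    "behavioral_reasoning": 0.10,
}

def composite_trust_score(dimension_scores: dict[str, float]) -> float:
    """Roll per-dimension risk scores (0 = clean, 100 = worst) into one trust score.

    Returns 100.0 for a fully clean skill and 0.0 for a maximally flagged one.
    """
    risk = sum(WEIGHTS[d] * dimension_scores.get(d, 0.0) for d in WEIGHTS)
    return round(100.0 - risk, 1)
```

The shape matters more than the numbers: a single bad dimension drags the composite down proportionally to its weight, so a skill cannot hide a high prompt-injection score behind five clean dimensions.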

The 10 T-Codes

Each threat code maps to a specific scoring dimension. Here’s the full taxonomy, grouped by what they target.

Data Exfiltration

T1.1 — Direct Exfiltration: Detects attempts to send data to external endpoints via HTTP, DNS, or raw sockets.

T1.5 — Credential Harvest: Detects access to credential and secret files.
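As a flavor of what these two codes cover, here is a toy pattern scan. Every pattern and the `scan_for_exfiltration` function are hypothetical examples for this post; Oathe's real detection is behavioral and proprietary, not a regex list.

```python
import re

# Hypothetical signatures for illustration; real scanners use far richer signals.
EXFIL_PATTERNS = [
    r"https?://[^\s\"']+",      # outbound HTTP(S) endpoints
    r"\bnslookup\b|\bdig\b",    # DNS exfiltration primitives
    r"socket\.connect",         # raw socket use
]
CREDENTIAL_PATHS = [
    r"\.aws/credentials",
    r"\.ssh/id_rsa",
    r"\.netrc",
]

def scan_for_exfiltration(source: str) -> list[str]:
    """Return the T-codes whose illustrative patterns match the skill source."""
    findings = []
    if any(re.search(p, source) for p in EXFIL_PATTERNS):
        findings.append("T1.1")
    if any(re.search(p, source) for p in CREDENTIAL_PATHS):
        findings.append("T1.5")
    return findings
```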

Code Execution

T2.1 — Filesystem Escape: Detects path traversal, symlink attacks, and access outside skill directory.

T2.2 — Process Spawning: Detects suspicious process execution and privilege escalation attempts.
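A toy containment check in the spirit of T2.1 might look like the sketch below. The `escapes_skill_dir` helper is ours, not Oathe's; it simply shows why resolving paths before comparing catches both `..` traversal and symlink tricks.

```python
from pathlib import Path

def escapes_skill_dir(skill_dir: str, requested: str) -> bool:
    """True if `requested` resolves outside the skill's own directory.

    Resolving symlinks and '..' segments before comparing catches both
    path traversal and symlink-based escapes (the T2.1 category).
    """
    root = Path(skill_dir).resolve()
    target = Path(skill_dir, requested).resolve()
    return root != target and root not in target.parents
```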

Prompt Injection

T3.1 — Prompt Injection: Detects prompt injection patterns in skill content.

T3.2 — Manifest Spoofing: Detects package.json install scripts and mismatched metadata.
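The T3.2 checks are mechanical enough to sketch. The `manifest_findings` function below is an illustrative stand-in, assuming the skill ships an npm-style package.json and separately declares a skill name somewhere (its listing or its own manifest).

```python
import json

# npm lifecycle hooks that run arbitrary code at install time.
INSTALL_HOOKS = {"preinstall", "install", "postinstall"}

def manifest_findings(package_json: str, declared_name: str) -> list[str]:
    """Illustrative T3.2 checks: install scripts and mismatched metadata.

    `declared_name` is what the skill's listing claims to be; a mismatch
    with package.json's own name field is a spoofing signal.
    """
    manifest = json.loads(package_json)
    findings = []
    hooks = INSTALL_HOOKS & set(manifest.get("scripts", {}))
    if hooks:
        findings.append(f"install scripts present: {sorted(hooks)}")
    if manifest.get("name") != declared_name:
        findings.append("manifest name does not match declared skill name")
    return findings
```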

Clone Behavior

T4.1 — File Drops: Detects files created outside the skill directory during install.

Resource Abuse (scored under Code Execution)

T5.1 — Cryptomining: Detects cryptocurrency mining activities or binaries.

T5.2 — Denial of Service: Detects fork bombs, infinite loops, disk fills, and resource exhaustion.
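These two categories have some famously recognizable signatures, sketched below. The patterns and the `resource_abuse_findings` function are toy examples we made up for this post; actual detection has to be behavioral, since string matching alone is trivially evaded.

```python
import re

# Toy signatures for illustration only.
MINING_PATTERNS = [
    r"stratum\+tcp://",   # mining-pool protocol URLs
    r"\bxmrig\b",         # a well-known miner binary name
]
DOS_PATTERNS = [
    r":\(\)\s*\{\s*:\|:&\s*\};:",   # the classic bash fork bomb
    r"while\s+True:\s*pass",        # trivial infinite busy loop
]

def resource_abuse_findings(source: str) -> list[str]:
    """Return the T-codes whose toy patterns match the skill source."""
    findings = []
    if any(re.search(p, source) for p in MINING_PATTERNS):
        findings.append("T5.1")
    if any(re.search(p, source) for p in DOS_PATTERNS):
        findings.append("T5.2")
    return findings
```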

Evasion (scored under Code Execution)

T6.1 — Environment Sensing: Detects environment fingerprinting, analysis evasion, and conditional behavior.
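Environment sensing is the "is anyone watching?" check a malicious skill makes before misbehaving. The signals and `environment_sensing_findings` below are hypothetical illustrations of the category, not Oathe's detection logic.

```python
import re

# Illustrative fingerprinting signals: checks a skill might make to decide
# whether it is being analyzed before revealing malicious behavior.
SENSING_PATTERNS = [
    r"/proc/cpuinfo",                  # hardware fingerprinting
    r"\bhypervisor\b",                 # VM detection
    r"os\.environ.*\b(CI|SANDBOX)\b",  # CI / sandbox environment checks
]

def environment_sensing_findings(source: str) -> bool:
    """True if the source shows any illustrative T6.1 fingerprinting signal."""
    return any(re.search(p, source) for p in SENSING_PATTERNS)
```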

What the OTC Tells You (and What It Doesn’t)

The taxonomy is deliberately open. It tells you what Oathe looks for — the categories, the threat surface, the risk dimensions. It gives researchers and developers a shared framework for discussing agent-level threats.

What it does not tell you is how Oathe detects these behaviors. Our detection logic is proprietary. The OTC is the what; the scanner is the how.

This is intentional. Publishing exact detection heuristics would hand adversaries a bypass guide. The taxonomy gives the community enough to build on without compromising the effectiveness of the tooling.

Why Open-Source the Taxonomy?

Because the threat landscape is bigger than any one company. We want security researchers to cite these T-codes in their work. We want agent framework developers to build around these categories. We want enterprises to use the OTC as a checklist when evaluating skills — whether through a pre-install check or a CI/CD gate.

A shared threat model makes the whole ecosystem safer. That’s the point.

The OTC is versioned and will evolve as we observe new attack patterns. If you’re seeing behaviors in the wild that don’t fit the current taxonomy, open an issue on the repo. We want this to be a community effort.

Get Involved

The OTC repo is live at github.com/oathe-ai/otc. Star it, fork it, cite it. If you’re building agent tooling or researching AI security, this is the starting point.

If your skill passes, add a trust badge to your README so users can see the score at a glance. You can also check any skill directly from your agent by running npx oathe-mcp.

And if you want to see these threat codes scored against a real skill — run an audit on Oathe. It takes under two minutes, and it’s free while we’re in beta.