Scoring Dimensions — Oathe Docs

The six security dimensions that make up the trust score and what each one evaluates.

Overview

The Oathe trust score (0-100) is a weighted composite of six security dimensions. Each dimension evaluates a distinct category of risk observed during the behavioral audit. Findings reduce individual dimension scores, and the weighted combination produces the final trust score.

Dimension Table

Dimension	API Field	Description
Prompt Injection	`prompt_injection`	Attempts to manipulate the host agent via injected instructions
Data Exfiltration	`data_exfiltration`	Unauthorized transmission of data to external endpoints
Code Execution	`code_execution`	Unexpected shell commands, eval calls, or subprocess spawning
Clone Behavior	`clone_behavior`	Behavior that modifies the host system beyond the skill’s directory
Canary Integrity	`canary_integrity`	Token tamper detection
Behavioral Reasoning	`behavioral_reasoning`	Holistic AI judgment of overall behavioral patterns

Each dimension starts at a perfect score of 100, and every finding attributed to it subtracts that finding's score_impact. Dimensions are weighted differently: some carry more influence on the final score than others.
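The arithmetic described above can be sketched roughly as follows. This is an illustration under stated assumptions, not Oathe's actual implementation: the weights in `ASSUMED_WEIGHTS` are invented for the example (the real weights are not listed on this page), the floor at 0 is assumed, and the composite is modeled as a normalized weighted mean, which is only one possible reading of "weighted composite".

```python
from typing import Dict, List

# Hypothetical weights, for illustration only. The real weights are not published here.
ASSUMED_WEIGHTS: Dict[str, float] = {
    "prompt_injection": 0.20,
    "data_exfiltration": 0.20,
    "code_execution": 0.20,
    "clone_behavior": 0.15,
    "canary_integrity": 0.10,
    "behavioral_reasoning": 0.15,
}


def dimension_score(score_impacts: List[float]) -> float:
    """Start at 100 and subtract each finding's score_impact.

    Flooring the result at 0 is an assumption, not documented behavior.
    """
    return max(0.0, 100.0 - sum(score_impacts))


def composite_score(category_scores: Dict[str, float],
                    weights: Dict[str, float]) -> float:
    """Normalized weighted mean of the dimension scores (assumed formula)."""
    total_weight = sum(weights[name] for name in category_scores)
    weighted_sum = sum(score * weights[name]
                       for name, score in category_scores.items())
    return weighted_sum / total_weight


# Made-up findings: two against prompt_injection, one against code_execution.
findings = {"prompt_injection": [15, 10], "code_execution": [15]}
category_scores = {
    name: dimension_score(findings.get(name, []))
    for name in ASSUMED_WEIGHTS
}

print(category_scores["prompt_injection"])  # 75.0
print(round(composite_score(category_scores, ASSUMED_WEIGHTS), 1))
```

A normalized weighted mean can never fall below the lowest dimension score, so the service's real formula may differ. Treat the returned trust_score as authoritative and use a sketch like this only to build intuition.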

Dimension Details

Prompt Injection

Oathe checks whether the skill’s tool descriptions, error messages, or runtime output contain instructions aimed at the host LLM. This includes patterns like “ignore previous instructions,” hidden system prompts embedded in tool metadata, and indirect injection via returned content. Mapped threat patterns: T3.1, T3.2.

Data Exfiltration

Evaluates whether the skill transmits data to external endpoints that are not part of its declared functionality. This covers HTTP/HTTPS requests, DNS-based channels, and encoded payloads. Mapped threat patterns: T1.1, T1.5.

Code Execution

Covers any attempt to execute arbitrary code outside the skill’s declared functionality. This includes spawning child processes, calling eval or exec, writing to temp directories and executing the output, and loading dynamic libraries. Mapped threat patterns: T2.1, T2.2, T5.1, T5.2, T6.1.

Clone Behavior

Detects behavior that modifies the host system beyond the skill’s own directory. This dimension catches file drops, unauthorized writes to system paths, and modifications outside the skill’s declared scope. Mapped threat patterns: T4.1.

Canary Integrity

Detects tampering with canary tokens. No T-code threat patterns are mapped to this dimension; it is scored entirely through AI behavioral analysis.

Behavioral Reasoning

A catch-all dimension where the AI model evaluates the skill’s overall behavior holistically. This covers edge cases and novel threats that do not fit neatly into the other five dimensions. There are no T-code threat patterns mapped here — the score is determined by AI judgment alone.
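The threat-pattern mappings listed in the sections above can be collected into a single lookup table, which may be handy when grouping findings by dimension. The T-codes and API field names come from this page. The helper names `dimension_for` and `group_by_dimension` are hypothetical, and the sketch assumes each pattern belongs to exactly one dimension.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional

# T-code -> API field, as listed under each dimension on this page.
# canary_integrity and behavioral_reasoning have no mapped T-codes.
THREAT_PATTERN_TO_DIMENSION: Dict[str, str] = {
    "T3.1": "prompt_injection",
    "T3.2": "prompt_injection",
    "T1.1": "data_exfiltration",
    "T1.5": "data_exfiltration",
    "T2.1": "code_execution",
    "T2.2": "code_execution",
    "T5.1": "code_execution",
    "T5.2": "code_execution",
    "T6.1": "code_execution",
    "T4.1": "clone_behavior",
}


def dimension_for(pattern: str) -> Optional[str]:
    """Return the dimension for a T-code, or None if the code is not listed."""
    return THREAT_PATTERN_TO_DIMENSION.get(pattern)


def group_by_dimension(patterns: Iterable[str]) -> Dict[str, List[str]]:
    """Group T-codes by dimension; unlisted codes go under 'unmapped'."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for pattern in patterns:
        grouped[dimension_for(pattern) or "unmapped"].append(pattern)
    return dict(grouped)


print(group_by_dimension(["T2.1", "T3.2", "T5.1", "T9.9"]))
```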

API Response

Per-dimension scores are returned in the audit report under category_scores:

{
  "trust_score": 72,
  "category_scores": {
    "prompt_injection": 75,
    "data_exfiltration": 100,
    "code_execution": 85,
    "clone_behavior": 100,
    "canary_integrity": 100,
    "behavioral_reasoning": 90
  }
}

The trust_score field is pre-computed — you do not need to calculate it yourself.
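As a minimal sketch of consuming this response, the snippet below parses the sample payload shown above and ranks the dimensions from weakest to strongest. It works on an inline string rather than a live request, since the endpoint and authentication details are not covered on this page. The variable names are illustrative.

```python
import json

# The sample audit report from this page, inlined for a self-contained example.
report_json = """
{
  "trust_score": 72,
  "category_scores": {
    "prompt_injection": 75,
    "data_exfiltration": 100,
    "code_execution": 85,
    "clone_behavior": 100,
    "canary_integrity": 100,
    "behavioral_reasoning": 90
  }
}
"""

report = json.loads(report_json)
scores = report["category_scores"]

# Sort dimensions by score, lowest first, to see where findings landed.
ranked = sorted(scores.items(), key=lambda item: item[1])
weakest_name, weakest_score = ranked[0]

# Any dimension below 100 had at least one finding attributed to it.
with_findings = [name for name, score in ranked if score < 100]

print(f"trust_score: {report['trust_score']}")
print(f"weakest dimension: {weakest_name} ({weakest_score})")
print(f"dimensions with findings: {with_findings}")
```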