Scoring Dimensions
The six security dimensions that make up the trust score and what each one evaluates.
Overview
The Oathe trust score (0-100) is a weighted composite of six security dimensions. Each dimension evaluates a distinct category of risk observed during the behavioral audit. Findings reduce individual dimension scores, and the weighted combination produces the final trust score.
Dimension Table
| Dimension | API Field | Description |
|---|---|---|
| Prompt Injection | prompt_injection | Attempts to manipulate the host agent via injected instructions |
| Data Exfiltration | data_exfiltration | Unauthorized transmission of data to external endpoints |
| Code Execution | code_execution | Unexpected shell commands, eval calls, or subprocess spawning |
| Clone Behavior | clone_behavior | Behavior that modifies the host system beyond the skill’s directory |
| Canary Integrity | canary_integrity | Token tamper detection |
| Behavioral Reasoning | behavioral_reasoning | Holistic AI judgment of overall behavioral patterns |
Each dimension starts at a perfect score of 100 and is reduced by the score_impact of each finding attributed to it. Dimensions are weighted differently, so some carry more influence on the final score than others.
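The deduction and weighting described above can be sketched roughly as follows. This is an illustration only: the Finding class, the equal placeholder weights, and the floor at 0 are assumptions made for the example, since this page does not publish the actual weights or the exact combination formula. The output is therefore not expected to match a real trust_score.

```python
from dataclasses import dataclass

# API field names taken from the dimension table.
DIMENSIONS = [
    "prompt_injection", "data_exfiltration", "code_execution",
    "clone_behavior", "canary_integrity", "behavioral_reasoning",
]

@dataclass
class Finding:
    """Hypothetical finding record; only score_impact is named in the docs."""
    dimension: str     # API field of the dimension the finding belongs to
    score_impact: int  # points deducted (assumed to be a positive number)

def dimension_scores(findings):
    """Start every dimension at 100 and subtract each finding's score_impact."""
    scores = {d: 100 for d in DIMENSIONS}
    for f in findings:
        # Assumption: a dimension score never drops below 0.
        scores[f.dimension] = max(0, scores[f.dimension] - f.score_impact)
    return scores

def weighted_score(scores, weights):
    """Weighted average of the dimension scores, rounded to an integer."""
    total = sum(weights[d] for d in scores)
    return round(sum(scores[d] * weights[d] for d in scores) / total)

# Placeholder weights: equal influence for every dimension (NOT the real weights).
PLACEHOLDER_WEIGHTS = {d: 1.0 for d in DIMENSIONS}

findings = [
    Finding("prompt_injection", 25),
    Finding("code_execution", 15),
    Finding("behavioral_reasoning", 10),
]
scores = dimension_scores(findings)
composite = weighted_score(scores, PLACEHOLDER_WEIGHTS)
```

With these three hypothetical findings, prompt_injection ends at 75, code_execution at 85, and behavioral_reasoning at 90, while the untouched dimensions stay at 100.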
Dimension Details
Prompt Injection
Oathe checks whether the skill’s tool descriptions, error messages, or runtime output contain instructions aimed at the host LLM. This includes patterns like “ignore previous instructions,” hidden system prompts embedded in tool metadata, and indirect injection via returned content. Mapped threat patterns: T3.1, T3.2.
Data Exfiltration
Evaluates whether the skill transmits data to external endpoints that are not part of its declared functionality. This covers HTTP/HTTPS requests, DNS-based channels, and encoded payloads. Mapped threat patterns: T1.1, T1.5.
Code Execution
Covers any attempt to execute arbitrary code outside the skill’s declared functionality. This includes spawning child processes, calling eval or exec, writing to temp directories and executing the output, and loading dynamic libraries. Mapped threat patterns: T2.1, T2.2, T5.1, T5.2, T6.1.
Clone Behavior
Detects behavior that modifies the host system beyond the skill’s own directory. This dimension catches file drops, unauthorized writes to system paths, and modifications outside the skill’s declared scope. Mapped threat patterns: T4.1.
Canary Integrity
Evaluates token tamper detection. No T-code threat patterns are mapped to this dimension; it is scored entirely through AI behavioral analysis.
Behavioral Reasoning
A catch-all dimension in which the AI model evaluates the skill’s overall behavior holistically. It covers edge cases and novel threats that do not fit neatly into the other five dimensions. No T-code threat patterns are mapped here; the score is determined by AI judgment alone.
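The threat pattern mappings listed in the sections above can be collected into a single lookup table, which may help when grouping findings by dimension. The sketch below only restates the T-codes given on this page; the names THREAT_PATTERNS, PATTERN_TO_DIMENSION, and dimension_for are hypothetical and not part of the Oathe API.

```python
# T-code threat patterns per dimension, as listed in the dimension details.
THREAT_PATTERNS = {
    "prompt_injection": ["T3.1", "T3.2"],
    "data_exfiltration": ["T1.1", "T1.5"],
    "code_execution": ["T2.1", "T2.2", "T5.1", "T5.2", "T6.1"],
    "clone_behavior": ["T4.1"],
    "canary_integrity": [],       # no T-codes; scored by AI behavioral analysis
    "behavioral_reasoning": [],   # no T-codes; scored by AI judgment alone
}

# Reverse index: threat pattern code -> dimension API field.
PATTERN_TO_DIMENSION = {
    code: dimension
    for dimension, codes in THREAT_PATTERNS.items()
    for code in codes
}

def dimension_for(code):
    """Return the dimension a T-code maps to, or None if it is not listed here."""
    return PATTERN_TO_DIMENSION.get(code)
```

For example, dimension_for("T5.1") returns "code_execution", and a code that is not listed on this page returns None.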
API Response
Per-dimension scores are returned in the audit report under category_scores:
{
  "trust_score": 72,
  "category_scores": {
    "prompt_injection": 75,
    "data_exfiltration": 100,
    "code_execution": 85,
    "clone_behavior": 100,
    "canary_integrity": 100,
    "behavioral_reasoning": 90
  }
}
The trust_score field is pre-computed, so you do not need to calculate it yourself.
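As a minimal sketch of consuming this response, the snippet below parses the sample payload and lists the dimensions that lost points, lowest score first. The helper name flagged_dimensions and its threshold parameter are hypothetical, and the payload is inlined rather than fetched because this page does not describe the endpoint.

```python
import json

# Sample payload copied from the API Response example.
SAMPLE_RESPONSE = """
{
  "trust_score": 72,
  "category_scores": {
    "prompt_injection": 75,
    "data_exfiltration": 100,
    "code_execution": 85,
    "clone_behavior": 100,
    "canary_integrity": 100,
    "behavioral_reasoning": 90
  }
}
"""

def flagged_dimensions(report, threshold=100):
    """Return (dimension, score) pairs scoring below threshold, lowest first."""
    scores = report["category_scores"]
    below = [(name, score) for name, score in scores.items() if score < threshold]
    return sorted(below, key=lambda pair: pair[1])

report = json.loads(SAMPLE_RESPONSE)
trust_score = report["trust_score"]   # use the pre-computed value as-is
flagged = flagged_dimensions(report)

for name, score in flagged:
    print(f"{name}: {score}")
```

Lowering the threshold (for example to 80) narrows the list to the dimensions with the largest deductions.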