Is maorun/prompt-injection-guard safe?
https://github.com/openclaw/skills/tree/main/skills/maorun/prompt-injection-guard
The prompt-injection-guard skill installs cleanly: no malicious code, executable scripts, or suspicious network connections were found. Its risk lies entirely in its instructional content: an overly broad auto-trigger, unconditional output filtering that could suppress legitimate agent responses, and instructions to log user input without a defined, controlled destination. The skill also chains to three unreviewed companion skills, creating a latent supply-chain risk.
Findings (8)
HIGH Overly broad auto-trigger activates on common technical terms -22
The skill sets auto_trigger: true with trigger keywords 'プロンプト|prompt|インジェクション|injection|攻撃|attack' (the Japanese terms mean "prompt", "injection", and "attack"). The word 'prompt' appears constantly in legitimate developer conversations (e.g., 'prompt the user for input', 'system prompt design'). Once triggered, all output-filtering and input-logging rules apply to the entire conversation turn, so requests unrelated to prompt injection are filtered and logged as well.
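The breadth of the trigger can be illustrated with a minimal sketch. It assumes the host platform treats the keyword string as a case-insensitive regular expression searched anywhere in the message; the actual matching semantics are not documented in this report, and `would_trigger` and the sample messages are hypothetical.

```python
import re

# Keyword pattern quoted in the finding. Treating it as a case-insensitive
# regex searched anywhere in the message is an assumption about the platform.
TRIGGER = re.compile(
    r"プロンプト|prompt|インジェクション|injection|攻撃|attack", re.IGNORECASE
)

def would_trigger(message: str) -> bool:
    """Return True if any trigger keyword occurs anywhere in the message."""
    return TRIGGER.search(message) is not None

# Hypothetical benign developer messages, plus one unrelated control.
samples = [
    "Prompt the user for input before saving.",
    "Let's review the system prompt design.",
    "Use dependency injection for the database client.",
    "What time is the meeting?",
]

for msg in samples:
    print(would_trigger(msg), "|", msg)
```

Under these assumptions the first three messages activate the skill even though none of them concerns an attack; only the control message does not.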
HIGH Output filtering suppresses legitimate agent responses -10
The skill unconditionally instructs the agent to block any output containing passwords, API keys, seed phrases, or system prompt content, responding only with 'セキュリティ上の理由で表示できません' ("This cannot be displayed for security reasons"). In agents used as password managers, developer tools, or admin assistants, this blocks task completion while giving the user only a generic refusal with no indication of which rule fired, degrading agent utility and potentially presenting legitimate outputs as security violations.
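A rough sketch of how an unconditional filter of this kind behaves is given below. The skill expresses its rule in natural language, so the regular expressions in `SENSITIVE` and the `filter_output` helper are hypothetical stand-ins; only the refusal string comes from the finding.

```python
import re

# Refusal text quoted in the finding.
REFUSAL = "セキュリティ上の理由で表示できません"

# Hypothetical detectors. The skill gives natural-language rules, not regexes.
SENSITIVE = [
    re.compile(r"(?i)\bpassword\s*[:=]"),
    re.compile(r"(?i)\bapi[_-]?key\s*[:=]"),
    re.compile(r"(?i)\bseed phrase\b"),
]

def filter_output(response: str) -> str:
    """Replace the whole response with the refusal if any detector matches."""
    if any(pattern.search(response) for pattern in SENSITIVE):
        return REFUSAL
    return response

# A legitimate setup instruction containing only a placeholder, not a secret.
legit = 'Set API_KEY="<your key>" in config.toml, then restart the service.'
print(filter_output(legit))
print(filter_output("The build passed."))
```

Because the rule has no notion of context or user intent, the placeholder-only setup instruction is replaced wholesale by the refusal, which is the failure mode this finding describes.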
MEDIUM User input logging without controlled destination -24
The skill instructs the agent to maintain a log of detected attack patterns, portions of user input, operation results, and timestamps. No safe destination, access control, or retention limit is defined. If the agent has persistent logging capability or this log is accessible via the skill's companion infrastructure, user message content could be observed or extracted by the skill author.
MEDIUM Chains to three unreviewed companion skills -20
The skill explicitly lists human-security, security-review, and threat-model as companion skills for coordinated detection. None of the three was reviewed in this audit. If any of them is malicious, prompt-injection-guard, having established itself as a trusted security authority, could serve as a legitimate-looking entry point that amplifies or enables harm via skill-to-skill interaction.
MEDIUM Over-classification creates friction for legitimate requests -18
The skill classifies ロールプレイ (role-play), ふりをして (pretend), and 〇〇として振る舞って (behave as) as medium-risk patterns requiring explicit user confirmation before proceeding. These are common, benign requests in creative writing, education, and entertainment. The over-broad classification degrades agent usability and could lead the agent to present legitimate requests to the user as potential attacks.
MEDIUM Unresolved {AGENT_NAME} template variable -6
The section header '{AGENT_NAME}固有の防御' ("{AGENT_NAME}-specific defenses") contains an unresolved template placeholder. This suggests the skill was written for a specific agent deployment and was not sanitized before distribution. Depending on how the hosting platform resolves template variables at injection time, the placeholder could inadvertently interpolate agent identity or context information into the prompt.
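Leftover placeholders of this kind are easy to catch before distribution. The following is a minimal sketch, assuming placeholders follow the `{UPPER_SNAKE_CASE}` convention seen in the finding; the `unresolved_placeholders` helper is hypothetical and not part of the skill or the audit tooling.

```python
import re

# Matches {UPPER_SNAKE_CASE} placeholders such as {AGENT_NAME}.
# The naming convention is an assumption based on the one example found.
PLACEHOLDER = re.compile(r"\{([A-Z][A-Z0-9_]*)\}")

def unresolved_placeholders(skill_text: str) -> list[str]:
    """Return the distinct placeholder names still present in the text."""
    return sorted(set(PLACEHOLDER.findall(skill_text)))

# Section header quoted in the finding.
header = "{AGENT_NAME}固有の防御"
print(unresolved_placeholders(header))  # ['AGENT_NAME']
```

A check like this could run as a pre-publish lint step, failing when the returned list is non-empty.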
LOW Canary files read during audit window -12
Filesystem and auditd monitoring recorded OPEN/ACCESS/CLOSE_NOWRITE events against all six canary files (.env, .ssh/id_rsa, .aws/credentials, .npmrc, .docker/config.json, GCP credentials) at two points: 5 seconds before the git clone (04:16:44, likely the audit baseline) and 12 seconds after installation completed (Unix time 1771906621.790, which is 04:17:01 UTC; likely audit post-verification). The integrity check found no modification of the files, and no external exfiltration was observed. The temporal pattern is consistent with audit infrastructure activity rather than skill-induced access.
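The two timestamps in this finding use different notations, so a short conversion helps when comparing them. The sketch below converts the epoch value to UTC and measures the gap to the earlier reading; treating 04:16:44 as UTC on the same day is an assumption, since the report does not state its timezone.

```python
from datetime import datetime, timezone

# Epoch timestamp quoted in the finding (post-installation canary access).
post_install = datetime.fromtimestamp(1771906621.790, tz=timezone.utc)
print(post_install.strftime("%Y-%m-%d %H:%M:%S %Z"))

# Assumption: the earlier 04:16:44 reading is UTC on the same calendar day.
baseline = post_install.replace(hour=4, minute=16, second=44, microsecond=0)

# Seconds elapsed between the two canary access windows.
gap = (post_install - baseline).total_seconds()
print(f"gap between access windows: {gap:.2f} s")
```

Under that assumption the two access windows are roughly 18 seconds apart, which brackets the clone and installation closely.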
INFO Clean installation: GitHub-only network activity -12
All network traffic during installation was limited to DNS resolution via local resolver (10.0.2.3:53) and HTTPS to GitHub (140.82.121.4:443). No connections to attacker-controlled infrastructure, no unexpected process spawning beyond standard git operations, and no filesystem changes outside the designated skill directory were detected.
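A check of this kind can be expressed as an allowlist comparison. This is only a sketch: the two allowed endpoints are the ones named in the finding, while the `unexpected_connections` helper and the sample records are hypothetical and do not reflect the audit's actual tooling or log format.

```python
# Endpoints the finding reports as expected: local DNS resolver and GitHub HTTPS.
ALLOWED = {("10.0.2.3", 53), ("140.82.121.4", 443)}

def unexpected_connections(observed):
    """Return observed (ip, port) pairs that are not on the allowlist."""
    return [conn for conn in observed if conn not in ALLOWED]

# Hypothetical capture matching what the finding describes.
clean_capture = [("10.0.2.3", 53), ("140.82.121.4", 443), ("140.82.121.4", 443)]
print(unexpected_connections(clean_capture))  # []

# Hypothetical capture with one extra endpoint (203.0.113.0/24 is a
# documentation-only address range).
suspicious_capture = clean_capture + [("203.0.113.7", 443)]
print(unexpected_connections(suspicious_capture))  # [('203.0.113.7', 443)]
```

Pinning a single GitHub IP is brittle in practice because GitHub serves from multiple addresses, so a real check would more likely compare against published address ranges or resolved hostnames.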