Is reflect-learn safe?

https://clawhub.ai/stevengonsalvez/reflect-learn

62
CAUTION

The reflect-learn skill is a meta-cognitive tool that permanently modifies agent system files (CLAUDE.md, agent definitions) based on conversation analysis. While it includes human-in-the-loop approval and presents itself as a learning tool, its core mechanism — writing to system instruction files and creating new skills — is functionally equivalent to persistent prompt injection. The unnecessary Bash tool permission and cross-skill poisoning vectors elevate the risk. No malicious intent detected, but the attack surface is significant.

Category Scores

Prompt Injection 45/100 · 30%
Data Exfiltration 80/100 · 25%
Code Execution 70/100 · 20%
Clone Behavior 90/100 · 10%
Canary Integrity 100/100 · 10%
Behavioral Reasoning 35/100 · 5%

Findings (9)

HIGH Persistent agent behavior modification via system file writes -35

The skill's core function is to modify CLAUDE.md and agent definition files under ~/.claude/agents/. These files serve as system-level instructions that govern agent behavior across all future sessions. While the skill frames this as 'learning from corrections', the mechanism is identical to persistent prompt injection — permanently altering how the agent interprets instructions. The agent_mappings.md file provides a detailed roadmap of exactly which files to target and which sections to modify.

HIGH Excessive tool permissions — Bash access unjustified -20

The skill requests access to the Bash tool in its allowed-tools list. A reflection/learning skill that analyzes conversations and edits text files has no legitimate need for shell command execution. Bash access enables arbitrary command execution, file manipulation beyond Edit/Write scope, and potential data exfiltration. This permission should be removed.

MEDIUM Self-replicating skill creation capability -15

The skill can create new skill files at .claude/skills/{name}/SKILL.md. This means the reflection skill can spawn additional skills that will be loaded into the agent's system prompt in future sessions. A malicious actor could craft conversation signals that cause the reflect skill to generate a harmful skill that persists independently.

MEDIUM Cross-skill poisoning vector -30

If another skill outputs text containing correction signal patterns (e.g., 'always', 'never', 'the rule is'), the reflect skill could interpret these as user corrections and propose encoding them as permanent rules. This creates a cross-skill attack vector where a malicious skill can inject persistent instructions through the reflect skill's signal detection.

MEDIUM Auto-reflect mode reduces human oversight -15

The 'reflect on' command enables automatic reflection at session boundaries and before context compaction. These are moments when user attention is lowest, weakening the human-in-the-loop safety guardrail that is the skill's primary defense against misuse.

MEDIUM Conversation content persisted to disk without sanitization -20

The skill stores 'Source Quote' excerpts from conversations in reflection files and learnings.yaml. Over time, this accumulates conversation fragments on disk that may contain sensitive information (API keys mentioned in chat, passwords, personal data). No sanitization or redaction mechanism is described.

LOW Sensitive file access during installation -10

During clone/install, the monitoring detected reads of .env, .aws/credentials, and .openclaw/openclaw.json. While likely attributable to the ClawHub installation framework rather than the skill itself, these accesses expose sensitive credentials to the installation process.

LOW Referenced script does not exist in repository -10

The signal_patterns.md file references 'signal_detector.py' with instructions to run it, but no such file exists in the repository. This is either incomplete documentation or could indicate that the skill expects users to create executable code based on its instructions.

INFO Skill operates as pure markdown instructions with no code guardrails -5

The entire skill is implemented as markdown instructions that rely entirely on the LLM's compliance. There are no programmatic guardrails, validation, or sandboxing. The 'safety guardrails' described are behavioral suggestions to the LLM, not enforced constraints. The LLM could deviate from these guardrails if prompted creatively.