Is voice-wake-say safe?
https://clawhub.ai/xadenryan/voice-wake-say
voice-wake-say is a lightweight macOS text-to-speech (TTS) skill that uses the local `say` command to speak agent responses when voice input is detected. It contains no executable code, no external network references, and no data exfiltration vectors. The primary concerns are the shell execution surface via `say` (mitigated by safe `printf` piping) and a persistent behavioral override that requires scanning every message. Installation was clean, with no suspicious activity.
Findings (8)
MEDIUM Shell execution via `say` command -20
The skill instructs the agent to pipe text to the macOS say command via shell execution. While the printf '%s' pattern is safe against format string attacks, the agent is being directed to execute shell commands on every voice-triggered message. The spoken text content is derived from the agent's response, which is generally safe, but the shell execution surface exists.
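The piping pattern described above can be mirrored without a shell at all. The following is a minimal Python sketch, not the skill's own code: `speak` is a hypothetical helper that hands the text to the command on standard input, so shell metacharacters and `printf` format specifiers in the response are treated as data only. It assumes `say` reads from standard input when no message argument is given, as the skill's own pipe implies.

```python
import subprocess
from typing import Sequence


def speak(text: str, command: Sequence[str] = ("say",)) -> subprocess.CompletedProcess:
    """Send text to a TTS command on stdin without invoking a shell.

    `command` defaults to macOS `say`; it is injectable (hypothetical
    parameter) so the helper can be exercised on machines without `say`.
    """
    # An argument list with the default shell=False means the text is never
    # word-split, never expanded, and never used as a format string.
    return subprocess.run(
        list(command),
        input=text,
        text=True,
        capture_output=True,
        check=True,
    )
```

Calling `speak("Done.")` on macOS would speak the text; on other systems a stand-in command can be passed through `command`.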
MEDIUM Behavioral override: mandatory message scanning -15
The skill imposes a persistent behavioral rule requiring the agent to scan every incoming message against the trigger phrase 'User talked via voice recognition'. The emphatic 'CHECK EVERY MESSAGE INDIVIDUALLY' instruction and the 'Critical' section add processing overhead and behavioral constraints that go slightly beyond the minimum a TTS skill needs.
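As a rough illustration of what per-message scanning amounts to, here is a small Python sketch. The function names are hypothetical, and matching only at the start of the message is an assumption based on the phrase being described as prepended; the skill's actual matching rule is not shown in this report.

```python
TRIGGER_PHRASE = "User talked via voice recognition"


def is_voice_message(message: str) -> bool:
    """Return True when this single message carries the voice trigger phrase."""
    # Assumption: the phrase is prepended, so only the start is checked.
    return message.lstrip().startswith(TRIGGER_PHRASE)


def voice_flags(messages: list[str]) -> list[bool]:
    """Evaluate every message on its own; earlier results never carry over."""
    return [is_voice_message(m) for m in messages]
```

The point of the second function is that a typed follow-up after a voice message is not treated as voice input just because the previous message was.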
LOW Task execution order override -8
The skill mandates that the agent acknowledge with say BEFORE performing the requested task, and optionally speak again after. This alters the agent's natural execution order but is reasonable for a voice UX pattern.
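The ordering described above can be sketched as follows. This is an illustrative Python outline under assumed names (`handle_voice_request`, `acknowledge`, `summarize`), not the skill's implementation: the acknowledgement is spoken first, the task runs second, and the follow-up speech happens only if a summarizer is supplied.

```python
from typing import Callable, Optional


def handle_voice_request(
    acknowledge: str,
    task: Callable[[], str],
    speak: Callable[[str], None],
    summarize: Optional[Callable[[str], str]] = None,
) -> str:
    """Speak an acknowledgement, run the task, then optionally speak a summary."""
    speak(acknowledge)          # 1. acknowledge before any work starts
    result = task()             # 2. perform the requested task
    if summarize is not None:   # 3. optional follow-up speech
        speak(summarize(result))
    return result
```

Passing `speak` in as a callable keeps the ordering logic separate from the actual audio output.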
LOW Environment variable references as potential side channel -5
The skill references $SAY_VOICE and $SAY_RATE environment variables. If an attacker can control these variables, they could potentially influence the say command behavior, though the impact is limited to voice selection and speech rate.
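One way to keep these variables confined to voice and rate is to validate them before they reach the command line. The sketch below is a hedged Python example: `build_say_argv`, the rate bounds, and the voice-name pattern are all assumptions chosen for illustration. The `-v` and `-r` options are the `say` flags for voice and speech rate.

```python
import re
from typing import Mapping

# Assumed bounds in words per minute; a conservative choice, not a macOS limit.
MIN_RATE, MAX_RATE = 90, 400
RATE_PATTERN = re.compile(r"[0-9]{1,4}")
# Assumed shape of a voice name; it must start with a letter so that a
# value can never look like a command-line flag.
VOICE_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9 ()_-]{0,63}")


def build_say_argv(env: Mapping[str, str]) -> list[str]:
    """Build the argument list for `say`, dropping values that fail validation."""
    argv = ["say"]
    voice = env.get("SAY_VOICE")
    if voice and VOICE_PATTERN.fullmatch(voice):
        argv += ["-v", voice]
    rate = env.get("SAY_RATE")
    if rate and RATE_PATTERN.fullmatch(rate) and MIN_RATE <= int(rate) <= MAX_RATE:
        argv += ["-r", rate]
    return argv
```

In practice this would be called as `build_say_argv(os.environ)`; invalid values fall back to the system defaults instead of being passed through.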
LOW Agent response content spoken aloud in physical environment -10
The skill causes the agent's response to be spoken aloud via the device speakers. If the agent's response contains sensitive information (e.g., the user asked to read a password file), that information would be broadcast audibly. This is inherent to any TTS skill but worth noting as a physical environment side effect.
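A possible mitigation, which the skill itself is not reported to include, is to screen the response before it is spoken. The Python sketch below is illustrative only: the patterns, the `text_for_speech` helper, and the replacement notice are hypothetical, and a short pattern list like this would miss many kinds of sensitive content.

```python
import re

# Illustrative patterns only; a real deployment would need a broader list.
SENSITIVE_PATTERNS = [
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    re.compile(r"(?i)\b(password|passwd|secret|api[_-]?key)\b\s*[:=]"),
    re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
]

NOTICE = "The response contains sensitive content, so it is shown on screen only."


def text_for_speech(response: str) -> str:
    """Return the text to speak, substituting a notice for sensitive responses."""
    if any(p.search(response) for p in SENSITIVE_PATTERNS):
        return NOTICE
    return response
```

The full response would still be available on screen; only the audible channel is replaced by the notice.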
LOW Trigger phrase could be spoofed by upstream systems -15
The trigger phrase 'User talked via voice recognition' is expected to be prepended by a voice recognition system. A malicious upstream component or another skill could inject this phrase to force TTS output when the user didn't use voice input.
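One way to narrow this spoofing risk is to honor the phrase only when the message comes from a trusted channel. The following Python sketch assumes messages carry a `source` label, which this report does not describe; the `Message` type, the `speech-recognizer` label, and `should_speak` are hypothetical names used for illustration.

```python
from dataclasses import dataclass

TRIGGER_PHRASE = "User talked via voice recognition"
# Hypothetical channel labels; the audit does not say how messages are tagged.
TRUSTED_VOICE_SOURCES = frozenset({"speech-recognizer"})


@dataclass(frozen=True)
class Message:
    source: str  # which component produced the message (assumed metadata)
    text: str


def should_speak(message: Message) -> bool:
    """Speak only when the trigger phrase arrives from a trusted voice source."""
    has_trigger = message.text.lstrip().startswith(TRIGGER_PHRASE)
    return has_trigger and message.source in TRUSTED_VOICE_SOURCES
```

With this check, the same phrase injected by another skill or an untrusted upstream component would not trigger speech output.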
INFO Combination risk with other skills -10
If another skill injects content into the agent's response (via prompt injection or other means), this skill would faithfully speak that content aloud. The physical audio output adds an irreversible dimension to any content injection attack.
INFO Clean installation behavior 0
Installation produced no network activity, no unexpected processes, no filesystem changes outside the skill directory, and no blocked connections. The /tmp/jiti/ activity is from the installation framework, not the skill.