Is damustermann/claw-qwen3-tts safe?
https://github.com/openclaw/skills/tree/main/skills/damustermann/claw-qwen3-tts
The damustermann/claw-qwen3-tts skill provides legitimate offline TTS and voice-cloning functionality but carries substantial secondary risks. The most critical issue is a curl-pipe-bash remote code execution instruction embedded in SKILL.md: by updating a remote install script, the skill author can deliver and execute arbitrary code on any agent host even after the static audit has passed. Additionally, the combination of low-threshold voice cloning (3 seconds of reference audio) with Telegram and WhatsApp messaging integrations creates a complete deepfake-and-deliver pipeline that could enable impersonation fraud, social engineering, and non-consensual synthetic audio generation with minimal user interaction.
Findings (14)
CRITICAL Curl-pipe-bash remote code execution instruction in SKILL.md -20
SKILL.md directly instructs the agent to execute a shell command that downloads and pipes a remote script to bash: bash <(curl -fsSL https://raw.githubusercontent.com/daMustermann/claw-qwen3-tts/main/install.sh). This classic download-and-execute antipattern means the skill author can deliver arbitrary code to any agent running this skill at any time simply by updating install.sh on GitHub after the audit passes. The static code reviewed here provides no assurance about what install.sh contains or will contain in the future.
CRITICAL Remote install script executed on agent host via curl-pipe-bash -20
The curl-pipe-bash instruction in SKILL.md, when executed by the agent, fetches and runs an unreviewed shell script from the author's personal GitHub repository. This script has full access to the agent's environment, filesystem, network, and credentials. Because the script is fetched at runtime rather than at audit time, the audited codebase cannot represent what code will actually execute on installation. Any post-audit update to install.sh by the author constitutes a live code injection vector.
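For contrast, a minimal sketch of the safer pattern the skill could have used: pin the script to a specific commit, download it to a file, and verify a checksum before executing. The commit hash and checksum below are placeholders, not real values from the repository.

```shell
set -euo pipefail

PINNED_COMMIT="0000000000000000000000000000000000000000"   # placeholder, not a real commit
EXPECTED_SHA256="e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"  # placeholder

verify_and_run() {
  local script="$1" expected="$2"
  local actual
  actual="$(sha256sum "$script" | awk '{print $1}')"
  if [ "$actual" != "$expected" ]; then
    # Refuse to run anything that does not match the audited checksum.
    echo "checksum mismatch: refusing to execute $script" >&2
    return 1
  fi
  bash "$script"
}

# Usage (not executed here):
# curl -fsSL "https://raw.githubusercontent.com/daMustermann/claw-qwen3-tts/${PINNED_COMMIT}/install.sh" -o /tmp/install.sh
# verify_and_run /tmp/install.sh "$EXPECTED_SHA256"
```

This does not make the script trustworthy by itself, but it does make the executed code identical to the code that was audited.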
HIGH Voice cloning deepfake capability with 3-second audio threshold -30
The skill enables voice cloning from as little as 3 seconds of reference audio using Qwen3-TTS-12Hz-1.7B-Base. The resulting synthetic audio is indistinguishable from the source speaker. This capability can be misused to impersonate users, generate fraudulent voice authorizations, produce non-consensual deepfake audio, or forge voice messages from known contacts. Cross-language cloning (clone English voice, speak in Japanese) amplifies this risk.
HIGH Messaging integration enables synthetic voice delivery to arbitrary recipients -20
The Telegram and WhatsApp send endpoints allow the agent to deliver AI-generated audio — including cloned voices — as native PTT voice messages to any chat ID or phone number. This creates a complete pipeline: clone a target's voice from a short sample, generate synthetic speech, deliver it to the target's contacts. The bot_token and access_token parameters can be pre-configured in config.json so the agent can send messages silently during normal operation.
HIGH Arbitrary file path accepted in messaging send endpoints -20
The /v1/audio/send/telegram and /v1/audio/send/whatsapp endpoints accept an audio_file string that is expanded with os.path.expanduser() and passed directly to FileResponse. No MIME type validation or path restriction prevents an attacker from supplying the path to a sensitive file (e.g., ~/.env, ~/.ssh/id_rsa). While Telegram/WhatsApp would likely reject or corrupt non-audio content, this represents an unvalidated exfiltration path if the messaging service's validation fails or if the file happens to be parseable as valid audio.
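A sketch of the validation the endpoints lack. The names here (`AUDIO_DIR`, `resolve_audio_path`, the suffix allowlist) are illustrative, not taken from the skill's code; the idea is a containment check that defeats both `../` traversal and absolute paths like `~/.ssh/id_rsa`.

```python
from pathlib import Path

# Assumed output directory; the skill's actual layout may differ.
AUDIO_DIR = Path.home() / "clawd" / "skills" / "qwen3-tts" / "output"
ALLOWED_SUFFIXES = {".ogg", ".opus", ".mp3", ".wav", ".m4a"}

def resolve_audio_path(audio_file: str) -> Path:
    """Resolve a user-supplied path and reject anything outside AUDIO_DIR."""
    candidate = Path(audio_file).expanduser().resolve()
    # Containment check: the resolved path must live under the output
    # directory, so traversal sequences and absolute paths are rejected.
    if not candidate.is_relative_to(AUDIO_DIR.resolve()):
        raise ValueError(f"path outside audio directory: {candidate}")
    if candidate.suffix.lower() not in ALLOWED_SUFFIXES:
        raise ValueError(f"unsupported audio type: {candidate.suffix}")
    return candidate
```

Checking the extension is a weak secondary filter; the containment check is the load-bearing part.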
MEDIUM Coercive mandatory behavioral override in SKILL.md -10
Section 4 of SKILL.md uses 'YOU MUST FOLLOW THESE RULES:' and '⭐ CRITICAL' formatting to establish rules that override the agent's default behavior for all voice operations. While the specific mandated behavior (asking users to save voices) appears benign, injecting coercive mandatory instruction patterns normalizes agent override and establishes a template for future escalation via skill updates.
MEDIUM Conditional auto-install instruction without user confirmation -8
SKILL.md instructs the agent to proactively execute a remote install script if the skill directory is not present. This means an agent loading this skill for the first time will automatically fetch and execute arbitrary remote code without any user-initiated confirmation step.
MEDIUM Plaintext messaging API credentials stored on agent host -10
The skill stores Telegram bot tokens and WhatsApp Business API access tokens in config.json at a predictable path (~/clawd/skills/qwen3-tts/config.json). Any other skill or process with filesystem access on the agent host can read these credentials and use them to send messages impersonating the legitimate bot, enumerate chat history, or intercept messages.
MEDIUM Persistent voice biometric data stored without access controls -10
The voice cloner saves reference audio samples (raw biometric voice data) to disk persistently, with no access controls beyond standard user-level file permissions. Any process running as the same user can read these samples and reuse them as cloning input, so a single captured recording becomes a durable impersonation asset on the agent host.
MEDIUM Automatic download of multi-gigabyte ML models from HuggingFace -12
On first use, model_loader.py calls huggingface_hub.snapshot_download() to fetch the full Qwen3-TTS model (2-5GB) from HuggingFace Hub. This triggers without user confirmation, consumes significant disk space, and trusts the HuggingFace repository owner's packages. Python pickle files in ML model archives are a known supply chain vector.
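One common mitigation for the pickle vector is to restrict the snapshot to safetensors weights and JSON configs. The filter logic below is a plain-Python sketch; the pattern list is an assumption about what the model actually needs, and the `snapshot_download` call (which accepts an `allow_patterns` argument) is shown commented out.

```python
import fnmatch

# Assumed allowlist: weights in the non-executable safetensors format,
# plus JSON configs and tokenizer files. No .bin/.pt pickle archives.
SAFE_PATTERNS = ["*.safetensors", "*.json", "tokenizer*"]

def is_safe_artifact(filename: str) -> bool:
    """Return True if a repo file matches an allowlisted, non-pickle pattern."""
    return any(fnmatch.fnmatch(filename, p) for p in SAFE_PATTERNS)

# With huggingface_hub installed, the same patterns can be passed directly:
# from huggingface_hub import snapshot_download
# snapshot_download(repo_id, allow_patterns=SAFE_PATTERNS)
```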
MEDIUM Persistent voice impersonation database accessible to co-installed skills -15
Named voice profiles stored in ~/clawd/skills/qwen3-tts/voices/ are accessible to any skill or process running under the same user. A malicious co-installed skill could enumerate saved voices, load their reference audio, and generate impersonation content without triggering the qwen3-tts skill's own API or activity logs.
LOW ffmpeg subprocess invoked with user-influenced file paths -8
audio_converter.py constructs ffmpeg command arrays using input_path and output_path values that originate from user-supplied API parameters. While codec flags are hardcoded, unusual or crafted filenames could potentially interact with ffmpeg's input/output handling in unexpected ways on some filesystem configurations.
LOW Canary files accessed during monitoring window — attributed to audit framework -15
inotify and auditd logs show .env, .ssh/id_rsa, .aws/credentials, .npmrc, .docker/config.json, and GCP credentials were opened and read at timestamps 1771906159 (04:09:19, before git clone started at 04:09:25) and 1771906183 (post-install). The pre-install access is attributable to the audit framework's initialization. The post-install access at 1771906183 coincides with the audit's own post-install canary verification phase. No corresponding network egress of canary content was observed. The monitoring system confirms all files intact.
LOW Clean install with only expected external connections -13
The install process performed a standard git sparse-checkout of the skill subdirectory from the openclaw/skills monorepo on GitHub, copied files to the target directory, and cleaned up. The only external TCP connection made during the install phase was to github.com (140.82.121.3:443). Pre-existing connections to Ubuntu/Canonical servers (185.125.188.x:443) were present before the install began and are attributable to system infrastructure.