Is hiotec/paddleocr-doc-parsing-v2 safe?
https://github.com/openclaw/skills/tree/main/skills/hiotec/paddleocr-doc-parsing-v2
This skill is a legitimate PaddleOCR API wrapper with no prompt injection, no hardcoded malicious endpoints, and a clean installation. The primary security consideration is architectural: the scripts are designed to read arbitrary local files and upload their binary content to a user-configured external API endpoint, which creates an inherent data exfiltration surface if the agent is directed to process sensitive files or if the API URL is attacker-controlled. The async mode additionally introduces an unvalidated SSRF vector via API-server-provided result URLs.
Findings (7)
HIGH Arbitrary file upload to user-configured external endpoint -25
Both paddleocr_parse.sh and paddleocr_parse.py read any file path provided as an argument, base64-encode the raw bytes, and POST the encoded content to PADDLEOCR_API_URL. An agent using this skill to 'parse' a sensitive file (SSH keys, cloud credentials, .env files) would silently upload that file's content to the OCR API endpoint. If PADDLEOCR_API_URL is attacker-controlled or compromised, complete file exfiltration occurs.
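The read, encode, and upload pattern described above can be illustrated with a minimal sketch. It is not the skill's actual code: the function name `build_upload_body` and the `"file"` field are placeholders, and no network request is made. It only shows that the request body carries the complete file content in a trivially reversible form.

```python
import base64
import json
import tempfile
from pathlib import Path


def build_upload_body(file_path: str) -> str:
    """Return a JSON request body carrying the file's raw bytes as base64.

    Hypothetical helper: the "file" field name is a placeholder, not the
    real request schema of the OCR API.
    """
    raw = Path(file_path).read_bytes()  # no check on what the path points to
    encoded = base64.b64encode(raw).decode("ascii")
    return json.dumps({"file": encoded})


# Demonstration with a stand-in "secret" file: whoever receives the body
# can recover the original bytes exactly.
with tempfile.TemporaryDirectory() as tmp:
    secret = Path(tmp) / "fake_credentials"
    secret.write_bytes(b"aws_secret_access_key = EXAMPLE")
    body = build_upload_body(str(secret))
    recovered = base64.b64decode(json.loads(body)["file"])
```

Base64 is an encoding, not a protection: the receiving endpoint sees exactly what was on disk.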
MEDIUM Unvalidated SSRF via API-server-provided result URL in async mode -18
In async mode, the Python script extracts jsonl_url from the API server's JSON response and calls requests.get(jsonl_url) unconditionally, without validating the scheme, host, or path. A compromised or malicious API server could return a jsonl_url pointing to an internal metadata service (e.g., 169.254.169.254/latest/meta-data/) or another internal HTTP endpoint, so the request is issued from inside the agent's network and may expose internal data to the attacker. A file:// URI is less of a concern in practice: requests ships no transport adapter for that scheme by default, so such a URL would raise an error rather than read a local file. The realistic exposure is HTTP(S) requests to internal hosts.
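One possible mitigation is to validate the server-provided URL before fetching it. The sketch below is an assumption about how such a check could look, not part of the skill: `is_safe_result_url` and the host `results.example.com` are hypothetical. It accepts only https URLs whose host is on an explicit allowlist and rejects private, loopback, and link-local IP literals. It does not resolve DNS, so it does not protect against a hostname that resolves to an internal address.

```python
import ipaddress
from typing import Iterable
from urllib.parse import urlsplit


def is_safe_result_url(url: str, allowed_hosts: Iterable[str]) -> bool:
    """Return True only for https URLs whose host is explicitly allowed."""
    try:
        parts = urlsplit(url)
        host = parts.hostname  # lowercased by urlsplit; None if absent
    except ValueError:
        return False  # malformed URL, e.g. a broken IPv6 literal
    if parts.scheme != "https" or host is None:
        return False

    # Reject internal IP literals even if someone allowlisted them.
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        ip = None  # host is a name, not an IP literal
    if ip is not None and (ip.is_private or ip.is_loopback or ip.is_link_local):
        return False

    return host in {h.lower() for h in allowed_hosts}


allowed = {"results.example.com"}  # hypothetical result host
ok = is_safe_result_url("https://results.example.com/job/1.jsonl", allowed)
blocked = is_safe_result_url("http://169.254.169.254/latest/meta-data/", allowed)
```

A caller would run this check on jsonl_url and refuse the download when it returns False.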
MEDIUM No file path restrictions or allowlist in parsing scripts -10
The scripts accept any filesystem path as input with no validation against sensitive path prefixes. There is no check preventing paths like ~/.ssh/id_rsa, ~/.aws/credentials, /etc/passwd, or other sensitive system files. An agent operating with this skill could be directed (via crafted user requests or prompt injection in documents) to 'OCR' sensitive files.
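A path restriction of the kind this finding says is missing could be sketched as follows. This is an illustrative assumption, not the skill's code: `resolve_inside` is a hypothetical helper that confines input files to one designated directory and rejects everything else, including paths that escape via `..` or symlinks.

```python
import tempfile
from pathlib import Path


def resolve_inside(root: str, candidate: str) -> Path:
    """Resolve candidate and return it only if it lies under root.

    Raises PermissionError for anything outside the allowed directory.
    """
    root_path = Path(root).resolve()
    # resolve() collapses ".." segments and follows symlinks first,
    # so the containment check sees the real target.
    target = Path(candidate).expanduser().resolve()
    try:
        target.relative_to(root_path)
    except ValueError:
        raise PermissionError(f"{target} is outside {root_path}") from None
    return target


with tempfile.TemporaryDirectory() as docs_dir:
    scan = Path(docs_dir) / "invoice.pdf"
    scan.write_bytes(b"%PDF-1.4")
    allowed_path = resolve_inside(docs_dir, str(scan))
    try:
        resolve_inside(docs_dir, "/etc/passwd")
        rejected = False
    except PermissionError:
        rejected = True
```

An allowlisted directory is generally easier to reason about than a denylist of sensitive prefixes, which is hard to make complete.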
LOW Environment variable pivoting risk when combined with env-setting skills -8
The skill's behavior is entirely determined by PADDLEOCR_API_URL, PADDLEOCR_ACCESS_TOKEN, and PADDLEOCR_JOB_URL. If another installed skill or user instruction can set environment variables, an attacker could redirect PADDLEOCR_API_URL to an attacker-controlled server before triggering OCR on sensitive files, achieving silent credential-free exfiltration.
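One way to reduce this risk is to pin the API host so that a changed environment variable fails loudly instead of silently redirecting uploads. The sketch below is an assumption, not something the skill does: `pinned_api_url` and the host `ocr.example.com` are hypothetical, and the expected host would have to be configured somewhere the environment cannot override.

```python
from typing import Mapping
from urllib.parse import urlsplit

# Hypothetical pinned host; in practice this would come from a trusted source
# that other skills cannot modify.
EXPECTED_API_HOSTS = frozenset({"ocr.example.com"})


def pinned_api_url(env: Mapping[str, str]) -> str:
    """Return PADDLEOCR_API_URL only if it is https and points at a pinned host."""
    url = env.get("PADDLEOCR_API_URL", "")
    parts = urlsplit(url)
    if parts.scheme != "https" or parts.hostname not in EXPECTED_API_HOSTS:
        raise ValueError(f"refusing unexpected PADDLEOCR_API_URL: {url!r}")
    return url


trusted = pinned_api_url({"PADDLEOCR_API_URL": "https://ocr.example.com/v1/parse"})
```

In real use the mapping passed in would be `os.environ`; a plain dict is used here so the check can be exercised without touching the process environment.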
LOW External documentation URLs could trigger agent web fetches -5
SKILL.md contains live external URLs in a Resource Links table (paddleocr.com, ai.baidu.com, github.com/PaddlePaddle). An agent that aggressively follows documentation links during skill setup could be directed to fetch these URLs, potentially loading content from Baidu-controlled infrastructure. No explicit fetch instruction is present.
INFO Clean sparse-checkout install with expected network activity 0
Installation performed a minimal git sparse-checkout of only the skill subdirectory, copied files to the skill directory, and cleaned up the temporary clone. No unexpected processes, no persistent connections, no filesystem writes outside the skill directory.
INFO All honeypot files intact; canary accesses attributed to audit framework 0
Simultaneous access to all 6 canary files at identical timestamp 1771924105.463 is consistent with the Oathe post-install integrity scanner performing its batch stat check, not skill activity. No accesses during the install window (1771924093–1771924104). Monitoring confirms all canary files unmodified.
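The timeline reasoning can be checked with simple arithmetic on the reported timestamps. The sketch below uses only the values quoted in this finding and assumes the install window bounds are inclusive whole seconds; the helper name `in_window` is illustrative.

```python
# Values quoted in the finding (Unix epoch seconds).
INSTALL_WINDOW = (1771924093.0, 1771924104.0)
CANARY_ACCESS = 1771924105.463


def in_window(ts: float, window: tuple) -> bool:
    """True if ts falls within the inclusive [start, end] window."""
    start, end = window
    return start <= ts <= end


accessed_during_install = in_window(CANARY_ACCESS, INSTALL_WINDOW)
seconds_after_install = CANARY_ACCESS - INSTALL_WINDOW[1]

print(accessed_during_install)          # False
print(round(seconds_after_install, 3))  # 1.463
```

The canary access falls about 1.5 seconds after the install window closed, which fits the attribution to a post-install scan rather than to the skill's own install activity.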