High-privilege LLM agents that autonomously process external documentation are increasingly trusted to automate tasks by reading and executing project instructions, yet they are granted terminal access, filesystem control, and outbound network connectivity with minimal security oversight. We identify and systematically measure a fundamental vulnerability in this trust model, which we term the \emph{Trusted Executor Dilemma}: agents execute documentation-embedded instructions, including adversarial ones, at high rates because they cannot distinguish malicious directives from legitimate setup guidance. This vulnerability is a structural consequence of the instruction-following design paradigm, not an implementation bug. To structure our measurement, we formalize a three-dimensional taxonomy covering linguistic disguise, structural obfuscation, and semantic abstraction, and construct \textbf{ReadSecBench}, a benchmark of 500 real-world README files enabling reproducible evaluation. Experiments on a commercially deployed computer-use agent show end-to-end exfiltration success rates up to 85\%, consistent across five programming languages and three injection positions. Cross-model evaluation of four LLM families in a simulation environment confirms that semantic compliance with injected instructions persists across all families. A 15-participant user study yields a 0\% detection rate, and evaluation of 12 rule-based and 6 LLM-based defenses shows that neither category achieves reliable detection without unacceptable false-positive rates. Together, these results quantify a persistent \emph{Semantic-Safety Gap} between agents' functional compliance and their security awareness, establishing that documentation-embedded instruction injection is a currently unmitigated threat to high-privilege LLM agent deployments.