Network-Level Prompt and Trait Leakage in Local Research Agents

We show that Web and Research Agents (WRAs) -- language-model-based systems that investigate complex topics on the Internet -- are vulnerable to inference attacks by passive network observers. Organizations and individuals deploy WRAs \emph{locally} for privacy, legal, or financial reasons, yet local deployment exposes the agents' traffic to DNS resolvers, malicious ISPs, VPNs, web proxies, and corporate or government firewalls. Unlike sporadic human web browsing, WRAs visit $70{-}140$ domains per request with distinct timing patterns, which creates unique privacy risks. Specifically, we demonstrate a novel prompt and user trait leakage attack against WRAs that leverages only their network-level metadata (i.e., visited IP addresses and their timings). We start by building a new dataset of WRA traces based on real user search queries and queries generated by synthetic personas. We define a behavioral metric (called OBELS) to comprehensively assess the similarity between original and inferred prompts, and show that our attack recovers over 73\% of the functional and domain knowledge of user prompts. Extending to a multi-session setting, we recover up to 19 of 32 latent traits with high accuracy. Our attack remains effective under partial observability and noisy conditions. Finally, we discuss mitigation strategies that constrain domain diversity or obfuscate traces; these have negligible impact on utility while reducing attack effectiveness by 29\% on average.
