CanaryBench: Stress Testing Privacy Leakage in Cluster-Level Conversation Summaries

Aggregate analytics over conversational data are increasingly used for safety monitoring, governance, and product analysis in large language model systems. A common practice is to embed conversations, cluster them, and publish short textual summaries describing each cluster. Although the raw conversations may never be exposed, these derived summaries can still pose privacy risks if they contain personally identifying information (PII) or uniquely traceable strings copied from individual conversations. We introduce CanaryBench, a simple and reproducible stress test for privacy leakage in cluster-level conversation summaries. CanaryBench generates synthetic conversations with planted secret strings ("canaries") that simulate sensitive identifiers. Because the canaries are known a priori, any appearance of these strings in a published summary constitutes a measurable leak. Using TF-IDF embeddings and k-means clustering on 3,000 synthetic conversations (24 topics) with a canary injection rate of 0.60, we evaluate an intentionally extractive summarizer that publishes example snippets, modeling quote-like reporting. In this configuration, we observe canary leakage in 50 of 52 canary-containing clusters (cluster-level leakage rate 0.961538), along with nonzero regex-based PII indicator counts. A minimal defense that combines a minimum cluster-size publication threshold (k-min = 25) with regex-based redaction eliminates measured canary leakage and PII indicator hits in the reported run, while the cluster-coherence proxy remains similar. We position this work as a societal impacts contribution centered on privacy risk measurement for published analytics artifacts rather than raw user data.
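The measurement described in the abstract can be sketched in a few lines. The following is a minimal illustration only, not the CanaryBench implementation. It assumes clustering has already been done (the TF-IDF and k-means steps are omitted), and every name in it is hypothetical: the `CANARY-XXXXXXXX` canary format, the two PII regexes, the helpers `extractive_summary`, `redact`, and `leakage_rate`, and the toy data. It also assumes that the leakage-rate denominator is the number of canary-containing clusters whether or not their summaries are published.

```python
import re

# Hypothetical canary format; the source does not specify one.
CANARY_RE = re.compile(r"CANARY-[0-9A-F]{8}")
# Illustrative PII indicator patterns (assumptions, not the paper's regexes).
PII_RES = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),  # email-like strings
    re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),     # US-phone-like strings
]


def extractive_summary(conversations, n_snippets=2, max_chars=80):
    """Quote-like summarizer: copies the start of the first few conversations."""
    return " | ".join(c[:max_chars] for c in conversations[:n_snippets])


def redact(text):
    """Replace canary-shaped and PII-shaped substrings with a placeholder."""
    for pattern in [CANARY_RE, *PII_RES]:
        text = pattern.sub("[REDACTED]", text)
    return text


def leakage_rate(clusters, canaries, k_min=1, use_redaction=False):
    """Return (leaking_clusters, canary_clusters, rate).

    clusters maps a cluster id to its list of conversation strings.
    A cluster leaks if one of its planted canaries appears in its summary.
    """
    leaking = 0
    canary_clusters = 0
    for conversations in clusters.values():
        planted = {c for c in canaries if any(c in conv for conv in conversations)}
        if not planted:
            continue
        canary_clusters += 1
        if len(conversations) < k_min:
            continue  # below the size threshold: no summary is published
        summary = extractive_summary(conversations)
        if use_redaction:
            summary = redact(summary)
        if any(c in summary for c in planted):
            leaking += 1
    rate = leaking / canary_clusters if canary_clusters else 0.0
    return leaking, canary_clusters, rate


# Toy data: three tiny clusters, two of which contain a planted canary.
canaries = {"CANARY-0A1B2C3D", "CANARY-DEADBEEF"}
clusters = {
    0: ["my account id is CANARY-0A1B2C3D please reset", "how do I reset my password"],
    1: ["refund for order, email me at a@example.com", "refund status CANARY-DEADBEEF"],
    2: ["what is the weather today", "will it rain tomorrow"],
}

print("baseline:", leakage_rate(clusters, canaries))
print("defended:", leakage_rate(clusters, canaries, k_min=25, use_redaction=True))
```

On this toy input, both canary-containing clusters leak under the baseline and neither leaks under the defended configuration. The threshold and redaction are shown as independent switches so that each can be evaluated on its own. In a real run, the regexes would need to cover whatever identifier formats the canaries imitate.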
