Recent capability increases in large language models (LLMs) open up applications in which teams of communicating generative AI agents solve joint tasks. This poses privacy and security challenges concerning the unauthorised sharing of information, or other unwanted forms of agent coordination. Modern steganographic techniques could render such dynamics hard to detect. In this paper, we comprehensively formalise the problem of secret collusion in systems of generative AI agents by drawing on relevant concepts from both the AI and security literature. We study incentives for the use of steganography, and propose a variety of mitigation measures. Our investigations result in a model evaluation framework that systematically tests capabilities required for various forms of secret collusion. We provide extensive empirical results across a range of contemporary LLMs. While the steganographic capabilities of current models remain limited, GPT-4 displays a capability jump, suggesting the need for continuous monitoring of the steganographic capabilities of frontier models. We conclude by laying out a comprehensive research programme to mitigate future risks of collusion between generative AI models.