Recent capability increases in large language models (LLMs) open up applications in which groups of communicating generative AI agents solve joint tasks. This poses privacy and security challenges concerning the unauthorised sharing of information and other unwanted forms of agent coordination. Modern steganographic techniques could render such dynamics hard to detect. In this paper, we comprehensively formalise the problem of secret collusion in systems of generative AI agents by drawing on relevant concepts from both the AI and security literature. We study incentives for the use of steganography and propose a variety of mitigation measures. Our investigations result in a model evaluation framework that systematically tests the capabilities required for various forms of secret collusion. We provide extensive empirical results across a range of contemporary LLMs. While the steganographic capabilities of current models remain limited, GPT-4 displays a capability jump, suggesting the need for continuous monitoring of the steganographic capabilities of frontier models. We conclude by laying out a comprehensive research programme to mitigate future risks of collusion between generative AI models.