Large language models (LLMs) are now deployed across a wide range of applications, often as autonomous agents that interact with one another in multi-agent systems. While such systems show promise for extending model capabilities and tackling complex tasks, they also raise significant ethical challenges. This position paper outlines a research agenda for ensuring the ethical behavior of multi-agent systems of LLMs (MALMs) from the perspective of mechanistic interpretability. We identify three key research challenges: (i) developing comprehensive evaluation frameworks that assess ethical behavior at the individual, interactional, and systemic levels; (ii) using mechanistic interpretability to elucidate the internal mechanisms that give rise to emergent behaviors; and (iii) applying targeted, parameter-efficient alignment techniques to steer MALMs toward ethical behavior without compromising their performance.