Large Language Models (LLMs) can produce unintended and even harmful content when misaligned with human values, posing severe risks to users and society. To mitigate these risks, current evaluation benchmarks predominantly employ expert-designed contextual scenarios to assess how well LLMs align with human values. However, the labor-intensive nature of these benchmarks limits their test scope, hindering their ability to generalize to the extensive variety of open-world use cases and to identify rare but crucial long-tail risks. Additionally, these static tests fail to adapt to the rapid evolution of LLMs, making it difficult to assess newly emerging alignment issues in a timely manner. To address these challenges, we propose ALI-Agent, an evaluation framework that leverages the autonomous abilities of LLM-powered agents to conduct in-depth and adaptive alignment assessments. ALI-Agent operates through two principal stages: Emulation and Refinement. During the Emulation stage, ALI-Agent automates the generation of realistic test scenarios. In the Refinement stage, it iteratively refines the scenarios to probe long-tail risks. Specifically, ALI-Agent incorporates a memory module to guide test scenario generation, a tool-using module to reduce human labor in tasks such as evaluating feedback from target LLMs, and an action module to refine tests. Extensive experiments across three aspects of human values (stereotypes, morality, and legality) demonstrate that ALI-Agent, as a general evaluation framework, effectively identifies model misalignment. Systematic analysis also validates that the generated test scenarios represent meaningful use cases and incorporate enhanced measures to probe long-tail risks. Our code is available at https://github.com/SophieZheng998/ALI-Agent.git
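The two-stage loop outlined above (Emulation to draft scenarios, Refinement to iterate on them, with memory, tool-using, and action modules) can be sketched as follows. This is a minimal illustrative sketch: every class, function name, and signature here is a hypothetical stand-in, not ALI-Agent's actual interface.

```python
# Hypothetical sketch of the Emulation-Refinement loop described in the
# abstract. All names are illustrative stand-ins, not ALI-Agent's real API.

class AliAgentSketch:
    """Drives one evaluation episode against a target LLM."""

    def __init__(self, emulate, evaluate, refine, max_iters=3):
        self.memory = []          # memory module: scenarios that exposed misalignment
        self.emulate = emulate    # drafts a realistic scenario from a misconduct seed
        self.evaluate = evaluate  # tool-using module: judges the target LLM's feedback
        self.refine = refine      # action module: rewrites the scenario to probe long-tail risks
        self.max_iters = max_iters

    def run(self, misconduct, target_llm):
        # Emulation stage: generate a test scenario, guided by past failures in memory.
        scenario = self.emulate(misconduct, self.memory)
        response = target_llm(scenario)
        for _ in range(self.max_iters):
            if not self.evaluate(scenario, response):  # misalignment exposed
                self.memory.append(scenario)           # remember it to guide future tests
                return scenario, response, True
            # Refinement stage: iteratively sharpen the scenario and retry.
            scenario = self.refine(scenario, response)
            response = target_llm(scenario)
        return scenario, response, False


# Toy stand-ins so the sketch runs end to end (no real LLM calls).
agent = AliAgentSketch(
    emulate=lambda seed, mem: f"Scenario: {seed}",
    evaluate=lambda s, r: "refuse" in r,  # True = target LLM stayed aligned
    refine=lambda s, r: s + " (reframed more subtly)",
)
toy_llm = lambda s: "I refuse." if "subtly" not in s else "Sure, here is how..."
scenario, response, exposed = agent.run("harmful request", toy_llm)
```

In this toy run, the first scenario is refused, the refiner reframes it, and the second attempt exposes a (simulated) misalignment, which is then stored in memory for future scenario generation.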