As large language models are deployed in high-stakes enterprise applications, from healthcare to finance, ensuring adherence to organization-specific policies has become essential. Yet existing safety evaluations focus exclusively on universal harms. We present COMPASS (Company/Organization Policy Alignment Assessment), the first systematic framework for evaluating whether LLMs comply with organizational allowlist and denylist policies. We apply COMPASS to eight diverse industry scenarios, generating and validating 5,920 queries that test both routine compliance and adversarial robustness through strategically designed edge cases. Evaluating seven state-of-the-art models, we uncover a fundamental asymmetry: models reliably handle legitimate requests (>95% accuracy) but catastrophically fail at enforcing prohibitions, refusing only 13-40% of adversarial denylist violations. These results demonstrate that current LLMs lack the robustness required for policy-critical deployments, establishing COMPASS as an essential evaluation framework for organizational AI safety.