Ensuring that large language models (LLMs) comply with safety requirements is a central challenge in AI deployment. Existing alignment approaches operate primarily during training, for example through fine-tuning or reinforcement learning from human feedback, but these methods are costly and inflexible, requiring retraining whenever new requirements arise. Recent efforts toward inference-time alignment mitigate some of these limitations but still assume access to model internals, which is often impractical and unsuitable for third-party stakeholders who lack such access. In this work, we propose a model-independent, black-box framework for safety alignment that requires neither retraining nor access to the underlying LLM architecture. As a proof of concept, we address the trade-off between generating safe but uninformative answers and helpful yet potentially risky ones. We formulate this dilemma as a two-player zero-sum game whose minimax equilibrium captures the optimal balance between safety and helpfulness. LLM agents operationalize the framework by invoking a linear programming solver at inference time to compute equilibrium strategies. Our results demonstrate the feasibility of black-box safety alignment, offering a scalable and accessible pathway for stakeholders, including smaller organizations and entities in resource-constrained settings, to enforce safety in rapidly evolving LLM ecosystems.
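To illustrate the kind of computation the framework relies on, the following minimal sketch (not the paper's implementation) shows how a minimax-optimal mixed strategy for a finite two-player zero-sum game can be obtained with an off-the-shelf linear programming solver. The payoff matrix, its labels (refuse vs. answer, benign vs. risky query), and all numerical values are hypothetical placeholders, not results from this work.

```python
# Sketch: row player's minimax strategy in a zero-sum game via linear programming.
# Assumptions: payoffs[i, j] is the row player's payoff; the row player maximizes
# the worst-case expected payoff over columns. Not the paper's actual pipeline.
import numpy as np
from scipy.optimize import linprog

def solve_zero_sum_game(payoffs: np.ndarray):
    """Return (mixed strategy over rows, game value) for the maximizing row player."""
    m, n = payoffs.shape
    # Decision variables z = [x_1, ..., x_m, v]; maximize v, i.e. minimize -v.
    c = np.zeros(m + 1)
    c[-1] = -1.0
    # For every column j: v - sum_i payoffs[i, j] * x_i <= 0
    A_ub = np.hstack([-payoffs.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    # Strategy probabilities sum to 1.
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0)] * m + [(None, None)]  # v is unbounded
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:m], res.x[-1]

# Hypothetical 2x2 payoff matrix: rows = {refuse, answer helpfully},
# columns = {benign query, risky query}; higher values = better trade-off score.
payoffs = np.array([[0.3, 0.9],
                    [1.0, 0.1]])
strategy, value = solve_zero_sum_game(payoffs)
print("row-player mixed strategy:", strategy, "game value:", value)
```

Under these toy payoffs the equilibrium mixes the two response policies (here, 0.6/0.4 with value 0.58), which is the balance between safety and helpfulness that the minimax formulation is meant to capture; the actual framework would populate the matrix from LLM-agent assessments rather than fixed constants.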