This paper proposes a control-based framework for aligning large language models (LLMs) that leverages a control barrier function (CBF) to ensure user-desirable text generation. The framework applies a CBF-based safety filter to the output of the baseline LLM, i.e., the token sequence, intervening in the generated text when necessary. The overall text-generation system is implemented with Llama 3 and a RoBERTa model, and the source code is available at https://github.com/Mya-Mya/CBF-LLM. Experiments demonstrate the framework's control ability and its effectiveness in reducing the number of interventions required for user-specified alignment tasks.
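The filtering idea can be illustrated with a minimal sketch. This is not the paper's implementation: `cbf_filter`, `score_fn`, and the candidate tokens are hypothetical stand-ins (in the paper's setup the scorer would be the RoBERTa classifier and the candidates would come from Llama 3), and the discrete-time CBF condition h(x_next) >= (1 - alpha) * h(x_current) is the standard form, assumed here for illustration.

```python
# Hedged sketch, NOT the paper's code: a CBF-style safety filter over
# candidate next tokens. score_fn is an illustrative stand-in for a
# learned safety classifier (a RoBERTa model in the paper's system).

def cbf_filter(candidates, h_current, score_fn, alpha=0.5):
    """Keep only candidates whose safety score satisfies the standard
    discrete-time CBF condition h(x_next) >= (1 - alpha) * h(x_current)."""
    threshold = (1.0 - alpha) * h_current
    allowed = [tok for tok in candidates if score_fn(tok) >= threshold]
    # Count an intervention whenever at least one candidate was rejected.
    intervened = len(allowed) < len(candidates)
    return allowed, intervened

# Toy usage with hard-coded scores standing in for classifier outputs.
scores = {"good": 0.9, "ok": 0.6, "bad": 0.1}
allowed, intervened = cbf_filter(
    ["good", "ok", "bad"], h_current=1.0, score_fn=scores.get, alpha=0.5
)
print(allowed, intervened)  # ['good', 'ok'] True
```

The filter leaves the baseline LLM's sampling untouched whenever all candidates already satisfy the barrier condition, which is consistent with the abstract's goal of keeping the number of interventions small.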