Reward-based finetuning is crucial for aligning language policies with intended behaviors (e.g., creativity and safety). A key challenge here is to develop steerable language models that trade off multiple (conflicting) objectives in a flexible and efficient manner. This paper presents Conditioned Language Policy (CLP), a general framework for finetuning language models on multiple objectives. Building on techniques from multi-task training and parameter-efficient finetuning, CLP can learn steerable models that effectively trade off conflicting objectives at inference time. Notably, this does not require training or maintaining multiple models to achieve different trade-offs between the objectives. Through an extensive set of experiments and ablations, we show that the CLP framework learns steerable models that outperform and Pareto-dominate the current state-of-the-art approaches for multi-objective finetuning.
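The core mechanism can be illustrated with a minimal sketch, assuming linear scalarization of the per-objective rewards and a policy that accepts the weight vector as a conditioning input; the names `policy`, `reward_models`, and `generate(..., condition=w)` are hypothetical placeholders, not the paper's API:

```python
import torch

def scalarized_reward(rewards: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Linear scalarization: r(x, y; w) = sum_i w_i * r_i(x, y).

    `rewards` has one column per objective; `w` is a weight vector on the simplex.
    """
    return (w * rewards).sum(dim=-1)

def sample_training_batch(prompts, policy, reward_models, num_objectives):
    # Draw a random trade-off weight for each update (multi-task-style training),
    # so a single conditioned model covers the whole Pareto front at inference time.
    w = torch.distributions.Dirichlet(torch.ones(num_objectives)).sample()

    # Condition generation on w (e.g., via a learned embedding or a prompt prefix).
    # `policy.generate` with a `condition` argument is an illustrative assumption.
    responses = policy.generate(prompts, condition=w)

    # Score each response under every objective, then combine into one scalar reward
    # that a standard RL finetuning step can optimize.
    rewards = torch.stack([rm(prompts, responses) for rm in reward_models], dim=-1)
    return responses, scalarized_reward(rewards, w), w
```

At inference time, the user simply supplies a fixed `w` to select a point on the trade-off curve, rather than loading a separately trained model for each trade-off.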