The rapid advancement of large language models (LLMs) calls for effective mechanisms that ensure responsible deployment by accurately distinguishing unsafe content from benign content. While substantial safety datasets are available in English, multilingual safety modeling remains underexplored because open-source safety datasets in other languages are limited. Even within English datasets, safe yet sensitive corner-case content is scarce, which leads to shortcut learning by models and non-trivial false-positive rates. To mitigate these issues, we introduce a novel minimax reinforcement learning (RL) framework in which a data generator and a classifier co-evolve, producing high-quality synthetic multilingual safety data. We formalize this interaction as a minimax game and prove convergence to a Nash equilibrium. Empirical evaluations confirm that our synthetic data generation method significantly improves classifier performance, enabling a substantially smaller model to surpass the state of the art by nearly 10% on English benchmarks while running inference 4.5x faster. These results establish a scalable and efficient methodology for synthetic data generation, advancing the development of safer and more robust multilingual LLM deployments.