Large language models often perform poorly on low-resource languages, primarily because of inefficient subword segmentation and systemic training-data imbalances. In this paper, we propose Variable Entropy Policy Optimization (VEPO), which leverages Reinforcement Learning with Verifiable Rewards to incorporate deterministic structural constraints into the policy alignment process. The framework enforces prescribed sequence length, format consistency, and linguistic well-formedness during training. Central to our approach is a variable entropy mechanism that lets the model dynamically balance literal fidelity against semantic naturalness by modulating the exploration-exploitation trade-off. By integrating entropy-tempered advantage estimation with asymmetric clipping, VEPO sustains robust exploration while mitigating policy collapse. Empirical evaluations across 90 FLORES-200 translation directions, scored with COMET-22 and chrF, demonstrate that VEPO yields substantial improvements in both tokenization efficiency and translation quality, narrowing the performance gap for underrepresented languages.
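To make the combination of entropy-tempered advantages and asymmetric clipping more concrete, the following is a minimal sketch of one way such a surrogate objective could look. It is not the formulation used by VEPO: the function name, the tempering rule `adv * (1 + alpha * entropy)`, and the values of `eps_low`, `eps_high`, and `alpha` are all assumptions chosen for illustration.

```python
import math


def asymmetric_clipped_objective(logp_new, logp_old, advantages, entropies,
                                 eps_low=0.2, eps_high=0.28, alpha=0.5):
    """Average per-token clipped surrogate (hypothetical illustration).

    logp_new / logp_old: token log-probabilities under the current and
    behaviour policies. advantages: per-token advantage estimates.
    entropies: per-token policy entropies.
    All hyperparameter defaults here are assumptions, not reported values.
    """
    total = 0.0
    for lp_new, lp_old, adv, ent in zip(logp_new, logp_old, advantages, entropies):
        # Assumed tempering rule: amplify the advantage on high-entropy tokens.
        tempered = adv * (1.0 + alpha * ent)
        # Importance ratio between the current and behaviour policies.
        ratio = math.exp(lp_new - lp_old)
        # Asymmetric clip: the upper bound is looser than the lower bound,
        # which leaves more room to raise the probability of rewarded tokens.
        clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
        # Pessimistic (PPO-style) choice between raw and clipped terms.
        total += min(ratio * tempered, clipped * tempered)
    return total / len(logp_new)
```

In this sketch, choosing `eps_high` larger than `eps_low` is what makes the clipping asymmetric; setting them equal recovers the usual symmetric PPO clip, and setting `alpha` to zero removes the entropy tempering.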