Conditional Semantic Textual Similarity (C-STS) measures the semantic proximity between text segments under a specific condition, thereby overcoming the ambiguity inherent in traditional STS. However, existing methods are largely confined to discriminative models, failing to fully leverage recent breakthroughs in the NLP community involving Large Language Models (LLMs) and Reinforcement Learning (RL). RL is a particularly well-suited paradigm for this task, as it can directly optimize the non-differentiable Spearman ranking metric and guide the reasoning process required by C-STS. Nevertheless, we find that naively applying listwise RL fails to produce meaningful improvements, as the model struggles with complex, coarse-grained reward signals, leading to optimization difficulties. To address this challenge, we introduce PoLi-RL, a novel Point-to-List Reinforcement Learning framework. PoLi-RL employs a two-stage curriculum: it first trains the model with a simple pointwise reward to establish fundamental scoring capabilities, then transitions to a hybrid reward that combines pointwise, pairwise, and listwise objectives to refine the model's ability to discern subtle semantic distinctions. Crucially, we propose an innovative Parallel Slice Ranking Reward (PSRR) mechanism that computes ranking rewards in parallel slices, where each slice consists of completions with the same index from different samples. This provides a precise, differentiated learning signal for each individual completion, enabling granular credit assignment and effective optimization. On the official C-STS benchmark, PoLi-RL achieves a Spearman correlation coefficient of 48.18, establishing a new SOTA for the cross-encoder architecture. As the first work to successfully apply RL to C-STS, our study introduces a powerful paradigm for aligning LLMs for complex, ranking-based conditional judgment tasks.
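The Parallel Slice Ranking Reward (PSRR) described above can be sketched as follows. This is a minimal illustrative implementation, assuming a batch of `B` samples with `G` completions each; the function names (`psrr`, `spearman`), the pure-Python rank computation, and the choice of assigning each slice's Spearman score as the shared reward of its completions are assumptions for exposition, not the paper's exact formulation.

```python
def ranks(values):
    # 1-based rank positions; ties broken by input order (sufficient for a sketch)
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = float(rank)
    return r

def spearman(xs, ys):
    # Spearman rho computed as the Pearson correlation of the rank vectors
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

def psrr(pred_scores, gold_scores):
    """pred_scores[b][g]: model's score for completion g of sample b.
    Slice g gathers the g-th completion of every sample in the batch and is
    ranked against the gold labels, so each completion of a given sample
    (one per slice) receives its own ranking-based reward."""
    B, G = len(pred_scores), len(pred_scores[0])
    rewards = [[0.0] * G for _ in range(B)]
    for g in range(G):
        slice_preds = [pred_scores[b][g] for b in range(B)]
        rho = spearman(slice_preds, gold_scores)
        for b in range(B):
            rewards[b][g] = rho  # per-slice reward, differentiated across slices
    return rewards
```

Because the g-th completions of all samples form one slice, two completions of the same sample fall in different slices and can receive different rewards, which is what enables the granular credit assignment the abstract refers to.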