Large language models (LLMs) often fail when answering requires identifying a small but decisive piece of evidence within a long or complex context, such as a single line in a tool trace or a subtle detail in an image. We propose ContextRL, a context-aware reinforcement learning (RL) method that improves long-horizon reasoning and multimodal performance through an \emph{indirect} auxiliary objective. Instead of supervising only the final answer, ContextRL presents the model with a query, an answer, and two highly similar contexts, and rewards it for selecting the context that supports the query--answer pair, thereby encouraging fine-grained grounding. We construct contrastive context data in two domains: for coding agents, trajectories serve as contexts, yielding 1K pairs built via condition filtering; for multimodal reasoning, images serve as contexts, yielding 7K pairs built via generative editing and similarity search. ContextRL achieves average gains of +2.2% over standard GRPO on 5 long-horizon benchmarks, and +1.8% across 12 diverse visual question answering benchmarks. To disentangle the effect of the proposed objective from that of additional data, we compare against data-augmentation baselines that repurpose the same contrastive contexts as standard query--context--answer examples. These baselines provide little to no improvement, showing that the gains arise from the proposed context-selection objective rather than from the contrastive data alone.
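
As a rough illustration of the context-selection objective described above, the following Python sketch shows one way a binary selection reward could be computed for text contexts. It is a minimal sketch under stated assumptions, not the authors' implementation: the names `ContrastivePair`, `build_prompt`, and `selection_reward`, the prompt wording, the `Choice: A/B` output format, and the 0/1 reward values are all hypothetical and do not come from the paper.

```python
import random
import re
from dataclasses import dataclass


@dataclass
class ContrastivePair:
    """Hypothetical container: one query-answer pair plus two similar contexts."""
    query: str
    answer: str
    supporting: str  # context that supports the query-answer pair
    distractor: str  # highly similar context that does not support it


def build_prompt(pair: ContrastivePair, rng: random.Random) -> tuple[str, str]:
    """Place the two contexts in random order; return (prompt, correct label)."""
    swap = rng.random() < 0.5  # randomize position so the label carries no signal
    if swap:
        first, second = pair.distractor, pair.supporting
    else:
        first, second = pair.supporting, pair.distractor
    correct = "B" if swap else "A"
    prompt = (
        f"Query: {pair.query}\nAnswer: {pair.answer}\n\n"
        f"Context A:\n{first}\n\nContext B:\n{second}\n\n"
        "Which context supports the answer? "
        "Reply with 'Choice: A' or 'Choice: B'."
    )
    return prompt, correct


def selection_reward(completion: str, correct: str) -> float:
    """Assumed 0/1 reward: 1.0 if the last stated choice matches the label."""
    matches = re.findall(r"Choice:\s*([AB])\b", completion)
    return 1.0 if matches and matches[-1] == correct else 0.0
```

In an RL loop, a scalar such as the one returned by `selection_reward` would be assigned to each sampled completion and passed to the policy-gradient update (GRPO in the paper's setup). How this auxiliary reward is mixed with the standard answer reward is not specified in the abstract and is left out of the sketch.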