KnowCoder-A1: Incentivizing Agentic Reasoning Capability with Outcome Supervision for KBQA

Knowledge Base Question Answering (KBQA) aims to answer natural-language questions over a structured Knowledge Base (KB). Recent work improves KBQA by adopting an agentic reasoning paradigm, in which Large Language Models (LLMs) iteratively decompose a question, generate the corresponding logical queries, and interact with the KB to derive the answer. However, these methods typically fine-tune LLMs on reasoning trajectories synthesized under process supervision, which gives the model little incentive to explore and therefore does not strengthen its agentic reasoning ability. In this paper, we propose KnowCoder-A1, an LLM that autonomously performs agentic reasoning over KBs to obtain answers. To incentivize autonomous exploration, KnowCoder-A1 trains the LLM under outcome-only supervision using multi-stage curriculum reinforcement learning (RL). To establish foundational agentic capabilities, KnowCoder-A1 first fine-tunes the LLM on a small set of high-quality trajectories obtained through outcome-based rejection sampling. Then, to alleviate the reward sparsity inherent in outcome-only supervision, it applies multi-stage curriculum RL with reward schedules that progress from easy to hard. Trained with outcome-only supervision, KnowCoder-A1 exhibits powerful reasoning behaviors and consistently outperforms prior approaches on three mainstream datasets. Notably, on the zero-shot subset of GrailQA, KnowCoder-A1 achieves up to an 11.1% relative improvement while using only one-twelfth of the training data, demonstrating strong agentic reasoning capabilities.
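
The two training ingredients named above, outcome-based rejection sampling and an easy-to-hard reward schedule, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the method's actual implementation: the `Trajectory` type, the `rollout` callable standing in for the LLM agent, the use of set-level F1 as the outcome signal, the acceptance threshold, and the two stage names are all hypothetical, since the abstract does not specify them.

```python
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass
class Trajectory:
    """One agent rollout: its interaction steps and final predicted answers (hypothetical type)."""
    steps: List[str]
    answers: Set[str]


def f1_score(predicted: Set[str], gold: Set[str]) -> float:
    """Set-level F1 between predicted and gold answers; 0.0 if either set is empty."""
    if not predicted or not gold:
        return 0.0
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def rejection_sample(
    rollout: Callable[[str], Trajectory],
    question: str,
    gold: Set[str],
    num_samples: int = 8,
    threshold: float = 1.0,
) -> List[Trajectory]:
    """Sample rollouts and keep only those whose final answers score >= threshold.

    Only the outcome is inspected; intermediate steps are never judged.
    The threshold and sample count are illustrative assumptions.
    """
    kept = []
    for _ in range(num_samples):
        trajectory = rollout(question)
        if f1_score(trajectory.answers, gold) >= threshold:
            kept.append(trajectory)
    return kept


def curriculum_reward(trajectory: Trajectory, gold: Set[str], stage: str) -> float:
    """Outcome-only reward whose strictness depends on the curriculum stage.

    Assumed schedule: the "easy" stage gives partial credit (F1), which is
    denser; any other stage rewards only an exact answer-set match.
    """
    score = f1_score(trajectory.answers, gold)
    if stage == "easy":
        return score
    return 1.0 if score == 1.0 else 0.0
```

In this sketch, the trajectories returned by `rejection_sample` would serve as the fine-tuning set, and `curriculum_reward` would be called by the RL loop with `stage` switched from `"easy"` to a stricter setting as training progresses. The lenient early stage is one plausible way to soften reward sparsity, not necessarily the schedule used by KnowCoder-A1.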
