Training tool-use agents typically relies on outcome-based filtering: Supervised Fine-Tuning (SFT) on successful trajectories and Reinforcement Learning (RL) on pass-rate-selected tasks. However, this paradigm ignores interaction dynamics: successful trajectories may lack error recovery or exhibit redundancy, while pass rates fail to distinguish structurally informative tasks from trivial ones. We propose \textbf{TopoCurate}, an interaction-aware framework that projects multi-trial rollouts from the same task into a unified semantic quotient topology. By merging semantically equivalent action-observation states, this projection transforms scattered linear trajectories into a structured manifold that explicitly captures how tool invocations and environmental responses drive the divergence between effective strategies and failure modes. Leveraging this representation, we introduce a dual-selection mechanism: for SFT, we prioritize trajectories that demonstrate reflective recovery, semantic efficiency, and strategic diversity, mitigating covariate shift and mode collapse; for RL, we select tasks with high error-branch ratios and strategic heterogeneity, maximizing the gradient signal-to-noise ratio to counter vanishing signals in sparse-reward settings. Evaluations on BFCLv3 and Tau2 Bench show that TopoCurate achieves consistent gains of 4.2\% (SFT) and 6.9\% (RL) over state-of-the-art baselines. We will release our code and data to support further research.