FT-Dojo: Towards Autonomous LLM Fine-Tuning with Language Agents

Fine-tuning large language models for vertical domains remains labor-intensive and expensive: domain experts must curate data, configure training, and iteratively diagnose model behavior. Despite growing interest in autonomous machine learning, no prior work has tackled end-to-end LLM fine-tuning with agents. Can LLM-based agents automate this complete process? We frame this as a substantially open problem: agents must navigate an open-ended search space that spans curating data from diverse sources, processing it with complex tools, building a training pipeline, and iteratively refining their approach based on evaluation outcomes in rapidly growing logs. The overall scenario is far more intricate than those covered by existing benchmarks. To study this question, we introduce FT-Dojo, an interactive environment comprising 13 tasks across 5 domains. We further develop FT-Agent, an autonomous system that mirrors human experts by using evaluation-driven feedback to iteratively diagnose failures and refine fine-tuning strategies. Experiments on FT-Dojo demonstrate that purpose-built fine-tuning agents significantly outperform general-purpose alternatives, with FT-Agent achieving the best performance on 10 out of 13 tasks across all five domains. Ablations show that the approach generalizes effectively to 3B models and yield additional insights on data scaling trade-offs and backbone sensitivity. Case analyses reveal that agents can recover from failures through cumulative learning from historical experience, but they also expose fundamental limitations in causal reasoning, highlighting both the promise and the current boundaries of autonomous LLM fine-tuning.
