The process of creating training data to teach models is currently driven by humans, who manually analyze model weaknesses and plan how to create data that improves a student model. Approaches using LLMs as annotators reduce human effort, but still require humans to interpret feedback from evaluations and to steer the LLM toward producing the data the student needs. Automating this labor-intensive process by creating autonomous data generation agents, or teachers, is desirable, but requires environments that can simulate the feedback-driven, iterative, closed loop of data creation. To enable rapid, scalable testing of such agents and their modules, we introduce DataEnvGym, a testbed of teacher environments for data generation agents. DataEnvGym frames data generation as a sequential decision-making task: an agent consisting of a data generation policy (which generates a plan for creating training data) and a data generation engine (which transforms the plan into data) acts inside an environment that provides student feedback. The agent's goal is to improve student performance. Students are iteratively trained and evaluated on generated data, and their feedback (in the form of errors or weak skills) is reported to the agent after each iteration. DataEnvGym includes multiple teacher environment instantiations across 3 levels of structure in the state representation and action space. More structured environments are based on inferred skills and offer more interpretability and curriculum control. We support 4 domains (math, code, VQA, and tool-use) and test multiple students and teachers. Example agents in our teaching environments can iteratively improve students across tasks and settings. Moreover, we show that the environments can teach students at different skill levels and can be used to test variants of key modules, pointing to future work in improving data generation agents, engines, and feedback mechanisms.
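The closed loop described above (evaluate the student, plan data targeting weak skills, generate data, retrain) can be sketched as a toy simulation. All names here (`Student`, `make_plan`, `generate_data`, `teaching_loop`) are illustrative assumptions, not DataEnvGym's real API, and training is simulated by nudging the targeted skills' accuracy upward.

```python
class Student:
    """Stand-in student model with per-skill accuracy in [0, 1]."""

    def __init__(self, skills):
        self.skills = dict(skills)

    def evaluate(self):
        # Feedback reported to the agent: per-skill accuracy (weak skills).
        return dict(self.skills)

    def train(self, data):
        # Simulated fine-tuning: each training example nudges its skill upward.
        for skill in data:
            self.skills[skill] = min(1.0, self.skills[skill] + 0.05)


def make_plan(feedback, k=2):
    # Data generation policy: plan to create data for the k weakest skills.
    return sorted(feedback, key=feedback.get)[:k]


def generate_data(plan, per_skill=3):
    # Data generation engine: turn the plan into training examples
    # (here just skill labels; a real engine would emit full examples).
    return [skill for skill in plan for _ in range(per_skill)]


def teaching_loop(student, iterations=4):
    for _ in range(iterations):
        feedback = student.evaluate()   # environment reports student feedback
        plan = make_plan(feedback)      # policy decides which skills to target
        data = generate_data(plan)      # engine materializes the plan as data
        student.train(data)             # student retrains on the new data
    return student


if __name__ == "__main__":
    s = Student({"algebra": 0.3, "geometry": 0.6, "counting": 0.8})
    teaching_loop(s)
    print(s.evaluate())
```

In this sketch, the weakest skills receive the most generated data each round, so the student's skill profile evens out over iterations; a real agent would replace `make_plan` and `generate_data` with LLM-driven modules.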