Fitting quantitative models to data is a central step in scientific workflows, yet it remains one of the least automated. Recent agent-based systems leverage language and vision-language models (VLMs) to iteratively propose and refine statistical models, but these systems struggle on more challenging modeling tasks. To address this limitation, we introduce VESTA (Visual Exploration with Statistical Tool Agents), a framework that equips VLMs with a dynamically growing exploration toolkit to guide model refinement through data transformations, hypothesis-driven visualizations, and robust statistical tests. Unlike prior systems that rely on iterative critique alone, VESTA actively explores data before and during refinement by selecting or creating diagnostic tools, which accumulate in the model's context and can be reused later. We evaluate VESTA against established baselines in three toolkit configurations: no tools, static expert-written tools, and dynamic model-written tools. To support this evaluation, we introduce DAWN (Dataset for Automated Workflows and Numerical Modeling), a benchmark targeting distribution fitting and time series modeling across difficulty tiers, culminating in real-world astronomy tasks including modeling initial mass functions and gravitational-wave chirp signals. We find that VESTA's dynamic tool creation outperforms prior agentic pipelines, with the largest gains on complex and domain-specific tasks. We further show that dynamically generated tools are substantially more sophisticated than those produced by existing visual tool-creation systems: they cover more diagnostic categories per function and strongly favor visual outputs that the VLM critic can reason over directly.