Interactive data analysis, the collaboration between humans and Large Language Model (LLM) agents, enables real-time data exploration for informed decision-making. However, realistic interactive logs for data analysis are difficult and costly to collect, which hinders quantitative evaluation of LLM agents on this task. To mitigate this issue, we introduce Tapilot-Crossing, a new benchmark for evaluating LLM agents on interactive data analysis. Tapilot-Crossing contains 1024 interactions covering 4 practical scenarios: Normal, Action, Private, and Private Action. Notably, Tapilot-Crossing is constructed with an economical multi-agent environment, Decision Company, that requires little human effort. We evaluate popular and advanced LLM agents on Tapilot-Crossing, and the results underscore the challenges of interactive data analysis. Furthermore, we propose Adaptive Interaction Reflection (AIR), a self-generated reflection strategy that guides LLM agents to learn from successful history. Experiments demonstrate that AIR can evolve LLMs into effective interactive data analysis agents, achieving a relative performance improvement of up to 44.5%.