Your AI Travel Agent Would Book You a Bullfight: An Agentic Benchmark for Implicit Animal Welfare in Frontier AI Models

AI agents are moving from advisors to actors, booking travel, planning menus, and running procurement on behalf of users. Existing benchmarks for AI and animal welfare evaluate model text responses to question-answer prompts, leaving open whether the welfare reasoning surfaced in those responses transfers to agentic deployment where the model must take actions with tools. We introduce TAC (Travel Agent Compassion), the first agentic benchmark measuring whether AI agents avoid options involving animal exploitation when acting on behalf of users. TAC presents an AI agent with twelve hand-authored travel booking scenarios across six categories of animal exploitation, augmented to forty-eight samples to control for price, rating, and position confounds. We evaluate seven frontier models from four labs. Every model scores below the chance level of sixty-four percent, with the best performer (Claude Opus 4.7) at fifty-three percent. A single welfare-aware sentence in the system prompt yields gains of forty-seven to sixty-three percentage points in Claude and GPT-5.5, twenty-six points in GPT-5.2, and under twelve points in DeepSeek and Gemini. An auxiliary Inspect Scout audit of 288 base-condition transcripts from the top two performers, using Gemini 2.5 Flash Lite as judge, flags zero transcripts for evaluation awareness, suggesting the below-chance rates do not stem from the models recognising the evaluation. We discuss implications for category-level variation across cultural domains, the limits of text-response welfare benchmarks, and the EU General-Purpose AI Code of Practice systemic risk framework.

