AI agents are moving from advisors to actors: they book travel, plan menus, and run procurement on behalf of users. Existing benchmarks for AI and animal welfare evaluate model text responses to question-answer prompts, leaving open whether the welfare reasoning surfaced in those responses transfers to agentic deployment, where the model must take actions with tools. We introduce TAC (Travel Agent Compassion), the first agentic benchmark measuring whether AI agents avoid options involving animal exploitation when acting on behalf of users. TAC presents an AI agent with twelve hand-authored travel booking scenarios across six categories of animal exploitation, augmented to forty-eight samples to control for price, rating, and position confounds. We evaluate seven frontier models from four labs. Every model scores below the chance level of sixty-four percent, with the best performer (Claude Opus 4.7) at fifty-three percent. Adding a single welfare-aware sentence to the system prompt yields gains of forty-seven to sixty-three percentage points for Claude and GPT-5.5, twenty-six points for GPT-5.2, and under twelve points for DeepSeek and Gemini. An auxiliary Inspect Scout audit of 288 base-condition transcripts from the top two performers, using Gemini 2.5 Flash Lite as judge, flags zero transcripts for evaluation awareness, suggesting that the below-chance scores do not stem from the models recognising the evaluation. We discuss implications for category-level variation across cultural domains, the limits of text-response welfare benchmarks, and the EU General-Purpose AI Code of Practice systemic risk framework.