As recommender systems move toward agentic, multi-turn conversational interfaces, evaluation paradigms have struggled to keep pace. Current benchmarks often rely on "LLM-as-a-judge" evaluations, which introduce subjectivity, high cost, and inconsistency. We present $\tau$-Rec, a benchmark for agentic recommender systems that replaces subjective evaluation with verifiable rewards and introduces a reveal-tagged elicitation (RTE) mechanism that controls how task constraints surface during dialogue. By testing agents against structured catalog predicates and reporting a pass^k reliability metric, $\tau$-Rec provides a systematic test of consistent reasoning. Our evaluation of nine configurations across five model families (GPT-5.4, Claude Sonnet 4.6, Gemini 2.5 Flash, DeepSeek V4 Flash, Qwen3-32B, and GPT-5 mini) reveals a steep reliability cliff: even the best model achieves only ~57% at pass^1 and ~38% at pass^4, highlighting a critical gap in current conversational agent deployment. All code and data are publicly available at https://github.com/nbharaths/tau-rec.
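The abstract does not spell out how pass^k is computed. As a minimal sketch, assuming it follows the definition popularized by $\tau$-bench (the probability that all k independent trials of a task succeed, estimated per task from n trials with c successes as C(c, k) / C(n, k) and then averaged over tasks), the metric could be computed as below. The function names and the example counts are hypothetical and are not taken from the $\tau$-Rec codebase.

```python
from math import comb


def pass_hat_k(num_trials: int, num_successes: int, k: int) -> float:
    """Estimate the chance that k trials drawn without replacement all succeed.

    Assumed estimator (not confirmed by the source): C(c, k) / C(n, k),
    where n = num_trials and c = num_successes for a single task.
    """
    if not 0 < k <= num_trials:
        raise ValueError("k must satisfy 1 <= k <= num_trials")
    if not 0 <= num_successes <= num_trials:
        raise ValueError("num_successes must lie in [0, num_trials]")
    # math.comb returns 0 when k > num_successes, so the estimate is 0 there.
    return comb(num_successes, k) / comb(num_trials, k)


def benchmark_pass_hat_k(successes_per_task: list[int], num_trials: int, k: int) -> float:
    """Average the per-task estimate over all tasks in the benchmark."""
    if not successes_per_task:
        raise ValueError("at least one task is required")
    per_task = [pass_hat_k(num_trials, c, k) for c in successes_per_task]
    return sum(per_task) / len(per_task)


if __name__ == "__main__":
    # Hypothetical success counts for three tasks, four trials each.
    counts = [4, 2, 0]
    for k in (1, 2, 4):
        print(f"pass^{k} = {benchmark_pass_hat_k(counts, num_trials=4, k=k):.3f}")
```

Under this assumed definition, pass^1 equals the plain average success rate, and pass^k can only stay flat or fall as k grows, which is why a gap between pass^1 and pass^4 reads as a reliability gap rather than a capability gap.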