LLM-based agents can complete tasks correctly yet still frustrate users through poor interaction patterns, such as excessive confirmations, opaque reasoning, or misaligned pacing. Current benchmarks evaluate task accuracy but overlook how agents interact: whether they infer preferences from implicit cues, adapt dynamically, or maintain fine-grained interaction quality. We introduce Prefix, a configurable environment that evaluates both what agents accomplish and how they interact. Central to Prefix is the Interaction-as-a-Tool (IaaT) paradigm, which treats interaction behaviors as structured tool calls, unifying them with existing evaluation frameworks. We define 31 preference settings across 14 attributes and formalize user experience (UX) as a core metric alongside task accuracy. A composite LLM-as-a-Judge mechanism across seven UX dimensions achieves strong aggregate reliability (ICC > 0.79), high internal consistency (alpha = 0.943), and human correlation (rho = 0.52-0.78). Preference-aware agents show 7.6% average UX improvement and 18.5% gain in preference alignment. Our work is openly accessible.
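The abstract's central mechanism, Interaction-as-a-Tool (IaaT), treats interaction behaviors as structured tool calls so they can be logged and scored by the same harness that evaluates task tool use. The sketch below illustrates that idea only; the tool names (`interact.confirm`, `shell.run`), the `interact.` prefix convention, and the routing helper are hypothetical, not the paper's actual API.

```python
# Hypothetical IaaT-style trace: interaction behaviors are emitted as
# structured tool calls alongside ordinary task tool calls, so a single
# trace can be split into UX events and task events for evaluation.
# All names here are illustrative assumptions.

ask_confirmation = {
    "name": "interact.confirm",      # interaction behavior as a tool call
    "arguments": {"message": "Delete 3 files in /tmp?"},
}

run_command = {
    "name": "shell.run",             # ordinary task tool call
    "arguments": {"cmd": "rm /tmp/a /tmp/b /tmp/c"},
}

def is_interaction(call: dict) -> bool:
    """Route a call to the UX judge vs. the task checker (assumed convention)."""
    return call["name"].startswith("interact.")

trace = [ask_confirmation, run_command]
ux_events = [c for c in trace if is_interaction(c)]
task_events = [c for c in trace if not is_interaction(c)]
```

Under this framing, an LLM-as-a-Judge can score `ux_events` (e.g. penalizing excessive confirmations) while a conventional checker scores `task_events`, which is how interaction quality becomes measurable without a separate evaluation pipeline.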