Do Chatbot LLMs Talk Too Much? The YapBench Benchmark

Large Language Models (LLMs) such as ChatGPT, Claude, and Gemini increasingly act as general-purpose copilots, yet they often respond with unnecessary length on simple requests, adding redundant explanations, hedging, or boilerplate that increases cognitive load and inflates token-based inference cost. Prior work suggests that preference-based post-training and LLM-judged evaluations can induce systematic length bias, where longer answers are rewarded even at comparable quality. We introduce YapBench, a lightweight benchmark for quantifying user-visible over-generation on brevity-ideal prompts. Each item consists of a single-turn prompt, a curated minimal-sufficient baseline answer, and a category label. Our primary metric, YapScore, measures excess response length beyond the baseline in characters, enabling comparisons across models without relying on any specific tokenizer. We summarize model performance via the YapIndex, a uniformly weighted average of category-level median YapScores. YapBench contains over three hundred English prompts spanning three common brevity-ideal settings: (A) minimal or ambiguous inputs where the ideal behavior is a short clarification, (B) closed-form factual questions with short stable answers, and (C) one-line coding tasks where a single command or snippet suffices. Evaluating 76 assistant LLMs, we observe an order-of-magnitude spread in median excess length and distinct category-specific failure modes, including vacuum-filling on ambiguous inputs and explanation or formatting overhead on one-line technical requests. We release the benchmark and maintain a live leaderboard for tracking verbosity behavior over time.
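
As a rough illustration of the two metrics described above, the following Python sketch computes a per-item YapScore and a YapIndex over a handful of items. It is a minimal sketch under stated assumptions, not the benchmark's reference implementation: the function names `yap_score` and `yap_index`, the `(category, response, baseline)` tuple layout, and the clamping of negative excess to zero are all hypothetical choices that the abstract does not specify. Length is measured with Python's `len`, which counts Unicode code points.

```python
from collections import defaultdict
from statistics import mean, median


def yap_score(response: str, baseline: str) -> int:
    """Excess response length beyond the baseline, in characters.

    Assumption: responses shorter than the baseline score 0 rather than
    a negative value; the abstract does not say how this case is handled.
    """
    return max(0, len(response) - len(baseline))


def yap_index(items):
    """Uniformly weighted average of category-level median YapScores.

    `items` is an iterable of (category, response, baseline) tuples;
    this layout is illustrative only.
    Returns (index, per-category medians).
    """
    scores_by_category = defaultdict(list)
    for category, response, baseline in items:
        scores_by_category[category].append(yap_score(response, baseline))

    # Median within each category, then an unweighted mean across
    # categories, so a large category cannot dominate the summary.
    medians = {c: median(s) for c, s in scores_by_category.items()}
    return mean(list(medians.values())), medians


if __name__ == "__main__":
    # Toy data, not taken from the benchmark.
    toy_items = [
        ("B", "The capital of France is Paris.", "Paris"),
        ("C", "You can list files with:\n\nls -la", "ls -la"),
    ]
    index, per_category = yap_index(toy_items)
    print(per_category, index)
```

Taking the median within each category keeps a few extremely long responses from dominating a category's score, and the unweighted mean across categories gives each of the three settings equal influence regardless of how many prompts it contains.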
