Current preference learning methods achieve high accuracy on standard benchmarks but exhibit significant performance degradation when objective quality signals are removed. We introduce WritingPreferenceBench, a dataset of 1,800 human-annotated preference pairs (1,200 English, 600 Chinese) across 8 creative writing genres, where responses are matched for objective correctness, factual accuracy, and length. On this benchmark, sequence-based reward models (the standard architecture for RLHF) achieve only 52.7% mean accuracy, while zero-shot language model judges perform at 53.9%. In contrast, generative reward models that produce explicit reasoning chains achieve 81.8% accuracy. We observe high within-model variance across genres: individual models range from 18.2% to 81.8% accuracy across different writing categories, with standard deviations averaging 10.1%. This variance persists regardless of model scale, with 27B-parameter models showing no consistent improvement over 8B variants. Our results suggest that current RLHF methods primarily learn to detect objective errors rather than capture subjective quality preferences (e.g., creativity, stylistic flair, and emotional resonance), and that successful preference modeling may require intermediate reasoning representations rather than direct classification.
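The evaluation metric underlying these numbers is standard pairwise accuracy: a reward model is credited when it assigns the human-preferred response a higher score than the rejected one. The sketch below is a minimal illustration of this metric, not the paper's actual evaluation code; the `score` argument stands in for any scalar reward model, and the toy length-based scorer is hypothetical, chosen to show why length-matched pairs remove a common shortcut signal.

```python
from statistics import mean


def pairwise_accuracy(pairs, score):
    """Fraction of (chosen, rejected) pairs where the human-preferred
    response receives the higher scalar score."""
    return mean(
        1.0 if score(chosen) > score(rejected) else 0.0
        for chosen, rejected in pairs
    )


# Toy scorer: response length as a (bad) proxy for quality. On pairs
# that are NOT length-matched, such shortcuts can inflate accuracy;
# matching for length forces models to rely on subjective quality.
toy_score = len

pairs = [
    ("a longer, more elaborate reply", "meh"),   # length agrees with preference
    ("short", "a longer but rejected reply"),    # length disagrees
]

acc = pairwise_accuracy(pairs, toy_score)
print(acc)  # the length heuristic gets exactly one of the two pairs right
```

Because each pair contributes 0 or 1, per-genre accuracies (and hence the reported cross-genre standard deviations) follow directly by grouping pairs by genre before averaging.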