Human-annotated preference data play an important role in aligning large language models (LLMs). In this paper, we study two connected questions: how to monitor the quality of human preference annotators and how to incentivize them to provide high-quality annotations. In current practice, expert-based monitoring is the natural workhorse for quality control, but it performs poorly in preference annotation because annotators are heterogeneous and downstream model performance is an indirect and noisy proxy for annotation quality. We therefore propose a self-consistency monitoring scheme tailored to preference annotation and analyze the statistical sample complexity of both methods. This practitioner-facing analysis identifies how many inspected samples are needed to reliably assess an annotator and shows when self-consistency monitoring can outperform expert-based monitoring. We then use the resulting monitoring signal as the performance measure in a principal-agent model, which lets us study a second sample-complexity question: how many monitored samples are needed before simple contracts perform close to the ideal benchmark in which annotation quality is perfectly observable. When the annotator's action space is continuous, we show that this performance gap scales as $\Theta(1/\sqrt{\mathcal{I} n \log n})$ for binary contracts and $\Theta(1/(\mathcal{I}n))$ for linear contracts, where $\mathcal{I}$ is the Fisher information and $n$ is the number of monitored samples; we further show that linear contracts are rate-optimal among general contracts. This contrasts with the known result that, when the action space is discrete, binary contracts are optimal and achieve a gap of $\exp(-\Theta(n))$ \citep{frick2023monitoring}.