Large language models (LLMs) can generate survey responses at low cost, but their reliability varies substantially across questions and is unknown before data collection. Deploying LLMs in surveys therefore still requires costly human responses for verification and correction. How should a limited human-labeling budget be allocated across questions in real time? We propose an adaptive allocation algorithm that learns which questions are hardest for the LLM while simultaneously collecting human responses. Each human label serves a dual role: it improves the estimate for that question and reveals how well the LLM predicts human responses on it. The algorithm directs more budget to questions where the LLM is least reliable, without requiring any prior knowledge of question-level LLM accuracy. We prove that the allocation gap relative to the best possible allocation vanishes as the budget grows, and we validate the approach on both synthetic data and a real survey dataset with 68 questions and more than 2,000 respondents. On the real survey data, the standard practice of allocating human labels uniformly across questions wastes 10--12% of the budget relative to the optimal allocation; our algorithm reduces this waste to 2--6%, and the advantage grows as questions become more heterogeneous in LLM prediction quality. The algorithm achieves the same estimation quality as traditional uniform sampling with fewer human samples, requires no pilot study, and is backed by formal performance guarantees. More broadly, the framework applies whenever scarce human oversight must be allocated across tasks where LLM reliability is unknown.
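
To make the allocation idea concrete, the following is a minimal sketch and not the algorithm analyzed in this work. It assumes a difference-style estimator (the mean LLM prediction over a large pool plus the mean human-minus-LLM residual on the labeled respondents) and a simple greedy rule that sends each new human label to the question whose estimated variance would drop the most. The names `make_sampler`, `adaptive_allocate`, `llm_pool_means`, and `warmup` are hypothetical, and the simulated data are purely illustrative.

```python
import random
import statistics


def make_sampler(residual_sds, true_means):
    """Hypothetical simulator: draw(k, rng) -> (human_response, llm_prediction)."""
    def draw(k, rng):
        human = rng.gauss(true_means[k], 1.0)
        # Assumption: the LLM prediction is the human response plus
        # zero-mean noise whose spread differs by question.
        llm = human + rng.gauss(0.0, residual_sds[k])
        return human, llm
    return draw


def adaptive_allocate(llm_pool_means, draw, budget, warmup=5, seed=0):
    """Greedy variance-driven allocation of human labels (illustrative only).

    llm_pool_means: per-question mean of LLM predictions over a large pool.
    draw: callable(k, rng) returning one (human, llm) pair for question k.
    budget: total number of human labels to collect.
    """
    num_questions = len(llm_pool_means)
    if warmup < 2 or budget < num_questions * warmup:
        raise ValueError("need warmup >= 2 and budget >= questions * warmup")

    rng = random.Random(seed)
    residuals = [[] for _ in range(num_questions)]

    def collect(k):
        human, llm = draw(k, rng)
        residuals[k].append(human - llm)

    # Warm-up: a few labels per question so a sample variance exists.
    for k in range(num_questions):
        for _ in range(warmup):
            collect(k)

    def gain(k):
        # Estimated variance drop from one more label:
        # s^2 / n - s^2 / (n + 1) = s^2 / (n * (n + 1)).
        n = len(residuals[k])
        return statistics.variance(residuals[k]) / (n * (n + 1))

    # Each label both refines the estimate and updates the variance estimate.
    for _ in range(budget - num_questions * warmup):
        collect(max(range(num_questions), key=gain))

    counts = [len(r) for r in residuals]
    estimates = [
        llm_pool_means[k] + statistics.fmean(residuals[k])
        for k in range(num_questions)
    ]
    return counts, estimates


if __name__ == "__main__":
    means = [2.0, 3.0, 4.0]
    sampler = make_sampler(residual_sds=[0.1, 1.0, 3.0], true_means=means)
    counts, estimates = adaptive_allocate(means, sampler, budget=300)
    print("labels per question:", counts)
    print("estimates:", estimates)
```

Under these assumptions, questions with noisier LLM predictions should end up with more human labels than under uniform allocation. A purely greedy rule like this one can under-sample a question whose early variance estimate happens to be too small, which is one reason a method with formal guarantees needs more care than this sketch takes.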