Instruction-tuned LLMs can annotate thousands of instances at low cost. This raises two questions for active learning (AL): can LLM labels replace human labels within the AL loop, and does AL remain necessary when entire corpora can be cheaply labeled? We investigate both on a new dataset of 277,902 German political TikTok comments (25,974 LLM-labeled, 5,000 human-annotated), comparing LLM and human annotation across seven conditions, four encoders, and ten random seeds. Under a two-question interface that mirrors the human annotation task, classifiers trained on LLM annotations at scale outperform human-supervised classifiers at roughly one-tenth the annotation cost (\$28 for the GPT-5.2 Batch API vs. \$316 for Prolific). The advantage holds for both a closed-source (GPT-5.2) and an open-weight (Qwen3.5-122B-10B) LLM, is robust under soft-label evaluation, and is unlocked specifically by the two-question decomposition; a holistic single-prompt baseline merely ties with human supervision. AL provides no reliable advantage over random sampling under either LLM annotator. However, error structure varies sharply: only GPT-5.2 under the two-question interface produces classifiers with a near-human balance of false positives and false negatives (FP/FN), while other LLM variants over-flag discourse about border control and economic competition. We release the dataset and code.