Can low-cost large language models (LLMs) take over the interpretive coding work that still anchors much of empirical content analysis? This paper introduces ContentBench, a public benchmark suite that helps answer this replacement question by tracking how much agreement low-cost LLMs achieve, and at what cost, on the same interpretive coding tasks. The suite uses versioned tracks that invite researchers to contribute new benchmark datasets. I report results from the first track, ContentBench-ResearchTalk v1.0: 1,000 synthetic, social-media-style posts about academic research labeled into five categories spanning praise, critique, sarcasm, questions, and procedural remarks. Reference labels are assigned only when three state-of-the-art reasoning models (GPT-5, Gemini 2.5 Pro, and Claude Opus 4.1) agree unanimously, and all final labels are checked by the author as a quality-control audit. Among the 59 evaluated models, the best low-cost LLMs reach roughly 97-99% agreement with these jury labels, far above GPT-3.5 Turbo, the model behind early ChatGPT and the initial wave of LLM-based text annotation. Several top models can code 50,000 posts for only a few dollars, pushing large-scale interpretive coding from a labor bottleneck toward questions of validation, reporting, and governance. At the same time, small open-weight models that run locally still struggle on sarcasm-heavy items (for example, Llama 3.2 3B reaches only 4% agreement on hard-sarcasm items). ContentBench is released with data, documentation, and an interactive quiz at contentbench.github.io to support comparable evaluations over time and to invite community extensions.
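The unanimous-jury labeling rule described above can be sketched in a few lines. This is a minimal illustration, not the paper's actual pipeline: the function names, the category strings, and the representation of jury votes as plain strings are all assumptions for the sketch. Items without a unanimous verdict simply receive no reference label, and candidate models are scored only on the labeled items.

```python
# Hypothetical sketch of unanimous-jury labeling and agreement scoring.
# Category names and function names are illustrative, not from the paper's code.

CATEGORIES = {"praise", "critique", "sarcasm", "question", "procedural"}


def jury_label(votes):
    """Return the label if all three jury models agree, else None.

    votes: list of three category strings, one per jury model
    (e.g., GPT-5, Gemini 2.5 Pro, Claude Opus 4.1 in the paper's setup).
    """
    assert len(votes) == 3 and all(v in CATEGORIES for v in votes)
    return votes[0] if len(set(votes)) == 1 else None


def agreement_rate(predictions, reference):
    """Share of jury-labeled items on which a candidate model matches.

    Items whose reference label is None (no unanimous verdict) are skipped,
    mirroring the benchmark's exclusion of non-unanimous items.
    """
    scored = [(p, r) for p, r in zip(predictions, reference) if r is not None]
    if not scored:
        return 0.0
    return sum(p == r for p, r in scored) / len(scored)
```

Under this rule, a single dissenting jury vote drops the item from the reference set entirely, which trades coverage for label quality; the author's audit pass described above then serves as a second check on the surviving labels.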