As natural language corpora expand at an unprecedented rate, manual annotation remains a significant methodological bottleneck in corpus linguistic work. We address this challenge by presenting a scalable, unsupervised pipeline for automating grammatical annotation in large corpora using large language models (LLMs). Unlike previous supervised and iterative approaches, our method employs a four-phase workflow: prompt engineering, pre-hoc evaluation, automated batch processing, and post-hoc validation. We demonstrate the pipeline's accessibility and effectiveness through a diachronic case study of variation in the English consider construction. Using GPT-5 through the OpenAI API, we annotate 143,933 sentences from the Corpus of Historical American English (COHA) in under 60 hours, achieving over 98% accuracy on two complex annotation tasks. Our results suggest that LLMs can perform a range of data preparation tasks at scale with minimal human intervention, opening new possibilities for corpus-based research, though implementation requires attention to costs, licensing, and other ethical considerations.