As natural language corpora expand at an unprecedented rate, manual annotation remains a significant methodological bottleneck in corpus linguistic work. We address this challenge by presenting a scalable pipeline for automating grammatical annotation in voluminous corpora using large language models (LLMs). Unlike previous supervised and iterative approaches, our method employs a four-phase workflow: prompt engineering, pre-hoc evaluation, automated batch processing, and post-hoc validation. We demonstrate the pipeline's accessibility and effectiveness through a diachronic case study of variation in the English evaluative consider construction (consider X as/to be/zero Y). We annotate 143,933 'consider' concordance lines from the Corpus of Historical American English (COHA) via the OpenAI API in under 60 hours, achieving over 98% accuracy on two sophisticated annotation procedures. A Bayesian multinomial GAM fitted to 44,527 true positives of the evaluative construction reveals previously undocumented genre-specific trajectories of change, enabling us to advance new hypotheses about the relationship between register formality and competing pressures of morphosyntactic reduction and enhancement. Our results suggest that LLMs can perform a range of data preparation tasks at scale with minimal human intervention, unlocking substantive research questions previously beyond practical reach, though implementation requires attention to costs, licensing, and other ethical considerations.
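The batch-annotation phase described above can be sketched in miniature as follows. This is a hypothetical illustration, not the authors' actual prompt or code: the prompt wording, the label set, and the `build_prompt`/`parse_response` helpers are all assumptions, and the call to the OpenAI API itself is omitted so the sketch stays self-contained.

```python
# Minimal sketch of one annotation step in an LLM batch-annotation pipeline.
# Hypothetical prompt and label set; the real pipeline would send each prompt
# to the OpenAI API and validate the returned labels post hoc.

LABELS = {"as", "to be", "zero"}  # variants of the evaluative consider construction


def build_prompt(concordance_line: str) -> str:
    """Assemble a classification prompt for one COHA concordance line."""
    return (
        "Classify the complementation pattern of 'consider' in the sentence below.\n"
        "Answer with exactly one of: as, to be, zero, not-evaluative.\n\n"
        f"Sentence: {concordance_line}"
    )


def parse_response(reply: str) -> str:
    """Map a raw model reply onto the closed label set; anything else is
    treated as a non-instance of the evaluative construction."""
    answer = reply.strip().lower()
    return answer if answer in LABELS else "not-evaluative"


if __name__ == "__main__":
    line = "Most critics considered the novel to be a masterpiece."
    print(build_prompt(line))
    print(parse_response("to be"))   # a valid variant label
    print(parse_response("maybe?"))  # falls back to "not-evaluative"
```

Constraining the model to a closed label set and normalizing its replies in this way is what makes pre-hoc evaluation (accuracy against a hand-annotated sample) and post-hoc validation straightforward to automate.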