Agent-Driven Corpus Linguistics: A Framework for Autonomous Linguistic Discovery

Corpus linguistics has traditionally relied on human researchers to formulate hypotheses, construct queries, and interpret results, a process that demands specialized technical skills and considerable time. We propose Agent-Driven Corpus Linguistics, an approach in which a large language model (LLM), connected to a corpus query engine via a structured tool-use interface, takes over the investigative cycle: generating hypotheses, querying the corpus, interpreting results, and refining the analysis across multiple rounds. The human researcher sets the direction and evaluates the final output. Unlike unconstrained LLM generation, every finding is anchored in verifiable corpus evidence. We treat this not as a replacement for the corpus-based/corpus-driven distinction but as a complementary dimension: it concerns who conducts the inquiry, not the epistemological relationship between theory and data. We demonstrate the framework by linking an LLM agent to a CQP-indexed Gutenberg corpus (5 million tokens) via the Model Context Protocol (MCP). Given only the prompt "investigate English intensifiers," the agent identified a diachronic relay chain (so+ADJ > very > really), three pathways of semantic change (delexicalization, polarity fixation, metaphorical constraint), and register-sensitive distributions. A controlled baseline experiment shows that corpus grounding contributes quantification and falsifiability that the model cannot produce from training data alone. To test external validity, the agent replicated two published studies, Claridge (2025) and De Smet (2013), on the CLMET corpus (40 million tokens), with close quantitative agreement. Agent-driven corpus research can thus produce empirically grounded findings at machine speed, lowering the technical barrier for a broader range of researchers.
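The investigative cycle described above (hypothesize, query, interpret, refine) can be illustrated with a minimal sketch. It is not the system reported here: the LLM is replaced by a fixed stub (`propose`), the CQP engine and MCP interface by a plain Python function (`count_word`), and the corpus by two invented sentences. All names and data in the block are hypothetical and serve only to show how each finding stays tied to a query result.

```python
from dataclasses import dataclass
from typing import Optional

# Toy stand-in for an indexed corpus: (period label, tokens).
# Invented data for illustration only; not drawn from Gutenberg or CLMET.
TOY_CORPUS = [
    ("earlier", "it was so cold and very dark and so late".split()),
    ("later", "it was really cold and very dark and really late".split()),
]


def count_word(word: str) -> dict:
    """Hypothetical corpus tool: relative frequency of `word` in each period."""
    return {period: tokens.count(word) / len(tokens) for period, tokens in TOY_CORPUS}


@dataclass
class Finding:
    hypothesis: str
    evidence: dict   # the query result the finding is anchored in
    supported: bool


def propose(round_no: int, findings: list) -> Optional[tuple]:
    """Stub for the LLM: returns (hypothesis, word to query), or None to stop.

    A real agent would condition its next step on `findings`; this stub
    simply walks through a fixed plan.
    """
    plan = [
        ("'so' declines over time", "so"),
        ("'really' rises over time", "really"),
    ]
    return plan[round_no] if round_no < len(plan) else None


def investigate(max_rounds: int = 5) -> list:
    """Run the hypothesize -> query -> interpret loop for up to `max_rounds`."""
    findings = []
    for round_no in range(max_rounds):
        step = propose(round_no, findings)
        if step is None:  # the agent has nothing further to test
            break
        hypothesis, word = step
        evidence = count_word(word)                      # query the corpus
        rising = evidence["later"] > evidence["earlier"]  # interpret the result
        supported = rising == ("rises" in hypothesis)
        findings.append(Finding(hypothesis, evidence, supported))
    return findings


if __name__ == "__main__":
    for finding in investigate():
        print(finding)
```

Because each `Finding` carries the raw frequencies it was judged on, a human reviewer can check or falsify the claim without rerunning the agent, which is the property the abstract attributes to corpus grounding.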