From Labels to Facets: Building a Taxonomically Enriched Turkish Learner Corpus

Most learner corpora are annotated with holistic, flat label inventories which, even when extensive, do not explicitly separate the multiple linguistic dimensions of an error. This makes linguistically deep annotation difficult and complicates fine-grained analyses of why and how learners produce specific errors. To address these limitations, this paper presents a semi-automated annotation methodology for learner corpora, built upon a recently proposed faceted taxonomy and implemented through a novel annotation extension framework. The taxonomy provides a theoretically grounded, multi-dimensional categorization that captures the linguistic properties underlying each error instance, thereby enabling standardized, fine-grained, and interpretable enrichment beyond flat annotations. The annotation extension tool, an implementation of the proposed framework for Turkish, automatically extends existing flat annotations by inferring additional linguistic and metadata information as facets within the taxonomy, providing richer learner-specific context. In a systematic evaluation, the tool achieved a facet-level accuracy of 95.86%. The resulting taxonomically enriched corpus offers enhanced querying capabilities and supports detailed exploratory analyses across learner corpora, enabling researchers to investigate error patterns along complex linguistic and pedagogical dimensions. This work introduces the first collaboratively annotated and taxonomically enriched Turkish Learner Corpus, a manual annotation guideline with a refined tagset, and an annotation extender. Because this is the first corpus designed in accordance with the recently introduced taxonomy, we expect our study to pave the way for subsequent enrichment of existing error-annotated learner corpora.
