Native Language Identification (NLI) aims to classify an author's native language based on their writing in another language. Historically, the task has relied heavily on time-consuming linguistic feature engineering, and transformer-based NLI models have thus far failed to offer effective, practical alternatives. The current work investigates whether input size is a limiting factor, and shows that classifiers trained on Big Bird embeddings outperform linguistic feature engineering models by a large margin on the Reddit-L2 dataset. Additionally, we provide further insight into input length dependencies, show consistent out-of-sample performance, and qualitatively analyze the embedding space. Given the effectiveness and computational efficiency of this method, we believe it offers a promising avenue for future NLI work.