Lexical data collected during language documentation often contain transcription errors and undocumented borrowings that can mislead linguistic analysis. We present unsupervised anomaly detection methods for identifying phonotactic inconsistencies in wordlists and apply them to a multilingual dataset of Kokborok varieties together with Bangla. Using character-level and syllable-level phonotactic features, our algorithms flag potential transcription errors and borrowings. Although precision and recall remain modest because these anomalies are subtle, syllable-aware features significantly outperform character-level baselines. The high-recall approach gives fieldworkers a systematic way to flag entries that need verification, supporting data quality improvement in low-resource language documentation.