Constructing word vector representations (word embeddings) is a key task in modern applied computational linguistics; such embeddings are widely used in natural language processing tasks such as sentiment analysis and information extraction. Choosing an appropriate method for generating embeddings often requires techniques for assessing their quality. A standard approach is to compute distances between the vectors of word pairs whose 'similarity' has been rated by experts and to compare those distances with the expert ratings. This work introduces the first 'silver standard' dataset of this kind for the Kyrgyz language, trains corresponding embedding models, and validates the dataset's suitability using quality evaluation metrics.
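The similarity-based evaluation described above can be illustrated with a minimal sketch. It assumes cosine similarity as the vector measure and Spearman rank correlation as the agreement metric; both are common choices for this kind of evaluation, but the abstract does not state which measures were actually used. The function names, the toy vectors, and the placeholder words are all hypothetical and are not taken from the Kyrgyz dataset.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def evaluate(embeddings, scored_pairs):
    """Correlate model similarities with human ratings.

    Pairs containing an out-of-vocabulary word are skipped.
    Returns (spearman_rho, number_of_pairs_used).
    """
    model_scores, human_scores = [], []
    for w1, w2, score in scored_pairs:
        if w1 in embeddings and w2 in embeddings:
            model_scores.append(cosine_similarity(embeddings[w1], embeddings[w2]))
            human_scores.append(score)
    return spearman(model_scores, human_scores), len(human_scores)


# Hypothetical toy data: placeholder words and made-up vectors and ratings.
toy_embeddings = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
    "d": [0.5, 0.5],
}
toy_pairs = [
    ("a", "b", 9.0),
    ("a", "d", 5.0),
    ("a", "c", 1.0),
    ("a", "missing", 7.0),  # skipped: "missing" has no vector
]

rho, covered = evaluate(toy_embeddings, toy_pairs)
print(f"Spearman rho = {rho:.3f} over {covered} pairs")
```

A higher correlation suggests that the embedding model orders word pairs more like the human raters do. Reporting the number of covered pairs alongside the correlation matters because pairs with out-of-vocabulary words are silently dropped in this sketch.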