We present a hybrid methodology for generating large-scale semantic relationship datasets in low-resource languages, demonstrated through a comprehensive Turkish semantic relations corpus. Our approach integrates three phases: (1) FastText embeddings with Agglomerative Clustering to identify semantic clusters, (2) Gemini 2.5-Flash for automated semantic relationship classification, and (3) integration with curated dictionary sources. The resulting dataset comprises 843,000 unique Turkish semantic pairs across three relationship types (synonyms, antonyms, co-hyponyms), a roughly tenfold scale increase over existing resources, produced at minimal cost ($65). We validate the dataset through two downstream tasks: an embedding model achieving 90% top-1 retrieval accuracy and a classification model attaining a 90% macro-F1 score. Our scalable protocol addresses critical data scarcity in Turkish NLP and demonstrates applicability to other low-resource languages. We publicly release the dataset and models.