Semantic textual similarity (STS) plays a crucial role in many natural language processing tasks. While it has been studied extensively for high-resource languages, STS remains challenging for under-resourced languages such as Slovak. This paper presents a comparative evaluation of sentence-level STS methods applied to Slovak, including traditional algorithms, supervised machine learning models, and third-party deep learning tools. We trained several machine learning models using the outputs of traditional algorithms as features, with feature selection and hyperparameter tuning jointly guided by artificial bee colony optimization. Finally, we evaluated several third-party tools, including a fine-tuned model by CloudNLP, OpenAI's embedding models, the GPT-4 model, and the pretrained SlovakBERT model. Our findings highlight the trade-offs between the different approaches.
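To illustrate the idea of using the outputs of traditional algorithms as features for a supervised model, the following is a minimal sketch. It is an assumption-laden illustration, not the pipeline used in this paper: the three measures (token-level Jaccard, bag-of-words cosine, and a character-level `difflib` ratio), the whitespace tokenizer, and the function names are all hypothetical choices made here, and the artificial bee colony step is not shown.

```python
from collections import Counter
from difflib import SequenceMatcher
import math


def tokens(sentence):
    # Naive lowercase whitespace tokenization (an assumption for this sketch;
    # a real Slovak pipeline would likely add lemmatization or stemming).
    return sentence.lower().split()


def jaccard(a, b):
    # Overlap of the two token sets divided by their union.
    sa, sb = set(tokens(a)), set(tokens(b))
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def cosine_bow(a, b):
    # Cosine similarity between bag-of-words count vectors.
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def char_ratio(a, b):
    # Character-level similarity from the standard library's SequenceMatcher.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def similarity_features(a, b):
    # One feature vector per sentence pair; each entry is the output of one
    # traditional similarity measure and lies in [0, 1].
    return [jaccard(a, b), cosine_bow(a, b), char_ratio(a, b)]
```

Calling `similarity_features("Pes beží po ulici.", "Pes beží po ceste.")` yields one three-element vector for the pair. Vectors of this kind, paired with human similarity scores, could then serve as the training input for any supervised regressor, with an optimizer selecting which features to keep.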