This report presents GMUNLP's participation in the Dialect-Copa shared task at VarDial 2024, which evaluates the commonsense reasoning capabilities of large language models (LLMs) on South Slavic micro-dialects. The task assesses how well LLMs handle non-standard dialectal varieties, since their performance on standard languages is already well established. We propose an approach that combines the strengths of different types of language models and uses data augmentation to improve task performance on three South Slavic dialects: Chakavian, Cerkno, and Torlak. We conduct experiments with a language-family-focused encoder-based model (BERTić) and a domain-agnostic multilingual model (AYA-101). Our results show that the proposed data augmentation techniques yield substantial performance gains on all three test datasets in the open-source model category. This work highlights the practical utility of data augmentation and the potential of LLMs for handling non-standard dialectal varieties, contributing to the broader goal of advancing natural language understanding in low-resource and dialectal settings. Code: https://github.com/ffaisal93/dialect_copa