Variations in languages across geographic regions or cultures are crucial to address to avoid biases in NLP systems designed for culturally sensitive tasks, such as hate speech detection or dialog with conversational agents. In languages such as Spanish, where varieties can significantly overlap, many examples can be valid across them, which we refer to as common examples. Ignoring these examples may cause misclassifications, reducing model accuracy and fairness. Therefore, accounting for these common examples is essential to improve the robustness and representativeness of NLP systems trained on such data. In this work, we address this problem in the context of Spanish varieties. We use training dynamics to automatically detect common examples or errors in existing Spanish datasets. We demonstrate the efficacy of using predicted label confidence for our Datamaps \cite{swayamdipta-etal-2020-dataset} implementation for the identification of hard-to-classify examples, especially common examples, enhancing model performance in variety identification tasks. Additionally, we introduce a Cuban Spanish Variety Identification dataset with common examples annotations developed to facilitate more accurate detection of Cuban and Caribbean Spanish varieties. To our knowledge, this is the first dataset focused on identifying the Cuban, or any other Caribbean, Spanish variety.
翻译:语言在地理区域或文化间的差异对于避免在面向文化敏感任务(如仇恨言论检测或与对话代理的交互)设计的NLP系统中产生偏见至关重要。在诸如西班牙语这类变体间可能存在显著重叠的语言中,许多示例在不同变体间均可能有效,我们称之为常见示例。忽略这些示例可能导致误分类,从而降低模型的准确性和公平性。因此,考虑这些常见示例对于提升基于此类数据训练的NLP系统的鲁棒性和代表性至关重要。在本研究中,我们在西班牙语变体的背景下探讨此问题。我们利用训练动态来自动检测现有西班牙语数据集中的常见示例或错误。我们证明了在Datamaps \cite{swayamdipta-etal-2020-dataset}实现中使用预测标签置信度来识别难以分类的示例(尤其是常见示例)的有效性,从而提升了变体识别任务中的模型性能。此外,我们引入了一个带有常见示例标注的古巴西班牙语变体识别数据集,该数据集旨在促进对古巴及加勒比地区西班牙语变体更准确的检测。据我们所知,这是首个专注于识别古巴或任何其他加勒比地区西班牙语变体的数据集。