In recent years, the main focus of research on automatic readability assessment (ARA) has shifted towards expensive deep learning-based methods, with the primary goal of increasing model accuracy. This approach, however, is rarely applicable to low-resource languages, where traditional handcrafted features are still widely used because NLP tools for extracting deeper linguistic representations do not exist. In this work, we take a step back from the technical component and focus on how linguistic aspects such as mutual intelligibility or degree of language relatedness can improve ARA in a low-resource setting. We collect short stories written in three languages of the Philippines (Tagalog, Bikol, and Cebuano) to train readability assessment models and explore the interaction of data and features in various cross-lingual setups. Our results show that the inclusion of CrossNGO, a novel specialized feature exploiting n-gram overlap applied to languages with high mutual intelligibility, significantly improves the performance of ARA models compared to the use of off-the-shelf large multilingual language models alone. When both linguistic representations are combined, we achieve state-of-the-art results for Tagalog and Cebuano, and establish baseline scores for ARA in Bikol.
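To make the idea of an n-gram overlap feature concrete, the following is a minimal sketch and not the CrossNGO implementation itself. It assumes character-level n-grams, a reference set built from the `k` most frequent n-grams of a language's corpus, and an overlap score defined as the fraction of a text's distinct n-grams found in that set. The function names (`char_ngrams`, `top_ngrams`, `overlap_feature`) and the toy strings are hypothetical and chosen only for illustration.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams inside each whitespace-separated token."""
    counts = Counter()
    for token in text.lower().split():
        for i in range(len(token) - n + 1):
            counts[token[i:i + n]] += 1
    return counts


def top_ngrams(corpus, n, k):
    """Return the set of the k most frequent character n-grams in a corpus.

    `corpus` is an iterable of strings; `k` is an assumed cutoff.
    """
    total = Counter()
    for doc in corpus:
        total.update(char_ngrams(doc, n))
    return {gram for gram, _ in total.most_common(k)}


def overlap_feature(text, reference, n):
    """Fraction of the text's distinct n-grams that appear in `reference`."""
    grams = set(char_ngrams(text, n))
    if not grams:
        return 0.0
    return len(grams & reference) / len(grams)


if __name__ == "__main__":
    # Toy placeholder strings, not drawn from any real dataset.
    language_a = ["ang bata ay masaya", "ang aso ay malaki"]
    language_b = ["ang bata malipayon", "ang iro dako"]
    ref_a = top_ngrams(language_a, n=2, k=10)
    ref_b = top_ngrams(language_b, n=2, k=10)
    sample = "ang bata ay malaki"
    # One overlap score per reference language; these could be appended
    # to a feature vector alongside other representations.
    print(overlap_feature(sample, ref_a, 2), overlap_feature(sample, ref_b, 2))
```

Under these assumptions, each text receives one overlap score per reference language and n-gram order, so closely related languages would tend to yield higher cross-lingual scores than unrelated ones.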