Determining the difficulty of a text involves assessing the textual features that may affect reader comprehension, yet existing research on Vietnamese has focused only on statistical features. This paper introduces a new approach that integrates statistical and semantic features to assess text readability. We use three datasets: the Vietnamese Text Readability Dataset (ViRead), OneStopEnglish, and RACE, with the latter two translated into Vietnamese. For the semantic side, we use state-of-the-art language models such as PhoBERT, ViDeBERTa, and ViBERT. For the statistical side, we extract syntactic and lexical features of the text. We conducted experiments with several machine learning models, including Support Vector Machine (SVM), Random Forest, and Extra Trees, and evaluated their performance using accuracy and F1 score. Our results indicate that a joint approach combining semantic and statistical features significantly improves readability classification accuracy compared with using either feature type alone. The study underscores the importance of considering both statistical and semantic aspects for a more accurate assessment of text difficulty in Vietnamese, offers insights into the adaptability of advanced language models to Vietnamese text readability, and lays the groundwork for future research in this area.
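As a rough illustration of the joint approach described above, the following sketch concatenates a precomputed text embedding with a few surface statistics into one feature vector. The function names `statistical_features` and `joint_features`, the three statistics chosen, and the whitespace tokenization are assumptions made for illustration; they are not the feature set or pipeline of the study. The embedding is treated as an input that has already been produced elsewhere (for example by a model such as PhoBERT), and no language model is loaded here.

```python
import re
from typing import List, Sequence


def statistical_features(text: str) -> List[float]:
    """Return three simple surface statistics for a text.

    Hypothetical stand-ins for lexical and syntactic features:
    average sentence length (in tokens), average token length
    (in characters), and type-token ratio.
    """
    # Split on sentence-final punctuation and keep non-empty pieces.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Whitespace tokenization is an assumption: in Vietnamese, spaces
    # separate syllables, so a word segmenter would be needed in practice.
    tokens = [t.strip(".,;:!?\"'()") for t in text.lower().split()]
    tokens = [t for t in tokens if t]
    if not sentences or not tokens:
        return [0.0, 0.0, 0.0]
    avg_sentence_len = len(tokens) / len(sentences)
    avg_token_len = sum(len(t) for t in tokens) / len(tokens)
    type_token_ratio = len(set(tokens)) / len(tokens)
    return [avg_sentence_len, avg_token_len, type_token_ratio]


def joint_features(text: str, embedding: Sequence[float]) -> List[float]:
    """Concatenate a precomputed embedding with the surface statistics."""
    return list(embedding) + statistical_features(text)
```

The resulting vector could then be passed to any standard classifier, such as an SVM or a tree ensemble. In practice the two feature groups would likely need scaling before concatenation, since embedding dimensions and raw counts live on different ranges.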