Text readability assessment has attracted significant attention from researchers in various domains. However, different research groups use different corpora, and the compatibility between these corpora remains largely unexplored. In this study, we propose a novel evaluation framework, Cross-corpus text Readability Compatibility Assessment (CRCA), to address this issue. The framework encompasses three key components: (1) Corpora and features: six corpora (CEFR, CLEC, CLOTH, NES, OSP, and RACE), from which linguistic features, GloVe word vector representations, and their fusion features were extracted. (2) Classification models: machine learning methods (XGBoost, SVM) and deep learning methods (BiLSTM, Attention-BiLSTM). (3) Compatibility metrics: RJSD, RRNSS, and NDCG. Our findings are as follows: (1) Corpus compatibility was validated, with OSP standing out as significantly different from the other corpora. (2) There is an adaptation effect among corpora, feature representations, and classification methods. (3) The three metrics yield consistent outcomes, validating the robustness of the compatibility assessment framework. These results offer valuable insights into corpus selection, feature representation, and classification methods, and can serve as a starting point for cross-corpus transfer learning.