Evaluating factual consistency is essential for reliable text summarization, particularly in high-stakes domains such as healthcare and news. However, most existing evaluation metrics overlook Bangla, a widely spoken yet under-resourced language, and often depend on reference summaries. We introduce BanglaSummEval, a reference-free, question-answering-based framework for evaluating factual consistency in Bangla summarization. The proposed method assesses both factual accuracy and content coverage through automatically generated questions and answers derived from the source document and the summary. A single multilingual instruction-tuned language model handles question generation, question answering, candidate answer extraction, and question importance weighting. This unified design reduces system complexity and computational cost. To capture semantic consistency beyond surface-level overlap, we use BERTScore-Recall for answer comparison. We validate BanglaSummEval on 300 human-written summaries from educational and medical domains, demonstrating strong correlation with expert human judgments (Pearson's $r = 0.694$, Spearman's $\rho = 0.763$). By providing interpretable, step-wise diagnostics alongside reliable evaluation scores, BanglaSummEval offers a practical and transparent solution for factual consistency evaluation in low-resource language settings.
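The pipeline described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the LLM steps (question generation, answering from the summary, importance weighting) are stubbed with toy data, and a simple token-recall function stands in for BERTScore-Recall; all function and variable names are hypothetical.

```python
# Hypothetical sketch of a QA-based factual-consistency score in the
# style described in the abstract. A real system would call a multilingual
# instruction-tuned LLM for each step and use BERTScore-Recall for answer
# comparison; here those pieces are stubbed for illustration.

def token_recall(reference: str, candidate: str) -> float:
    """Placeholder for BERTScore-Recall: fraction of reference tokens
    that also appear in the candidate answer."""
    ref_tokens, cand_tokens = reference.split(), set(candidate.split())
    if not ref_tokens:
        return 0.0
    return sum(t in cand_tokens for t in ref_tokens) / len(ref_tokens)

def qa_consistency_score(qa_pairs, answer_fn, weights=None):
    """Weighted mean of per-question answer similarity.

    qa_pairs  -- (question, gold_answer) pairs generated from the source
    answer_fn -- answers a question using only the summary
    weights   -- optional per-question importance weights
    """
    if weights is None:
        weights = [1.0] * len(qa_pairs)
    total = sum(weights)
    score = 0.0
    for (question, gold), w in zip(qa_pairs, weights):
        predicted = answer_fn(question)
        score += w * token_recall(gold, predicted)
    return score / total if total else 0.0

# Toy example with stubbed QA outputs (English stand-ins for Bangla text).
qa_pairs = [("Who discovered the site?", "a local farmer"),
            ("When was it found?", "in 1998")]
summary_answers = {"Who discovered the site?": "a local farmer",
                   "When was it found?": "unknown"}
score = qa_consistency_score(qa_pairs, lambda q: summary_answers[q],
                             weights=[2.0, 1.0])
# The first (higher-weighted) question is answered consistently, the
# second is not, so score = (2*1.0 + 1*0.0) / 3 = 0.667
```

The importance weights let the metric emphasize questions about central facts, matching the question-importance-weighting step of the framework.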