Recent advances in Natural Language Processing (NLP) have largely favored resource-rich languages, leaving a significant gap in high-quality datasets for low-resource languages such as Hindi. The scarcity is particularly evident in text summarization, where the development of robust models is hindered by the lack of diverse, specialized corpora. To address this disparity, this study introduces a cost-effective, automated framework for creating a comprehensive Hindi text summarization dataset. Using the English Extreme Summarization (XSUM) dataset as the source corpus, we apply machine translation and linguistic adaptation techniques; to ensure fidelity and contextual relevance, we validate the output with the Crosslingual Optimized Metric for Evaluation of Translation (COMET), supplemented by the selective use of Large Language Models (LLMs) for curation. The resulting dataset is a diverse, multi-thematic resource that mirrors the complexity of the original XSUM corpus. This initiative not only provides a direct resource for Hindi NLP research but also offers a scalable methodology for democratizing NLP in other underserved languages. By reducing the cost of dataset creation, this work fosters the development of more nuanced, culturally relevant models in computational linguistics.
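To make the translate-then-filter pipeline concrete, the sketch below shows one plausible realization using the Hugging Face datasets/transformers stack and the unbabel-comet package. The model identifiers (Helsinki-NLP/opus-mt-en-hi for translation, the gated Unbabel/wmt22-cometkiwi-da reference-free COMET model for quality estimation) and the 0.75 retention threshold are illustrative assumptions, not the configuration used in this work.

```python
# Minimal sketch of the translate-and-filter loop described in the abstract.
# Assumptions: Helsinki-NLP/opus-mt-en-hi as the MT model, the reference-free
# COMET-QE model wmt22-cometkiwi-da (gated; requires Hugging Face login),
# and a 0.75 cut-off. None of these are the paper's reported settings.
from datasets import load_dataset
from transformers import pipeline
from comet import download_model, load_from_checkpoint

# Load a small slice of XSUM for demonstration (depending on your
# `datasets` version, this dataset may require trust_remote_code=True).
xsum = load_dataset("EdinburghNLP/xsum", split="train[:32]")

# Step 1: translate English summaries into Hindi. Full documents would go
# through the same path with length-aware chunking, omitted here for brevity.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-hi")
english = [ex["summary"] for ex in xsum]
hindi = [out["translation_text"] for out in translator(english, truncation=True)]

# Step 2: score each (source, translation) pair with a reference-free COMET
# quality-estimation model, since no gold Hindi references exist.
ckpt_path = download_model("Unbabel/wmt22-cometkiwi-da")
comet = load_from_checkpoint(ckpt_path)
pairs = [{"src": s, "mt": t} for s, t in zip(english, hindi)]
scores = comet.predict(pairs, batch_size=8, gpus=0).scores

# Step 3: retain high-confidence pairs; borderline cases would be routed to
# an LLM curation pass rather than being discarded outright.
THRESHOLD = 0.75  # assumed cut-off; calibrate against human judgments
kept = [(s, t) for s, t, sc in zip(english, hindi, scores) if sc >= THRESHOLD]
print(f"retained {len(kept)}/{len(pairs)} pairs above COMET-QE {THRESHOLD}")
```

In practice, the threshold would be calibrated on a human-rated sample, and both documents and summaries pass through the same scoring loop so that a pair is retained only when both sides translate faithfully.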