Bridging the Data Gap: Creating a Hindi Text Summarization Dataset from the English XSUM

Current advancements in Natural Language Processing (NLP) have largely favored resource-rich languages, leaving a significant gap in high-quality datasets for low-resource languages like Hindi. This scarcity is particularly evident in text summarization, where the development of robust models is hindered by a lack of diverse, specialized corpora. To address this disparity, this study introduces a cost-effective, automated framework for creating a comprehensive Hindi text summarization dataset. By leveraging the English Extreme Summarization (XSUM) dataset as a source, we employ advanced translation and linguistic adaptation techniques. To ensure high fidelity and contextual relevance, we utilize the Crosslingual Optimized Metric for Evaluation of Translation (COMET) for validation, supplemented by the selective use of Large Language Models (LLMs) for curation. The resulting dataset provides a diverse, multi-thematic resource that mirrors the complexity of the original XSUM corpus. This initiative not only provides a direct tool for Hindi NLP research but also offers a scalable methodology for democratizing NLP in other underserved languages. By reducing the costs associated with dataset creation, this work fosters the development of more nuanced, culturally relevant models in computational linguistics.
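The validation step described above (score each translated pair, keep the reliable ones, and route the rest to further curation) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `TranslatedPair` type, the `split_by_quality` helper, the `toy_length_ratio_scorer` stand-in, and the threshold value are all assumptions introduced here. In practice the scorer would wrap a real translation-quality metric such as COMET, and the cutoff would have to be chosen empirically.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class TranslatedPair:
    source_en: str       # English text taken from the source corpus
    translation_hi: str  # machine-translated Hindi text


# A scorer maps a batch of pairs to one quality score per pair (higher is better).
# Hypothetical interface: a real pipeline would plug a COMET-style metric in here.
Scorer = Callable[[List[TranslatedPair]], List[float]]


def toy_length_ratio_scorer(pairs: List[TranslatedPair]) -> List[float]:
    """Placeholder scorer (assumption): ratio of shorter to longer text length.

    This is NOT a translation-quality metric; it only makes the sketch runnable.
    """
    scores = []
    for pair in pairs:
        longer = max(len(pair.source_en), len(pair.translation_hi))
        shorter = min(len(pair.source_en), len(pair.translation_hi))
        scores.append(shorter / longer if longer else 0.0)
    return scores


def split_by_quality(
    pairs: List[TranslatedPair],
    scorer: Scorer,
    threshold: float,
) -> Tuple[List[TranslatedPair], List[TranslatedPair]]:
    """Split pairs into (accepted, flagged) using a score threshold.

    Accepted pairs go straight into the dataset; flagged pairs are candidates
    for further curation (for example, review or re-translation).
    """
    scores = scorer(pairs)
    if len(scores) != len(pairs):
        raise ValueError("scorer must return exactly one score per pair")
    accepted, flagged = [], []
    for pair, score in zip(pairs, scores):
        (accepted if score >= threshold else flagged).append(pair)
    return accepted, flagged
```

Keeping the scorer as a plain callable means the same filtering logic can be reused with any quality metric, and only the flagged subset needs to be passed to a more expensive curation step.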

