In this work, we introduce L3Cube-IndicNews, a multilingual text classification corpus built to provide a high-quality dataset for Indian regional languages, with a specific focus on news headlines and articles. We center our work on 10 prominent Indic languages: Hindi, Bengali, Marathi, Telugu, Tamil, Gujarati, Kannada, Odia, Malayalam, and Punjabi. Each of these news datasets comprises 10 or more classes of news articles. L3Cube-IndicNews offers 3 distinct datasets tailored to different document lengths: the Short Headlines Classification (SHC) dataset, containing the news headline and the news category; the Long Document Classification (LDC) dataset, containing the whole news article and the news category; and the Long Paragraph Classification (LPC) dataset, containing sub-articles of the news and the news category. We maintain consistent labeling across all 3 datasets to enable in-depth length-based analysis. We evaluate each of these Indic language datasets using 4 different models, including monolingual BERT, multilingual Indic Sentence BERT (IndicSBERT), and IndicBERT. This research expands the pool of available text classification datasets and makes it possible to develop topic classification models for Indian regional languages. It also serves as a resource for cross-lingual analysis, owing to the high overlap of labels among languages. The datasets and models are shared publicly at https://github.com/l3cube-pune/indic-nlp.