L3Cube-IndicNews: News-based Short Text and Long Document Classification Datasets in Indic Languages

In this work, we introduce L3Cube-IndicNews, a multilingual text classification corpus that provides high-quality datasets for Indian regional languages, with a specific focus on news headlines and articles. We focus on 10 prominent Indic languages: Hindi, Bengali, Marathi, Telugu, Tamil, Gujarati, Kannada, Odia, Malayalam, and Punjabi. Each of these news datasets comprises 10 or more classes of news articles. L3Cube-IndicNews offers 3 distinct datasets tailored to different document lengths: the Short Headlines Classification (SHC) dataset, containing the news headline and the news category; the Long Document Classification (LDC) dataset, containing the whole news article and the news category; and the Long Paragraph Classification (LPC) dataset, containing sub-articles of the news and the news category. We maintain consistent labeling across all 3 datasets to support in-depth length-based analysis. We evaluate each of these Indic language datasets using 4 different models, including monolingual BERT, multilingual Indic Sentence BERT (IndicSBERT), and IndicBERT. This work expands the pool of available text classification datasets and makes it possible to develop topic classification models for Indian regional languages. It also serves as a resource for cross-lingual analysis, owing to the high overlap of labels among languages. The datasets and models are shared publicly at https://github.com/l3cube-pune/indic-nlp
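To make the relationship between the three datasets concrete, the following is a minimal sketch of how one labeled news article could yield an SHC, an LPC, and an LDC example that all share the same category. It is an illustration only, not the authors' pipeline: the `Example` class, the `build_views` function, and the choice to treat each non-empty paragraph as a "sub-article" are assumptions introduced here, since the abstract does not describe how sub-articles are extracted.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Example:
    """One classification example: a piece of text and its news category."""
    text: str
    label: str


def build_views(
    headline: str, paragraphs: List[str], label: str
) -> Tuple[Example, List[Example], Example]:
    """Build SHC, LPC, and LDC examples from one article (hypothetical helper).

    Assumption: each non-empty paragraph stands in for a "sub-article".
    All three views carry the same label, mirroring the consistent
    labeling across datasets described in the abstract.
    """
    kept = [p.strip() for p in paragraphs if p.strip()]
    shc = Example(headline.strip(), label)      # short headline
    lpc = [Example(p, label) for p in kept]     # one example per paragraph
    ldc = Example("\n".join(kept), label)       # whole article
    return shc, lpc, ldc


shc, lpc, ldc = build_views(
    "Team wins final",
    ["First paragraph.", "", "Second paragraph."],
    "sports",
)
print(shc.label, len(lpc), ldc.label)
```

Because the label is attached once per article and copied to every view, any difference in classifier accuracy between the three views can be attributed to text length rather than to a change in the label set.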

