Being aware of important news is crucial for staying informed and making well-informed decisions efficiently. Natural Language Processing (NLP) approaches can significantly automate this process. This paper introduces the detection of important news, in a previously unexplored area, and presents a new benchmarking dataset (Khabarchin) for detecting important news in the Persian language. We define important news articles as those deemed significant for a considerable portion of society, capable of influencing their mindset or decision-making. The news articles are obtained from seven different prominent Persian news agencies, resulting in the annotation of 7,869 samples and the creation of the dataset. Two challenges of high disagreement and imbalance between classes were faced, and solutions were provided for them. We also propose several learning-based models, ranging from conventional machine learning to state-of-the-art transformer models, to tackle this task. Furthermore, we introduce the second task of important sentence detection in news articles, as they often come with a significant contextual length that makes it challenging for readers to identify important information. We identify these sentences in a weakly supervised manner.
翻译:及时了解重要新闻对于高效掌握信息并做出明智决策至关重要。自然语言处理(NLP)方法可显著实现这一过程的自动化。本文首次在先前未涉及的领域引入重要新闻检测任务,并提出了一个用于波斯语重要新闻检测的新型基准数据集(Khabarchin)。我们将重要新闻文章定义为对相当一部分社会群体具有意义、能够影响其思维方式或决策的内容。新闻文章来自七家不同的主流波斯语新闻机构,最终标注了7,869个样本并构建了数据集。我们面临了类别间高度不一致和不平衡这两大挑战,并提供了相应解决方案。此外,我们提出了多种基于学习的模型,涵盖从传统机器学习到最先进的Transformer模型,以解决该任务。同时,我们引入了第二个任务——新闻文章中的重要句子检测,因为新闻通常长度较长,使得读者难以快速定位关键信息。我们采用弱监督方式识别这些句子。