Weak supervision has emerged as a promising approach for rapid, large-scale dataset creation in response to the growing demand for accelerated NLP development. By leveraging labeling functions, practitioners can build learned label models that quickly produce soft-labeled datasets. This paper shows how this approach can be used to build Indonesian NLP datasets from conservation news text. We construct two types of datasets: multi-class classification and sentiment classification. We then provide baseline experiments using various pretrained language models. On the test sets, the baselines reach 59.79% accuracy and 55.72% F1-score for sentiment classification, and 66.87% macro F1-score, 71.5% micro F1-score, and 83.67% ROC-AUC for multi-class classification. We also release the datasets and labeling functions used in this work for further research and exploration.
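To make the idea of labeling functions and soft labels concrete, the following is a minimal sketch rather than the pipeline used in this work. The keyword lists, the function names, and the class encoding are hypothetical, and the learned label model is replaced here by a plain normalized vote count, which is a much simpler aggregation.

```python
from collections import Counter

# Hypothetical class encoding for a two-class sentiment task.
ABSTAIN = -1
NEGATIVE, POSITIVE = 0, 1

# Hypothetical Indonesian keyword lists, chosen only for illustration.
POSITIVE_WORDS = ("dilindungi", "berhasil", "pelestarian")
NEGATIVE_WORDS = ("perburuan", "punah", "kerusakan")


def lf_positive_keywords(text):
    """Vote POSITIVE if any positive keyword occurs, otherwise abstain."""
    lowered = text.lower()
    return POSITIVE if any(w in lowered for w in POSITIVE_WORDS) else ABSTAIN


def lf_negative_keywords(text):
    """Vote NEGATIVE if any negative keyword occurs, otherwise abstain."""
    lowered = text.lower()
    return NEGATIVE if any(w in lowered for w in NEGATIVE_WORDS) else ABSTAIN


LABELING_FUNCTIONS = [lf_positive_keywords, lf_negative_keywords]


def soft_label(text, n_classes=2):
    """Turn labeling-function votes into a probability vector.

    This is a normalized vote count, not a learned label model:
    every non-abstaining function gets equal weight.
    """
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    counts = Counter(v for v in votes if v != ABSTAIN)
    total = sum(counts.values())
    if total == 0:
        # No function fired: fall back to a uniform distribution.
        return [1.0 / n_classes] * n_classes
    return [counts.get(c, 0) / total for c in range(n_classes)]


if __name__ == "__main__":
    for sentence in (
        "Perburuan liar mengancam harimau",
        "Program pelestarian berhasil",
        "Pelestarian terhambat perburuan",
    ):
        print(sentence, soft_label(sentence))
```

A learned label model would instead estimate how accurate and how correlated the labeling functions are and weight their votes accordingly, so the equal weighting above should be read only as a stand-in.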