Weak supervision has emerged as a promising approach for rapid, large-scale dataset creation in response to the growing demand for accelerated NLP development. By combining labeling functions through a learned label model, weak supervision lets practitioners generate soft-labeled datasets quickly. This paper shows how such an approach can be used to build an Indonesian NLP dataset from conservation news text. We construct two types of datasets: multi-class classification and sentiment classification. We then provide baseline experiments using various pretrained language models. On the test set, the baselines reach 59.79% accuracy and 55.72% F1-score for sentiment classification, and 66.87% macro F1-score, 71.5% micro F1-score, and 83.67% ROC-AUC for multi-class classification. We also release the datasets and labeling functions used in this work for further research and exploration.
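To make the idea of labeling functions and soft labels concrete, the following is a minimal sketch and not the pipeline used in this work. The keyword lists, function names, and example sentences are hypothetical, and a normalized vote count stands in for the learned label model, which would additionally estimate how reliable each labeling function is.

```python
from collections import Counter

ABSTAIN = -1
NEGATIVE, POSITIVE = 0, 1

# Hypothetical keyword lists, for illustration only (not taken from the paper).
POSITIVE_WORDS = {"dilindungi", "diselamatkan", "pulih"}
NEGATIVE_WORDS = {"perburuan", "punah", "kebakaran"}


def lf_positive_keywords(text):
    """Vote POSITIVE if any positive keyword appears, otherwise abstain."""
    tokens = set(text.lower().split())
    return POSITIVE if tokens & POSITIVE_WORDS else ABSTAIN


def lf_negative_keywords(text):
    """Vote NEGATIVE if any negative keyword appears, otherwise abstain."""
    tokens = set(text.lower().split())
    return NEGATIVE if tokens & NEGATIVE_WORDS else ABSTAIN


LABELING_FUNCTIONS = [lf_positive_keywords, lf_negative_keywords]


def soft_label(text, lfs=LABELING_FUNCTIONS, n_classes=2):
    """Turn labeling-function votes into a probability vector over classes.

    Simplification: every non-abstaining vote counts equally. A learned
    label model would weight the votes instead.
    Returns None when every labeling function abstains.
    """
    votes = [v for v in (lf(text) for lf in lfs) if v != ABSTAIN]
    if not votes:
        return None
    counts = Counter(votes)
    return [counts[c] / len(votes) for c in range(n_classes)]


if __name__ == "__main__":
    for sentence in [
        "Orangutan diselamatkan",
        "Harimau diselamatkan dari perburuan",
        "Cuaca cerah",
    ]:
        print(sentence, "->", soft_label(sentence))
```

Documents on which the labeling functions disagree receive a split probability vector rather than a hard class, which is what makes the resulting dataset soft-labeled; documents on which all functions abstain are left unlabeled in this sketch.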