Many tasks in Computational Social Science and Web Content Analysis involve classifying pieces of text based on the claims they contain. State-of-the-art approaches usually involve fine-tuning models on large annotated datasets, which are costly to produce. In light of this, we propose and release a qualitative and versatile few-shot learning methodology as a common paradigm for any claim-based textual classification task. This methodology involves defining the classes as arbitrarily sophisticated taxonomies of claims and using Natural Language Inference models to obtain the textual entailment between these claims and a corpus of interest. The performance of these models is then boosted by annotating a minimal sample of data points, dynamically selected using the well-established statistical heuristic of Probabilistic Bisection. We illustrate this methodology in the context of three tasks: climate change contrarianism detection, topic/stance classification, and depression-related symptom detection. This approach rivals traditional pre-train/fine-tune approaches while drastically reducing the need for data annotation.
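To make the sampling heuristic concrete, the following is a minimal sketch of Probabilistic Bisection (Horstein's algorithm) in the role the abstract describes: locating an unknown threshold from noisy binary annotations while querying as few points as possible. The oracle, the interval `[0, 1]`, and the response-correctness probability `p` are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def probabilistic_bisection(oracle, p=0.8, n_queries=100, grid_size=1000):
    """Horstein's probabilistic bisection: estimate an unknown point x* in [0, 1]
    from binary oracle responses that are correct with known probability p.
    The oracle answers the question "does x* lie to the right of x?"."""
    grid = np.linspace(0.0, 1.0, grid_size)
    density = np.full(grid_size, 1.0 / grid_size)  # uniform prior over x*
    for _ in range(n_queries):
        # Always query at the posterior median (the bisection step).
        cdf = np.cumsum(density)
        x = grid[np.searchsorted(cdf, 0.5)]
        says_right = oracle(x)
        # Multiplicative Bayes update: scale the favoured side by p,
        # the other side by (1 - p), then renormalise.
        right_of_x = grid > x
        if says_right:
            density = np.where(right_of_x, p * density, (1 - p) * density)
        else:
            density = np.where(right_of_x, (1 - p) * density, p * density)
        density /= density.sum()
    cdf = np.cumsum(density)
    return grid[np.searchsorted(cdf, 0.5)]  # posterior median estimate of x*

# Usage: a noisy oracle for a hypothetical true threshold of 0.37,
# answering correctly 80% of the time.
rng = np.random.default_rng(0)
def noisy_oracle(x, target=0.37, p=0.8):
    truth = target > x
    return truth if rng.random() < p else not truth

estimate = probabilistic_bisection(noisy_oracle, p=0.8, n_queries=100)
```

The key property, and the reason it suits annotation budgeting, is that the posterior concentrates geometrically around the true point, so the number of annotator queries grows only logarithmically with the desired precision even under noisy labels.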