Researchers in the political and social sciences often rely on classification models to analyze trends in information consumption by examining the browsing histories of millions of webpages. Because manual labeling is impractical at this scale, automated, scalable methods are necessary. In this paper, we model the detection of topic-related content as a binary classification task and compare the accuracy of fine-tuned pre-trained encoder models against in-context learning strategies. Using only a few hundred annotated data points per topic, we detect content related to three German policies in a database of scraped webpages. We compare multilingual and monolingual models, as well as zero- and few-shot approaches, and investigate the impact of negative sampling strategies and of combining URL- and content-based features. Our results show that a small sample of annotated data is sufficient to train an effective classifier, and that fine-tuning encoder-based models yields better results than in-context learning. Classifiers that use both URL- and content-based features perform best, while URL-based features alone provide adequate results when page content is unavailable.
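To make the fine-tuning setup concrete, the sketch below shows one plausible instantiation of the approach: fine-tuning a pre-trained multilingual encoder for binary relevance classification, with URL- and content-based features combined into a single input. The choice of `xlm-roberta-base`, the Hugging Face `transformers` API, the toy example data, and the strategy of concatenating URL and content with the tokenizer's separator token are all assumptions for illustration, not details specified in the abstract.

```python
import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical (url, page_content, label) triples: label 1 marks pages
# related to the target policy, label 0 marks negative samples.
examples = [
    ("https://example.de/politik/rentenreform", "Die Rentenreform sieht vor ...", 1),
    ("https://example.de/sport/ergebnisse", "Die Bundesliga-Ergebnisse vom Wochenende ...", 0),
]

# A multilingual encoder; a monolingual German model would work analogously.
model_name = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Combine URL and content features by concatenating them into one input,
# separated by the tokenizer's separator token. Dropping the content part
# yields the URL-only variant.
texts = [f"{url} {tokenizer.sep_token} {content}" for url, content, _ in examples]
labels = torch.tensor([label for _, _, label in examples])
batch = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")

# Standard fine-tuning loop; with only a few hundred annotated examples per
# topic, a few epochs at a small learning rate are typically sufficient.
optimizer = AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):
    optimizer.zero_grad()
    out = model(**batch, labels=labels)
    out.loss.backward()
    optimizer.step()
```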