GPT Assisted Annotation of Rhetorical and Linguistic Features for Interpretable Propaganda Technique Detection in News Text

While the use of machine learning to detect propaganda techniques in text has garnered considerable attention, most approaches rely on "black-box" solutions with opaque inner workings. Interpretable approaches offer an alternative, but they depend on careful feature engineering and costly expert-annotated data. Moreover, the language features specific to propagandistic text are generally studied by rhetoricians and linguists, and no data set labeled with such features is suitable for machine learning. This study codifies 22 rhetorical and linguistic features identified in the literature on the language of persuasion, for the purpose of annotating an existing data set labeled with propaganda techniques. To help human experts annotate natural language sentences with these features, we designed RhetAnn, a web application that minimizes the otherwise considerable mental effort involved. Finally, a small set of annotated data was used to fine-tune GPT-3.5, a generative large language model (LLM), to annotate the remaining data while optimizing for financial cost and classification accuracy. This study demonstrates that combining a small number of human-annotated examples with GPT can be an effective strategy for scaling the annotation process at a fraction of the cost of traditional annotation that relies solely on human experts. The results are on par with those of GPT-4, the best-performing model at the time of writing, at one-tenth of the cost. Our contribution is a set of features, with their properties, definitions, and examples, in a machine-readable format, along with the code for RhetAnn and the GPT prompts and fine-tuning procedures, all intended to advance the state of the art in interpretable propaganda technique detection.
