Stance detection holds great potential for enhancing the quality of online political discussions, as it has been shown to be useful for summarizing discussions, detecting misinformation, and evaluating opinion distributions. Stance detection is usually performed with transformer-based models, which require large amounts of training data. However, the broad range of debate questions in online political discussion creates a variety of possible scenarios that the model is faced with, and thus makes data acquisition for model training difficult. In this work, we show how to leverage LLM-generated synthetic data to train and improve stance detection agents for online political discussions: (i) We generate synthetic data for specific debate questions by prompting a Mistral-7B model and show that fine-tuning with the generated synthetic data can substantially improve the performance of stance detection. (ii) We examine the impact of combining synthetic data with the most informative samples from an unlabelled dataset. First, we use the synthetic data to select the most informative samples; second, we combine these samples with the synthetic data for fine-tuning. This approach reduces labelling effort and consistently surpasses the performance of the baseline model that is trained with fully labelled data. Overall, we show in comprehensive experiments that LLM-generated data greatly improves stance detection performance for online political discussions.
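The two-step pipeline in (ii) can be sketched in miniature. This is a hypothetical illustration, not the paper's implementation: the prompt wording, the function names (`build_prompt`, `select_informative`), and the word-overlap similarity used as a stand-in for "most informative" are all assumptions made for the sketch.

```python
# Hypothetical sketch of the pipeline: (1) prompt an LLM for synthetic
# stance-labelled comments, (2) use them to pick the unlabelled samples
# that look least like the synthetic data (a crude proxy for informativeness).
# All names and the overlap heuristic are illustrative assumptions.

def build_prompt(question: str, stance: str, n: int = 5) -> str:
    """Assemble a generation prompt for an instruction-tuned LLM
    (e.g. Mistral-7B); the exact wording here is made up."""
    return (
        f"Generate {n} short forum comments on the question: '{question}'. "
        f"Each comment should clearly express a '{stance}' stance."
    )

def select_informative(unlabelled: list[str], synthetic: list[str], k: int = 2) -> list[str]:
    """Return the k unlabelled comments least similar to any synthetic
    comment, using Jaccard word overlap as a toy similarity measure."""
    def overlap(a: str, b: str) -> float:
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / max(1, len(wa | wb))
    # Score each unlabelled comment by its best match among synthetic ones.
    scored = [(max(overlap(u, s) for s in synthetic), u) for u in unlabelled]
    scored.sort(key=lambda t: t[0])  # least similar (most novel) first
    return [u for _, u in scored[:k]]

if __name__ == "__main__":
    synthetic = ["we should raise taxes", "taxes must go up"]
    unlabelled = ["taxes should rise now", "cats are cute pets", "raise taxes on the rich"]
    print(build_prompt("Should taxes be raised?", "favor"))
    print(select_informative(unlabelled, synthetic, k=2))
```

The selected samples would then be sent for human labelling and combined with the synthetic data for fine-tuning; a real system would replace the word-overlap heuristic with model-based uncertainty or embedding distance.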