Supervised text models are a valuable tool for political scientists but present several obstacles to their use, including the expense of hand-labeling documents, the difficulty of retrieving rare relevant documents for annotation, and copyright and privacy concerns involved in sharing annotated documents. This article proposes a partial solution to these three issues, in the form of controlled generation of synthetic text with large language models. I provide a conceptual overview of text generation, guidance on when researchers should prefer different techniques for generating synthetic text, a discussion of ethics, and a simple technique for improving the quality of synthetic text. I demonstrate the usefulness of synthetic text with three applications: generating synthetic tweets describing the fighting in Ukraine, synthetic news articles describing specified political events for training an event detection system, and a multilingual corpus of populist manifesto statements for training a sentence-level populism classifier.