CommunityFish: A Poisson-based Document Scaling With Hierarchical Clustering

Document scaling is a key component of text-as-data applications in the social sciences and a major field of interest for political researchers, who aim to uncover differences between speakers or parties using a range of probabilistic and non-probabilistic approaches. Yet most of these techniques either rest on the agnostic bag-of-words assumption or rely on prior information borrowed from external sources, which may introduce significant bias into the results. While a corpus has long been treated as a collection of documents, it can also be seen as a dense network of connected words, whose structure can be clustered into distinct groups of words, known as communities, based on their co-occurrence in documents. This paper introduces CommunityFish, an augmented version of Wordfish that applies a hierarchical clustering method, namely the Louvain algorithm, to the word space. The resulting communities, which emerge from the corpus as semantic and independent n-grams, are then used as the input to the Wordfish method in place of the full word space. This strategy improves the interpretability of the results: because communities do not overlap, they carry substantial informative power for discriminating between parties or speakers. It also allows the Poisson scaling model to run faster. Beyond yielding communities, which are assumed to be proxies for subtopics, the technique outperforms the classic Wordfish model in highlighting historical developments in the U.S. State of the Union addresses, and it replicates the prevailing political stances in Germany when applied to a corpus of parties' legislative manifestos.
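
For orientation, the Poisson scaling model underlying Wordfish can be sketched as follows. The notation is the standard Wordfish formulation and is an assumption here, since the abstract itself does not fix any symbols. Under CommunityFish as described above, the feature index j would range over Louvain communities rather than individual words, so that y_{ij} counts occurrences of community j in document i.

```latex
% Standard Wordfish specification (notation assumed, not taken from the abstract).
% i indexes documents (speakers or parties); j indexes features:
% words in Wordfish, communities in CommunityFish.
\begin{align}
  y_{ij} &\sim \mathrm{Poisson}(\lambda_{ij}), \\
  \log \lambda_{ij} &= \alpha_i + \psi_j + \beta_j\,\omega_i ,
\end{align}
% \alpha_i : document fixed effect (document length)
% \psi_j   : feature fixed effect (overall frequency of feature j)
% \beta_j  : discriminating weight of feature j
% \omega_i : estimated position of document i on the latent scale
```

Because the number of communities is much smaller than the vocabulary size, far fewer \psi_j and \beta_j parameters have to be estimated, which is consistent with the faster execution noted above.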
