Migration has been a core topic in German political debate, from the postwar displacement of millions of expellees to labor migration and recent refugee movements. Studying political speech across such wide-ranging phenomena in depth has traditionally required extensive manual annotation, limiting analysis to small subsets of the data. Large language models (LLMs) offer a potential way to overcome this constraint. Using a theory-driven annotation scheme, we examine how well LLMs annotate subtypes of solidarity and anti-solidarity in German parliamentary debates and whether the resulting labels support valid downstream inference. We first provide a comprehensive evaluation of multiple LLMs, analyzing the effects of model size, prompting strategies, fine-tuning, historical versus contemporary data, and systematic error patterns. We find that the strongest models, especially GPT-5 and gpt-oss-120B, achieve human-level agreement on this task, although their errors remain systematic and bias downstream results. To address this issue, we combine soft-label model outputs with Design-based Supervised Learning (DSL) to reduce bias in long-term trend estimates. Beyond the methodological evaluation, we interpret the resulting annotations from a social-scientific perspective to trace trends in solidarity and anti-solidarity toward migrants in postwar and contemporary Germany. Our approach shows relatively high levels of solidarity in the postwar period, especially in group-based and compassionate forms, and a marked rise in anti-solidarity since 2015, framed through exclusion, undeservingness, and resource burden. We argue that LLMs can support large-scale social-scientific text analysis, but only when their outputs are rigorously validated and statistically corrected.