Can Commercial LLMs Be Parliamentary Political Companions? Comparing LLM Reasoning Against Romanian Legislative Expuneri de Motive

This paper evaluates whether commercial large language models (LLMs) can function as reliable political advisory tools by comparing their outputs against official legislative reasoning. Using a dataset of 15 Romanian Senate law proposals paired with their official explanatory memoranda (expuneri de motive), we test six LLMs spanning three provider families and multiple capability tiers: GPT-5-mini and GPT-5-chat (OpenAI); Claude Haiku 4.5 (Anthropic); and Llama 4 Maverick, Llama 3.3 70B, and Llama 3.1 8B (Meta). Each model generates predicted rationales, which we evaluate through a dual framework combining LLM-as-Judge semantic scoring and programmatic text similarity metrics. We frame the LLM-politician relationship through principal-agent theory and bounded rationality, conceptualizing the legislator as a principal delegating advisory tasks to a boundedly rational agent under structural information asymmetry. Results reveal a sharp two-tier structure: frontier models (Claude Haiku 4.5, GPT-5-chat, GPT-5-mini) achieve statistically indistinguishable semantic closeness scores above 4.6 out of 5.0, while open-weight models cluster a full tier below (Cohen's d > 1.4). However, all models exhibit task-dependent confabulation: they perform well on standardized legislative templates (e.g., EU directive transpositions) but generate plausible yet unfounded reasoning for politically idiosyncratic proposals. We introduce the concept of cascading bounded rationality to describe how failures compound across bounded principals, agents, and evaluators, and argue that the operative risk for legislators is not stable ideological bias but contextual ignorance shaped by training data coverage.
