Prompt-based LLMs are increasingly used for stance detection, but harder examples are not always repaired by clearer instructions, reasoning prompts, retrieval, or debate. We introduce SICI (Stance Inference Complexity Index), a seven-dimensional diagnostic measure of the semantic-pragmatic burden imposed by a target--text pair. Across SemEval-2016 and VAST, SICI predicts LLM accuracy better than surface proxies and shows substantial cross-scorer reliability ($\alpha=0.771$). More importantly, LLM errors change regime as SICI increases: low-complexity examples invite over-attribution, especially Against predictions; intermediate examples form an unstable boundary; and predictions on high-complexity examples rapidly concentrate on None. This phase-transition-like structure persists across GPT-3.5, GPT-4o-mini, DeepSeek-V3, and GPT-4o, although stronger models shift the boundaries. A 15-method intervention study further shows that prompting, retrieval, and debate often shift models along the attribution--abstention axis rather than removing the high-complexity bottleneck.