Ontology-constrained multi-LLM scoring of hypothesis support in the predictive processing literature

Fragmentation is common in interdisciplinary fields with diverse methods and theoretical commitments. Predictive coding neuroscience is a clear example: its literature spans computational theory, electrophysiology, imaging, behavior, and modeling, creating a synthesis problem that conventional meta-analysis cannot easily resolve. Here, we describe a local multi-LLM pipeline for ontology-constrained literature synthesis. The pipeline reads papers, extracts evidence, incorporates figure descriptions, assembles constrained prompts, and validates outputs against an expert glossary. We manually defined a predictive-coding glossary of 36 concepts grouped into three hypotheses: predictive suppression, feedforward error propagation, and ubiquity. A council of ten local language models scored 31 studies for agreement or disagreement with each glossary concept in local and global oddball contexts. These scores enabled pairwise study-agreement analysis, cross-model comparison, and three-dimensional hypothesis-space mapping. Agreement was high for some hypotheses but weaker for others, revealing structured disagreement, particularly between local and global oddball paradigms. We further define hypothesis-space temperature, a geometric dispersion metric that measures how compactly studies occupy the hypothesis space. Temperature was lower for local oddball contexts and higher for global oddball contexts, indicating greater dispersion in the latter. The scoring geometry also allowed us to estimate vectors of change between experimental contexts. These results demonstrate that local multi-LLM councils can produce auditable disagreement measurements that map heterogeneous literatures into quantitative evidence spaces. This framework may generalize to cross-study hypothesis mapping where conventional meta-analysis lacks a common comparison space.
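To make the geometric quantities concrete, the following is a minimal sketch of how a dispersion metric and between-context change vectors could be computed. It is not the formula used in this work: the abstract does not give the definition of hypothesis-space temperature, so the sketch assumes it to be the mean squared Euclidean distance of studies from their centroid. The function names `temperature` and `context_shift`, the score range, and the placeholder arrays are all hypothetical and do not come from the study.

```python
import numpy as np


def temperature(points):
    """Dispersion of studies in hypothesis space.

    ASSUMPTION: defined here as the mean squared Euclidean distance of
    each study from the centroid; the source does not state its formula.
    `points` has shape (n_studies, 3), one column per hypothesis.
    """
    pts = np.asarray(points, dtype=float)
    centroid = pts.mean(axis=0)
    return float(np.mean(np.sum((pts - centroid) ** 2, axis=1)))


def context_shift(local_points, global_points):
    """Per-study change vectors (global minus local) and their mean.

    ASSUMPTION: rows of both arrays refer to the same studies in the
    same order, so the difference is a per-study displacement.
    """
    delta = np.asarray(global_points, dtype=float) - np.asarray(local_points, dtype=float)
    return delta, delta.mean(axis=0)


if __name__ == "__main__":
    # Placeholder scores, NOT data from the study: three studies scored on
    # (predictive suppression, feedforward error propagation, ubiquity).
    local_scores = np.array([[0.8, 0.6, 0.7],
                             [0.9, 0.5, 0.6],
                             [0.7, 0.7, 0.8]])
    global_scores = np.array([[0.2, 0.9, -0.4],
                              [0.9, -0.3, 0.6],
                              [-0.5, 0.4, 0.8]])

    print("local temperature: ", temperature(local_scores))
    print("global temperature:", temperature(global_scores))

    _, mean_shift = context_shift(local_scores, global_scores)
    print("mean local-to-global shift:", mean_shift)
```

Under this assumed definition, a temperature of zero means all studies coincide at a single point in hypothesis space, and larger values indicate a more scattered literature. Other dispersion measures, such as mean pairwise distance or the trace of the covariance matrix, would serve the same illustrative purpose.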
