Learning to Trust the Crowd: A Multi-Model Consensus Reasoning Engine for Large Language Models

Large language models (LLMs) achieve strong average performance yet remain unreliable at the instance level, with frequent hallucinations, brittle failures, and poorly calibrated confidence. We study reliability through the lens of multi-model consensus: given responses from several heterogeneous LLMs, can we learn which answer is most likely correct for a given query? We introduce a Multi-Model Consensus Reasoning Engine that treats the set of LLM outputs as input to a supervised meta-learner. The system maps natural language responses into structured features using semantic embeddings, pairwise similarity and clustering statistics, lexical and structural cues, reasoning-quality scores, confidence estimates, and model-specific priors, and then applies gradient-boosted trees, listwise ranking, and graph neural networks over similarity graphs of answers. Using three open-weight LLMs evaluated on compact, resource-constrained subsets of GSM8K, ARC-Challenge, HellaSwag, and TruthfulQA, our best graph-attention-based consensus model improves macro-average accuracy by 4.6 percentage points over the strongest single LLM and by 8.1 points over majority vote, while also yielding lower Brier scores and fewer TruthfulQA hallucinations. Ablation and feature-importance analyses show that semantic agreement and clustering features are most influential, with reasoning-quality and model-prior features providing complementary gains, suggesting supervised multi-model consensus is a practical route toward more reliable LLM behavior, even in a modest single-machine setup.
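
To make the quantities named above concrete, the following is a minimal sketch, not the authors' implementation. It computes two pairwise-agreement features per answer, the majority-vote baseline, and the Brier score. All function names are hypothetical. Token-overlap (Jaccard) similarity is an assumed stand-in for the semantic-embedding similarity described in the abstract, and the 0.5 threshold is an arbitrary illustrative choice.

```python
from collections import Counter


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity (assumed stand-in for embedding cosine similarity)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def agreement_features(answers: list[str], threshold: float = 0.5) -> list[dict]:
    """Per answer: mean similarity to the other answers, and the fraction of
    other answers at or above `threshold` (a crude cluster-support proxy)."""
    feats = []
    for i, a in enumerate(answers):
        sims = [jaccard(a, b) for j, b in enumerate(answers) if j != i]
        n = len(sims)
        feats.append({
            "mean_sim": sum(sims) / n if n else 0.0,
            "support": sum(s >= threshold for s in sims) / n if n else 0.0,
        })
    return feats


def majority_vote(answers: list[str]) -> str:
    """Baseline: the most common answer after whitespace/case normalization."""
    counts = Counter(a.strip().lower() for a in answers)
    return counts.most_common(1)[0][0]


def brier_score(probs: list[float], outcomes: list[int]) -> float:
    """Mean squared error between predicted correctness probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


if __name__ == "__main__":
    answers = ["the answer is 18", "the answer is 18", "it is 20"]
    print(agreement_features(answers))
    print(majority_vote(answers))
    print(brier_score([0.9, 0.2], [1, 0]))
```

In a supervised setup like the one described, feature vectors of this kind (one per candidate answer) would be fed to a meta-learner trained on labeled correctness, rather than used directly to pick an answer.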
