Predicting narrative similarity can be understood as an inherently interpretive task: different, equally valid readings of the same text can lead to different similarity judgments, which poses a fundamental challenge for semantic evaluation benchmarks that encode a single ground truth. Rather than treating this multiperspectivity as a problem to overcome, we propose incorporating it into the decision-making process of predictive systems. To explore this strategy, we created an ensemble of 31 LLM personas, ranging from practitioners who follow interpretive frameworks to more intuitive, lay-style characters. In experiments on the SemEval-2026 Task 4 dataset, the system achieved an accuracy of 0.705. Accuracy improves with ensemble size, consistent with Condorcet Jury Theorem-like dynamics under weakened independence. Practitioner personas perform worse individually but produce less correlated errors, yielding larger ensemble gains under majority voting. Our error analysis reveals a consistent negative association between gender-focused interpretive vocabulary and accuracy across all persona categories, suggesting either attention to dimensions that are not relevant to the benchmark or valid interpretations that are absent from the ground truth. This finding underscores the need for evaluation frameworks that account for interpretive plurality.
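The link between majority voting and ensemble size can be illustrated with a minimal sketch of the classical Condorcet Jury Theorem. It is not the system described above: the function names, the "A"/"B" labels, and the per-voter accuracy of 0.6 are hypothetical, and the formula assumes fully independent voters, whereas the abstract describes weakened independence, so real ensemble gains will differ from this idealized curve.

```python
from collections import Counter
from math import comb


def majority_vote(votes):
    """Return the most frequent label among the given votes.

    Hypothetical helper: each vote stands for one persona's prediction.
    With an odd number of voters and two labels, no tie can occur.
    """
    return Counter(votes).most_common(1)[0][0]


def condorcet_majority_accuracy(p, n):
    """Probability that a strict majority of n voters is correct.

    Idealized assumption: voters are independent and each is correct
    with the same probability p. n must be odd to rule out ties.
    """
    if n % 2 == 0:
        raise ValueError("n must be odd so that a strict majority exists")
    k_min = n // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(k_min, n + 1)
    )


if __name__ == "__main__":
    # Illustrative per-voter accuracy only; not a value from the paper.
    for n in (1, 3, 11, 31):
        print(n, round(condorcet_majority_accuracy(0.6, n), 3))
    print(majority_vote(["A", "B", "A"]))
```

Under these assumptions, majority accuracy rises with n whenever p exceeds 0.5. Correlated errors between voters weaken this effect, which fits the observation that personas with less correlated errors yield larger ensemble gains.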