A Self-Consistency-Based Reranking Framework for Narrative Question Answering

Narrative question answering (NQA) is a challenging task in natural language processing that requires models to understand long textual contexts, capture relationships across events, and generate coherent responses. Despite recent advances in pretrained language models, most existing approaches rely on a single decoded output at inference time, which makes them sensitive to generation variability and often yields incomplete or inconsistent answers. To address this limitation, we propose a self-ensemble, self-consistency-based reranking framework for narrative question answering. The proposed method generates multiple candidate answers for each story-question pair and selects the final answer based on semantic agreement among the generated responses. This allows the model to explore diverse answer formulations while improving robustness through consensus-based selection, without requiring modifications to the underlying architecture. The framework combines pretrained and fine-tuned language generation with multi-answer inference and similarity-based reranking. We evaluate the proposed approach on the NarrativeQA dataset using multiple models, including FLAN-T5 (Base and Small) and Pegasus-Large, under both baseline and fine-tuned settings. Experimental results demonstrate that the proposed method consistently improves performance across all models. In particular, FLAN-T5-Base achieves the best overall performance, improving from 82.32% to 86.66% (+4.34 percentage points) when combined with self-ensemble inference. Additionally, the largest improvement is observed with Pegasus-Large, which increases from 72.50% to 87.07% (+14.57 percentage points), highlighting the effectiveness of the proposed strategy.
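The consensus-based selection step can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the similarity measure, so token-overlap F1 is assumed here as a simple stand-in for semantic agreement, and the function names `token_f1` and `rerank_by_consensus` as well as the candidate answers are hypothetical. In practice, the candidates would come from sampling a generation model several times for the same story-question pair.

```python
from collections import Counter


def token_f1(a: str, b: str) -> float:
    """Token-overlap F1 between two answers.

    Assumption: a cheap stand-in for semantic similarity; an embedding-based
    score could be substituted without changing the reranking logic.
    """
    ta, tb = a.lower().split(), b.lower().split()
    if not ta or not tb:
        return 0.0
    # Multiset intersection counts each shared token at most min(count_a, count_b) times.
    overlap = sum((Counter(ta) & Counter(tb)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(ta)
    recall = overlap / len(tb)
    return 2 * precision * recall / (precision + recall)


def rerank_by_consensus(candidates: list[str]) -> str:
    """Return the candidate with the highest mean similarity to the others."""
    if not candidates:
        raise ValueError("need at least one candidate answer")
    if len(candidates) == 1:
        return candidates[0]

    def agreement(i: int) -> float:
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(token_f1(candidates[i], o) for o in others) / len(others)

    # max() keeps the earliest candidate when agreement scores tie.
    best = max(range(len(candidates)), key=agreement)
    return candidates[best]


# Hypothetical sampled answers for one story-question pair.
candidates = [
    "he hides in the barn",
    "in the barn",
    "he hides in the old barn",
    "at the river",
]
final_answer = rerank_by_consensus(candidates)
print(final_answer)
```

With these hypothetical candidates, the outlier "at the river" shares almost nothing with the other answers and receives the lowest agreement score, while the answer closest to the group as a whole is selected. The underlying generator is untouched, which matches the claim that no architectural changes are needed.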
