Real-world information sources are inherently multilingual, which naturally raises the question of whether language models can synthesize information across languages. In this paper, we introduce a simple two-hop question answering setting in which answering a question requires making inferences over two multilingual documents. We find that language models are more sensitive to language variation in documents containing the answer span than in those providing bridging information, even though both documents are equally important for answering the question. Under a step-by-step sub-question evaluation, we further show that in up to 33% of multilingual cases, models fail to infer the bridging information in the first step yet still answer the overall question correctly. This indicates that reasoning in language models, especially in multilingual settings, does not follow a faithful step-by-step decomposition. We then show that this lack of faithful decomposition leads to composition failures in around 18% of cases, where both sub-questions are answered correctly but the final two-hop question is not. To mitigate this, we propose a simple three-stage SUBQ prompting method that guides multi-step reasoning with sub-questions, boosting accuracy from 10.1% to 66.5%.
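The three-stage prompting flow described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names (`subq_prompt`, `toy_ask`), the prompt templates, and the toy lookup model are all hypothetical stand-ins, assuming the three stages are (1) answer the first-hop sub-question to obtain the bridging entity, (2) answer the second-hop sub-question using that bridge, and (3) compose the final answer.

```python
def subq_prompt(question, docs, ask):
    """Three-stage sub-question prompting sketch (hypothetical templates).

    `question` holds the two-hop question text plus its two sub-questions;
    `docs` are the two supporting documents; `ask` is a stand-in for a
    model call mapping a prompt string to an answer string.
    """
    # Stage 1: infer the bridging entity from the first document.
    bridge = ask(f"Context: {docs[0]}\n"
                 f"Sub-question 1: {question['sub_q1']}\nAnswer:")
    # Stage 2: answer the second hop, substituting the bridging entity.
    hop2 = ask(f"Context: {docs[1]}\n"
               f"Sub-question 2: {question['sub_q2'].format(bridge=bridge)}\n"
               f"Answer:")
    # Stage 3: compose the final answer to the original two-hop question.
    return ask(f"Q1: {question['sub_q1']} -> {bridge}\n"
               f"Q2: {question['sub_q2'].format(bridge=bridge)} -> {hop2}\n"
               f"Two-hop question: {question['text']}\nFinal answer:")


def toy_ask(prompt):
    # Trivial stand-in for a language model: keyword-based lookup.
    if "Sub-question 1" in prompt:
        return "Goethe"
    return "Frankfurt"


question = {
    "text": "Where was the author of Faust born?",
    "sub_q1": "Who wrote Faust?",
    "sub_q2": "Where was {bridge} born?",
}
docs = ["Faust was written by Goethe.", "Goethe was born in Frankfurt."]
answer = subq_prompt(question, docs, toy_ask)  # -> "Frankfurt"
```

Forcing the model through explicit sub-question prompts makes the bridging step visible, which is exactly what the unfaithful end-to-end reasoning observed in the abstract lacks.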