Information retrieval is a rapidly evolving field. However, it still faces significant limitations when dealing with the vast amounts of scientific and industrial information: semantic divergence and vocabulary gaps in sparse retrieval, low precision and lack of interpretability in semantic search, and hallucination and outdated information in generative models. In this paper, we introduce a two-block approach to tackle these hurdles for long documents. The first block improves language understanding in sparse retrieval by expanding the query so that relevant documents are retrieved. The second block deepens the result by answering the complex question comprehensively and informatively, using only the information spread across the long document, which enables bidirectional engagement. At various stages of the pipeline, intermediate results are presented to users to help them understand the system's reasoning. We believe this bidirectional approach brings significant advances in transparency, logical thinking, and comprehensive understanding to the field of scientific information retrieval.
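To make the first block concrete, the following is a minimal sketch of how query expansion can close a vocabulary gap in sparse retrieval. It is an illustration under stated assumptions, not the method of the paper: the abstract does not say how expansions are generated, so the `EXPANSIONS` table, the function names, and the toy documents are all hypothetical, and Okapi BM25 is used here only as a familiar stand-in for the sparse retriever.

```python
import math
from collections import Counter

# Hypothetical expansion table. The abstract does not specify how expansion
# terms are produced; a real system might derive them from a language model.
EXPANSIONS = {
    "tumor": ["neoplasm"],
}


def expand_query(terms, expansions=EXPANSIONS):
    """Return the original terms followed by their expansions, without duplicates."""
    expanded = list(terms)
    for term in terms:
        for extra in expansions.get(term, []):
            if extra not in expanded:
                expanded.append(extra)
    return expanded


def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue  # term absent from this document
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy corpus: the first document says "neoplasm" where the query says "tumor".
docs = [
    "neoplasm growth in liver tissue".split(),
    "bread recipe with whole wheat".split(),
]
query = ["tumor", "growth"]

plain = bm25_scores(query, docs)
expanded = bm25_scores(expand_query(query), docs)
```

In this toy setup the unexpanded query matches the first document only through "growth", whereas the expanded query also matches "neoplasm", so the relevant document receives a higher score while the unrelated one still scores zero.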