Information retrieval is a rapidly evolving field. However, it still faces significant limitations when handling the vast amounts of scientific and industrial information, such as semantic divergence and vocabulary gaps in sparse retrieval, low precision and lack of interpretability in semantic search, and hallucination and outdated information in generative models. In this paper, we introduce a two-block approach to tackle these hurdles for long documents. The first block enhances language understanding in sparse retrieval through query expansion in order to retrieve relevant documents. The second block deepens the result by providing comprehensive and informative answers to complex questions using only the information spread across the long document, enabling bidirectional engagement. At various stages of the pipeline, intermediate results are presented to users to facilitate understanding of the system's reasoning. We believe this bidirectional approach brings significant advancements in transparency, logical thinking, and comprehensive understanding in the field of scientific information retrieval.