Academic literature reviews have traditionally relied on techniques such as keyword searches and the accumulation of relevant back-references, using databases like Google Scholar or IEEEXplore. However, both the precision and accuracy of these search techniques are limited by the presence or absence of specific keywords, making literature review akin to searching for needles in a haystack. We present vitaLITy 2, a solution that uses a Large Language Model (LLM)-based approach to identify semantically relevant literature in a textual embedding space. We include a corpus of 66,692 papers from 1970-2023, searchable through text embeddings created by three language models. vitaLITy 2 contributes a novel Retrieval-Augmented Generation (RAG) architecture and can be interacted with through an LLM with augmented prompts, including summarization of a collection of papers. vitaLITy 2 also provides a chat interface that allows users to perform complex queries without learning any new programming language, which further enables users to take advantage of the knowledge captured in the LLM from its enormous training corpus. Finally, we demonstrate the applicability of vitaLITy 2 through two usage scenarios. vitaLITy 2 is available as open-source software at https://vitality-vis.github.io.
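The two core ideas in the abstract, semantic retrieval in an embedding space followed by prompt augmentation (RAG), can be illustrated with a minimal sketch. This is not the paper's implementation: the embedding vectors, function names, and prompt wording below are illustrative assumptions, and a real system would obtain embeddings from a language model rather than toy arrays.

```python
import numpy as np

def top_k_semantic_matches(query_vec, paper_vecs, k=3):
    """Rank papers by cosine similarity between the query embedding
    and each paper's embedding, returning the indices of the top k."""
    q = query_vec / np.linalg.norm(query_vec)
    P = paper_vecs / np.linalg.norm(paper_vecs, axis=1, keepdims=True)
    sims = P @ q  # cosine similarities, one per paper
    return np.argsort(-sims)[:k]

def build_rag_prompt(question, abstracts):
    """Augment an LLM prompt with the retrieved abstracts, so the
    model answers grounded in the selected papers (RAG)."""
    context = "\n\n".join(f"[{i + 1}] {a}" for i, a in enumerate(abstracts))
    return f"Using only the papers below, answer: {question}\n\n{context}"
```

Retrieval by embedding similarity is what lets such a system surface papers that are semantically related to a query even when they share no keywords with it, which is the limitation of keyword search the abstract describes.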