The recent advent of powerful large language models (LLMs) provides a new conversational form of inquiry into historical memory (or, in this case, training data). We show that by augmenting such LLMs with vector embeddings from highly specialized academic sources, a conversational methodology can be made accessible to historians and other researchers in the humanities. Concretely, we evaluate and demonstrate how LLMs can assist researchers as they examine customized corpora of different types of documents, including, but not limited to: (1) primary sources, (2) secondary sources written by experts, and (3) a combination of the two. We compare the richer conversational style of LLMs against established search interfaces for digital catalogues, such as metadata and full-text search, on two main types of tasks: (1) question answering, and (2) extraction and organization of data. We demonstrate that LLMs' semantic retrieval and reasoning abilities on problem-specific tasks can be applied to large textual archives that were not part of their training data. LLMs can therefore be augmented with sources relevant to specific research projects and queried privately by researchers.
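The augmentation described above, retrieving passages from a private corpus by embedding similarity and handing them to an LLM as context, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not specify an embedding model, similarity measure, or prompt format, so a bag-of-words count vector stands in for a learned embedding, cosine similarity is one common choice of ranking measure, and the three-sentence corpus is invented. The names `embed`, `cosine`, `retrieve`, and `build_prompt` are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a learned embedding: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, passages, k=2):
    """Return the k passages most similar to the question."""
    query = embed(question)
    ranked = sorted(passages, key=lambda p: cosine(query, embed(p)), reverse=True)
    return ranked[:k]


def build_prompt(question, passages, k=2):
    """Assemble retrieved passages and the question into one prompt string."""
    context = "\n".join(f"- {p}" for p in retrieve(question, passages, k))
    return f"Answer using only these sources:\n{context}\n\nQuestion: {question}"


# Invented example corpus (not from the paper), standing in for archive passages.
corpus = [
    "The harbour ledger records grain shipments arriving in spring.",
    "A letter from the governor complains about unpaid taxes.",
    "The parish register lists baptisms and burials.",
]

if __name__ == "__main__":
    print(build_prompt("Which source mentions grain shipments?", corpus, k=1))
```

In a real setting the toy `embed` function would be replaced by a trained embedding model and the resulting prompt would be sent to an LLM; because retrieval runs over a locally held corpus, the sources need not have been part of the model's training data.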