Large language models (LLMs) have emerged as powerful machine-learning systems capable of handling a wide variety of tasks. Tuned versions of these systems have been turned into chatbots that respond to user queries on a broad range of topics with informative and creative replies. However, their application to physical science research remains limited, because their knowledge of these areas is incomplete while scientific work demands rigor and sourcing. Here, we demonstrate how existing methods and software tools can be easily combined to yield a domain-specific chatbot. The system ingests scientific documents in existing formats and uses text embedding lookup to provide the LLM with domain-specific contextual information when composing its reply. We similarly demonstrate that existing image embedding methods can be used for search and retrieval across publication figures. These results confirm that LLMs are already suitable for use by physical scientists in accelerating their research efforts.
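As an illustration of the text embedding lookup described above, the following minimal sketch ranks document chunks by similarity to a query and places the best matches into the prompt. It is not the system's actual implementation: the bag-of-words `embed` function is a toy stand-in for a learned embedding model, and the names `embed`, `cosine`, `retrieve`, and `build_prompt`, the prompt wording, and the example chunks are all hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding (assumption): bag-of-words counts, standing in for a learned model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query, best match first."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, chunks: list[str], k: int = 2) -> str:
    """Assemble a prompt that gives the LLM numbered excerpts to cite (hypothetical wording)."""
    excerpts = retrieve(query, chunks, k)
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(excerpts))
    return (
        "Answer using only the numbered excerpts and cite them.\n\n"
        f"Excerpts:\n{context}\n\nQuestion: {query}"
    )


# Placeholder chunks for illustration only; not taken from any real corpus.
EXAMPLE_CHUNKS = [
    "Small-angle x-ray scattering probes nanoscale structure in thin films.",
    "Block copolymers self-assemble into lamellar and cylindrical morphologies.",
    "The beamline schedule is published every quarter.",
]

print(build_prompt("How does x-ray scattering measure nanoscale structure?", EXAMPLE_CHUNKS, k=1))
```

In a real deployment, `embed` would be replaced by a learned text embedding model and the linear scan in `retrieve` by a vector index. The ranking step and the assembly of retrieved excerpts into the prompt would keep the same shape.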