Comparing how Large Language Models perform against keyword-based searches for social science research data discovery

This paper evaluates the performance of a large language model (LLM)-based semantic search tool relative to a traditional keyword-based search for data discovery. Using real-world search behaviour, we compare outputs from a bespoke semantic search system applied to UKRI data services with the Consumer Data Research Centre (CDRC) keyword search. The analysis is based on 131 of the most frequently used search terms extracted from CDRC search logs between December 2023 and October 2024. We assess differences in the volume, overlap, ranking, and relevance of returned datasets using descriptive statistics, qualitative inspection, and quantitative similarity measures, including exact dataset overlap, Jaccard similarity, and cosine similarity derived from BERT embeddings. Results show that the semantic search consistently returns more results than the keyword search and performs particularly well for place-based, misspelled, obscure, or complex queries. While the semantic search does not capture all keyword-based results, the datasets it returns are overwhelmingly semantically similar to them, with high cosine similarity scores despite lower exact overlap. Rankings of the most relevant results differ substantially between the two tools, reflecting contrasting prioritisation strategies. Case studies demonstrate that the LLM-based tool is robust to spelling errors, interprets geographic and contextual relevance effectively, and supports natural-language queries that keyword search fails to resolve. Overall, the findings suggest that LLM-driven semantic search offers a substantial improvement for data discovery, complementing rather than fully replacing traditional keyword-based approaches.
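
To make the three similarity measures concrete, the following is a minimal sketch of how exact overlap, Jaccard similarity, and cosine similarity could be computed for a single query. It is illustrative only and is not the paper's implementation: the function names and dataset identifiers are hypothetical, and the short numeric vectors stand in for BERT embeddings, which would in practice come from an embedding model.

```python
import math


def exact_overlap(keyword_results, semantic_results):
    """Count dataset identifiers returned by both tools."""
    return len(set(keyword_results) & set(semantic_results))


def jaccard_similarity(keyword_results, semantic_results):
    """Size of the intersection divided by size of the union of the two result sets."""
    a, b = set(keyword_results), set(semantic_results)
    union = a | b
    if not union:
        return 0.0  # both tools returned nothing
    return len(a & b) / len(union)


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # undefined for zero vectors; treated as no similarity here
    return dot / (norm_u * norm_v)


# Hypothetical result lists for one query (identifiers are made up).
keyword_results = ["ds_a", "ds_b", "ds_c"]
semantic_results = ["ds_b", "ds_c", "ds_d", "ds_e"]

overlap = exact_overlap(keyword_results, semantic_results)
jaccard = jaccard_similarity(keyword_results, semantic_results)

# Toy vectors standing in for BERT embeddings of two dataset descriptions.
similarity = cosine_similarity([1.0, 2.0], [2.0, 4.0])
```

A pattern like the one reported above, high cosine similarity alongside low exact overlap, arises when the two tools return different datasets whose descriptions nonetheless lie close together in embedding space.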