We present a benchmark targeting a novel class of systems: semantic query processing engines. These systems rely inherently on the generative and reasoning capabilities of state-of-the-art large language models (LLMs). They extend SQL with semantic operators, configured by natural language instructions, that are evaluated via LLMs and enable users to perform various operations on multimodal data. Our benchmark introduces diversity across three key dimensions: scenarios, modalities, and operators. The scenarios range from movie review analysis to medical question-answering. Within these scenarios, we cover different data modalities, including images, audio, and text. Finally, the queries involve a diverse set of operators, including semantic filters, joins, mappings, ranking, and classification operators. We evaluated three academic systems (LOTUS, Palimpzest, and ThalamusDB) and one industrial system, Google BigQuery, on our benchmark. Although these results reflect a snapshot of systems under continuous development, our study offers insights into their current strengths and weaknesses and points to promising directions for future research.