We present a benchmark targeting a novel class of systems: semantic query processing engines. These systems inherently rely on the generative and reasoning capabilities of state-of-the-art large language models (LLMs). They extend SQL with semantic operators, configured by natural language instructions, that are evaluated via LLMs and enable users to perform various operations on multimodal data. Our benchmark introduces diversity across three key dimensions: scenarios, modalities, and operators. The scenarios range from movie review analysis to car damage detection. Within these scenarios, we cover different data modalities, including images, audio, and text. Finally, the queries involve a diverse set of operators, including semantic filters, joins, mappings, ranking, and classification operators. We evaluate four systems on our benchmark: three academic systems (LOTUS, Palimpzest, and ThalamusDB) and one industrial system (Google BigQuery). Although these results reflect a snapshot of systems under continuous development, our study offers insights into their current strengths and weaknesses and points to promising directions for future research.