In recent years, advances in Machine Learning (ML) have driven a surge in unstructured data analysis and prompted diverse approaches for handling images, text documents, and videos. Using ML models, analysts can extract meaningful information from unstructured data and store it in relational databases, where SQL queries support further analysis. In parallel, vector databases have emerged that embed unstructured data to answer top-k queries over textual inputs efficiently. This paper introduces SSQL (Semantic SQL), a novel framework that combines these two approaches by allowing semantic queries to be embedded within SQL statements. Our approach extends SQL with dedicated keywords for specifying semantic queries alongside predicates on ML model results and metadata. Our experimental results show that semantic queries alone fail to answer count and spatial queries in more than 60% of cases. Our method jointly optimizes queries that contain both semantic predicates and predicates on structured tables, such as those generated by ML models or other metadata. Further, to improve query results, we incorporate human-in-the-loop feedback to determine the optimal similarity score threshold for returning results.
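To make the idea of mixing a semantic predicate with a structured predicate concrete, the following is a minimal Python sketch. It is not SSQL itself and does not reproduce the paper's keywords or optimizer: the function names (`cosine`, `hybrid_query`), the `car_count` column, the toy two-dimensional embeddings, and the fixed threshold of 0.7 are all assumptions made for illustration. The threshold here is simply a constant standing in for the value that the abstract says is tuned through human-in-the-loop feedback.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_query(rows, query_vec, predicate, threshold):
    """Return ids of rows that satisfy both predicates, best match first.

    The structured predicate is checked first so that similarity is only
    computed for rows that already pass the cheap filter (an illustrative
    ordering, not necessarily the plan an optimizer would choose).
    """
    hits = []
    for row in rows:
        if not predicate(row):
            continue
        score = cosine(row["embedding"], query_vec)
        if score >= threshold:
            hits.append((score, row["id"]))
    # Sort by descending similarity, breaking ties by id for determinism.
    hits.sort(key=lambda h: (-h[0], h[1]))
    return [row_id for _, row_id in hits]


# Hypothetical table: one row per image, with an embedding and a count
# produced by some object-detection model.
ROWS = [
    {"id": 1, "embedding": [1.0, 0.0], "car_count": 3},
    {"id": 2, "embedding": [0.9, 0.1], "car_count": 1},
    {"id": 3, "embedding": [0.0, 1.0], "car_count": 4},
    {"id": 4, "embedding": [0.8, 0.6], "car_count": 2},
]

# "Images similar to the query AND containing at least two cars."
result = hybrid_query(ROWS, [1.0, 0.0], lambda r: r["car_count"] >= 2, 0.7)
print(result)
```

In this sketch the count condition is answered exactly from the structured column, while the embedding similarity only handles the semantic part, which mirrors the division of labor the abstract motivates.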