In multi-user environments where data science and analysis are collaborative, multiple versions of the same dataset are generated. While managing and storing data versions has received some attention in the research literature, the semantic nature of the changes between versions has remained under-explored. In this work, we introduce \texttt{Explain-Da-V}, a framework that aims to explain changes between two given dataset versions. \texttt{Explain-Da-V} generates \emph{explanations} that use \emph{data transformations} to explain changes. We further introduce a set of measures that evaluate the validity, generalizability, and explainability of these explanations. We empirically show, using an adapted existing benchmark and a newly created benchmark, that \texttt{Explain-Da-V} generates better explanations than existing data transformation synthesis methods.