To obtain insights from event data, advanced process mining methods assess the similarity of activities to incorporate their semantic relations into the analysis. Here, distributional similarity, which captures similarity from activity co-occurrences, is commonly employed. However, existing work on distributional similarity in process mining adopts neural network-based approaches developed for natural language processing, e.g., word2vec and autoencoders. While these approaches have been shown to be effective, their downsides are high computational costs and limited interpretability of the learned representations. In this work, we argue for simplicity in the modeling of distributional similarity of activities. We introduce count-based embeddings that avoid a complex training process and offer a directly interpretable representation. To underpin our call for simple embeddings, we contribute a comprehensive benchmarking framework, which includes means to assess the intrinsic quality of embeddings, their performance in downstream applications, and their computational efficiency. In experiments that compare against the state of the art, we demonstrate that count-based embeddings provide a highly effective and efficient basis for distributional similarity between activities in event data.
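To illustrate what a count-based embedding of activities can look like, the following is a minimal sketch, not the exact construction evaluated in this work. It assumes a symmetric context window of size 1, raw co-occurrence counts without reweighting, and cosine similarity as the comparison measure; the function names `count_embeddings` and `cosine` and the toy event log are hypothetical and chosen for illustration only.

```python
from collections import Counter, defaultdict
from math import sqrt


def count_embeddings(traces, window=1):
    """Map each activity to a vector of context-activity counts.

    Dimension k of an activity's vector is the number of times the
    k-th vocabulary activity occurs within `window` positions of it.
    """
    vocab = sorted({act for trace in traces for act in trace})
    counts = defaultdict(Counter)
    for trace in traces:
        for i, act in enumerate(trace):
            lo = max(0, i - window)
            hi = min(len(trace), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[act][trace[j]] += 1
    embeddings = {a: [counts[a][c] for c in vocab] for a in vocab}
    return vocab, embeddings


def cosine(u, v):
    """Cosine similarity of two count vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = sqrt(sum(x * x for x in u))
    norm_v = sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy event log (invented): each trace is a sequence of activity labels.
traces = [
    ["register", "check", "approve", "pay"],
    ["register", "check", "reject", "notify"],
    ["register", "check", "approve", "pay"],
]

vocab, emb = count_embeddings(traces, window=1)
print(vocab)
print(emb["approve"], emb["reject"])
print(cosine(emb["approve"], emb["reject"]))
```

In this sketch, "approve" and "reject" receive a non-zero similarity because both directly follow "check", which is the kind of shared context that distributional similarity is meant to capture. Each vector dimension corresponds to a named context activity, so the representation can be read off directly, and no training step is involved.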