To benefit from the abundance of data and the insights it brings, data processing pipelines are being used in many areas of research and development in both industry and academia. One approach to automating data processing pipelines is workflow technology, which also supports collaborative, trial-and-error experimentation with the pipeline architecture in different application domains. Such pipelines need to be flexible, and in collaborative settings cross-organisational interactions are additionally plagued by a lack of trust. While capturing provenance information about the pipeline execution and the processed data is a first step towards enabling trusted collaborations, current solutions do not support provenance of change in the processing pipelines, where a change can affect any aspect of the workflow implementing the pipeline as well as the data used while the pipeline is being executed. Therefore, in this work we provide a solution architecture and a proof-of-concept implementation of a service, called Provenance Holder, which enables provenance of collaborative, adaptive data processing pipelines in a trusted manner. We also contribute a definition of a set of properties of such a service and identify future research directions.