Porting a scientific data analysis workflow (DAW) to a different cluster infrastructure, a new software stack, or even just a new dataset with notably different properties is often challenging. Although a DAW specification defines the steps (tasks) of a complex data analysis and their interdependencies in a structured way, relevant assumptions may remain unspecified and implicit. Such hidden assumptions often cause tasks to crash without a meaningful error message, poor overall performance, non-terminating executions, or silently wrong results, to name only a few possible consequences. Searching for the causes of such errors and drawbacks can be tedious and time-consuming, because DAWs for large datasets are typically executed on distributed compute clusters managed by a complex infrastructure stack. We propose validity constraints (VCs) as a new concept for DAW languages to alleviate this situation. A VC specifies logical conditions that must be fulfilled at certain times for a DAW execution to be valid. When defined together with a DAW, VCs help improve the portability, adaptability, and reusability of DAWs by making implicit assumptions explicit. Once specified, VCs can be checked automatically by the DAW infrastructure, and violations can lead to meaningful error messages and graceful behaviour (e.g., termination or invocation of repair mechanisms). We provide a broad list of possible VCs, classify them along multiple dimensions, and compare them to similar concepts in related fields. We also provide a first sketch of how VCs can be implemented in existing DAW infrastructures.
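To make the idea of a VC more concrete, the following is a minimal Python sketch of how a constraint might be declared next to a task and checked by the surrounding infrastructure. It is an illustration under stated assumptions, not the implementation proposed in the paper: the names `ValidityConstraint`, `ValidityConstraintViolation`, `check_constraints`, `run_task`, and the phase labels `"before_task"` and `"after_task"` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


class ValidityConstraintViolation(Exception):
    """Raised when a validity constraint does not hold (hypothetical name)."""


@dataclass(frozen=True)
class ValidityConstraint:
    name: str                                   # short identifier used in error messages
    when: str                                   # assumed phase label: "before_task" or "after_task"
    condition: Callable[[Dict[str, Any]], bool] # logical condition over the execution context
    message: str                                # human-readable explanation of the assumption


def check_constraints(constraints: List[ValidityConstraint],
                      when: str,
                      context: Dict[str, Any]) -> None:
    """Check every constraint registered for phase `when`; raise on the first violation."""
    for vc in constraints:
        if vc.when == when and not vc.condition(context):
            raise ValidityConstraintViolation(
                f"VC '{vc.name}' violated ({when}): {vc.message}"
            )


def run_task(task: Callable[[Dict[str, Any]], Any],
             context: Dict[str, Any],
             constraints: List[ValidityConstraint]) -> Any:
    """Run one task, checking constraints before and after it."""
    check_constraints(constraints, "before_task", context)
    result = task(context)
    # The task result is exposed to post-conditions under the key "result".
    check_constraints(constraints, "after_task", {**context, "result": result})
    return result


def mean_task(context: Dict[str, Any]) -> float:
    """Toy task: arithmetic mean of the input values."""
    values = context["values"]
    return sum(values) / len(values)


# Two illustrative constraints that make implicit assumptions of mean_task explicit.
constraints = [
    ValidityConstraint(
        name="non_empty_input",
        when="before_task",
        condition=lambda ctx: len(ctx["values"]) > 0,
        message="input dataset must contain at least one record",
    ),
    ValidityConstraint(
        name="mean_in_unit_interval",
        when="after_task",
        condition=lambda ctx: 0.0 <= ctx["result"] <= 1.0,
        message="mean of normalised values must lie in [0, 1]",
    ),
]
```

With these constraints in place, calling `run_task(mean_task, {"values": []}, constraints)` raises a `ValidityConstraintViolation` that names the violated assumption, instead of the bare `ZeroDivisionError` the task would otherwise produce. A real infrastructure could react to a violation differently, for example by invoking a repair mechanism rather than terminating.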