In this paper we present techniques to incrementally harvest and query arbitrary metadata from machine learning pipelines without disrupting agile practices. We center our approach on log statements, the technique developers favor for generating metadata, and leverage the fact that logging creates context. We show how hindsight logging allows such statements to be added and executed post-hoc, without requiring developer foresight. Relational views of incomplete metadata can be queried to dynamically materialize new metadata in bulk and on demand across multiple versions of workflows. This is done in a "metadata later" style, off the critical path of agile development. We realize these ideas in a system called FlorDB and demonstrate how the data context framework covers both ad-hoc metadata and the special cases handled today by bespoke feature stores and model repositories. Through a usage scenario that includes both ML and human feedback, we illustrate how the component techniques come together to resolve classic software engineering trade-offs between agility and discipline.