Modern machine learning systems are increasingly realised as multi-stage pipelines, yet existing transparency mechanisms typically operate at the model level: they describe what a system is and why it behaves as it does, but not how individual data samples are operationally recorded, tracked, and verified as they traverse the pipeline. This absence of verifiable, sample-level traceability leaves practitioners and users unable to determine whether a specific sample was used, when it was processed, or whether the corresponding records remain intact over time. We introduce FG-Trac, a model-agnostic framework that establishes verifiable, fine-grained sample-level traceability throughout machine learning pipelines. FG-Trac defines an explicit mechanism for capturing and verifying sample lifecycle events across preprocessing and training, computes contribution scores explicitly grounded in training checkpoints, and anchors these traces to tamper-evident cryptographic commitments. The framework integrates without modifying model architectures or training objectives, reconstructing complete and auditable data-usage histories with practical computational overhead. Experiments on a canonical convolutional neural network and a multimodal graph learning pipeline demonstrate that FG-Trac preserves predictive performance while enabling machine learning systems to furnish verifiable evidence of how individual samples were used and propagated during model execution.
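To make the notion of "anchoring traces to tamper-evident cryptographic commitments" concrete, the following is a minimal sketch of one standard way to realise it: hash-chaining a sample's lifecycle events so that an auditor can reproduce the final digest from the raw records, and any altered or missing record is detected. The event fields, sample and checkpoint identifiers are hypothetical illustrations, not FG-Trac's actual schema.

```python
import hashlib
import json

def commit(prev_digest: str, event: dict) -> str:
    """Chain one lifecycle event onto the previous commitment (SHA-256)."""
    payload = json.dumps(event, sort_keys=True).encode()
    return hashlib.sha256(prev_digest.encode() + payload).hexdigest()

# Hypothetical lifecycle events for one sample traversing a pipeline.
events = [
    {"sample_id": "s42", "stage": "preprocess", "step": 0},
    {"sample_id": "s42", "stage": "train", "epoch": 1, "checkpoint": "ckpt-001"},
    {"sample_id": "s42", "stage": "train", "epoch": 2, "checkpoint": "ckpt-002"},
]

digest = "0" * 64  # genesis commitment
for e in events:
    digest = commit(digest, e)

# An auditor replaying the same records reproduces the same digest.
replayed = "0" * 64
for e in events:
    replayed = commit(replayed, e)
assert replayed == digest

# Tampering with any single record breaks verification.
tampered = "0" * 64
for e in events:
    bad = dict(e, sample_id="s99") if e.get("epoch") == 1 else e
    tampered = commit(tampered, bad)
assert tampered != digest
```

Because each commitment covers the previous digest, the chain fixes both the content and the order of events; verifying the final digest suffices to verify the entire recorded history of the sample.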