High-quality video datasets are foundational for training robust models in tasks like action recognition, phase detection, and event segmentation. However, many real-world video datasets suffer from annotation errors such as *mislabeling*, where segments are assigned incorrect class labels, and *disordering*, where the temporal sequence does not follow the correct progression. These errors are particularly harmful in phase-annotated tasks, where temporal consistency is critical. We propose a novel, model-agnostic method for detecting annotation errors by analyzing the Cumulative Sample Loss (CSL), defined as the average loss a frame incurs when passed through model checkpoints saved across training epochs. This per-frame loss trajectory acts as a dynamic fingerprint of frame-level learnability. Mislabeled or disordered frames tend to show consistently high or irregular loss patterns, as they remain difficult for the model to learn throughout training, while correctly labeled frames typically converge to low loss early. To compute CSL, we train a video segmentation model and store its weights at each epoch. These checkpoints are then used to evaluate the loss of each frame in a test video. Frames with persistently high CSL are flagged as likely candidates for annotation errors, including mislabeling or temporal misalignment. Our method requires no ground-truth annotation-error labels and generalizes across datasets. Experiments on EgoPER and Cholec80 demonstrate strong detection performance, effectively identifying subtle inconsistencies such as mislabeling and frame disordering. The proposed approach provides a powerful tool for dataset auditing and for improving training reliability in video-based machine learning.
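The CSL pipeline described above (save a checkpoint per epoch, evaluate each frame's loss under every checkpoint, average, then threshold) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the per-epoch per-frame loss matrix is assumed to have already been collected from the saved checkpoints, and the threshold value is a hypothetical choice for the demo.

```python
import numpy as np

def cumulative_sample_loss(per_epoch_losses):
    """Compute CSL per frame.

    per_epoch_losses: array of shape (num_epochs, num_frames), where row e
    holds each frame's loss under the checkpoint saved at epoch e.
    Returns the average loss per frame across all checkpoints.
    """
    return np.mean(per_epoch_losses, axis=0)

def flag_suspect_frames(csl, threshold):
    """Frames whose CSL stays high across training are flagged as likely
    annotation errors (mislabeling or temporal misalignment)."""
    return np.where(csl > threshold)[0]

# Toy example: two correctly labeled frames whose loss decays over 10 epochs,
# and one mislabeled frame whose loss stays high throughout training.
epochs = np.arange(1, 11)
clean = 1.0 / epochs                 # converges to low loss early
noisy = np.full(10, 2.0)             # never learned well
losses = np.stack([clean, clean, noisy], axis=1)   # shape (10, 3)

csl = cumulative_sample_loss(losses)
suspects = flag_suspect_frames(csl, threshold=1.0)
print(suspects)  # → [2]: only the persistently high-loss frame is flagged
```

The averaging over checkpoints is what distinguishes CSL from a single end-of-training loss: a frame that is memorized late in training still accumulates high loss from early epochs, keeping its CSL elevated.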