Behavior cloning (BC) optimizes policies by treating human demonstrations as pointwise action labels. While effective when the labels are accurate, this formulation is brittle in practice: when human-provided actions are imperfect, treating each label as an exact target can steer the policy away from the underlying desired behavior, particularly with expressive models such as energy-based models. To address this, we propose a human-in-the-loop alternative that replaces pointwise supervision with set-valued action targets. Our method, Contrastive policy Learning from Interactive Corrections (CLIC), uses human corrections to construct and refine sets of desired actions, and optimizes a policy to place probability mass over these sets rather than on a single action target. This formulation naturally accommodates both absolute and relative corrections and can represent complex multi-modal behaviors. Extensive simulation and real-robot experiments show that the approach learns effective policies across diverse settings: CLIC remains competitive with the state of the art under accurate data while being substantially more robust under noisy, relative, and partial feedback. Our implementation is publicly available at https://clic-webpage.github.io/.
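To make the contrast between pointwise and set-valued supervision concrete, the following is a minimal sketch and not the CLIC implementation. It assumes a discretized action space, a policy given by a softmax over negative energies, and a hypothetical half-space rule for turning a relative correction into a desired action set. All function and variable names are illustrative.

```python
import numpy as np

def softmax_from_energy(energy):
    """Convert energies to probabilities (lower energy -> higher probability)."""
    logits = -energy
    logits = logits - logits.max()  # shift for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def pointwise_bc_loss(energy, label_idx):
    """Standard BC-style loss: negative log-probability of one labeled action."""
    return -np.log(softmax_from_energy(energy)[label_idx])

def set_valued_loss(energy, in_set, eps=1e-12):
    """Negative log of the total probability mass inside the desired action set.
    eps guards against log(0) when the set receives no mass."""
    probs = softmax_from_energy(energy)
    return -np.log(probs[in_set].sum() + eps)

def desired_set_from_relative_correction(candidates, robot_action, direction):
    """Hypothetical rule (an assumption for illustration): keep candidate
    actions that move away from the robot's action along the correction."""
    return (candidates - robot_action) @ direction > 0

# Toy 1-D example with 21 candidate actions in [-1, 1].
candidates = np.linspace(-1.0, 1.0, 21).reshape(-1, 1)
robot_action = np.array([0.0])
direction = np.array([1.0])  # human signals "more to the right"

in_set = desired_set_from_relative_correction(candidates, robot_action, direction)
energy = np.random.default_rng(0).normal(size=len(candidates))

loss_set = set_valued_loss(energy, in_set)
loss_point = pointwise_bc_loss(energy, np.flatnonzero(in_set)[0])
print(loss_set, loss_point)
```

In this sketch the set-valued loss never exceeds the pointwise loss for any label inside the set, because the mass of the whole set is at least the mass of one of its members. The policy is therefore not penalized for preferring a different action within the desired set, which is the intuition behind tolerating imperfect labels.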