Imitation learning suffers from causal confusion. This phenomenon occurs when learned policies attend to features that do not causally influence the expert's actions but are merely spuriously correlated with them. Causally confused agents achieve low open-loop supervised loss but perform poorly in closed loop upon deployment. We consider the problem of masking observed confounders in a disentangled representation of the observation space. Our novel masking algorithm leverages the commonly available ability to intervene on the initial system state, and requires no expert queries, expert reward function, or causal graph specification. Under certain assumptions, we prove that this algorithm is conservative in the sense that it does not incorrectly mask observations that causally influence the expert; furthermore, intervening on the initial state strictly reduces excess conservatism. We apply the masking algorithm to behavior cloning for two illustrative control systems: CartPole and Reacher.
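As a rough illustration of what masking a confounder means for behavior cloning, the following minimal sketch fits a linear policy by least squares with and without a binary mask over a two-dimensional representation. Everything in it is an assumption made for illustration: the toy expert, the noise levels, the helper name `fit_masked_bc`, and above all the mask itself, which is chosen by hand here and is not computed by the masking algorithm described above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Toy data (hypothetical): the expert acts on a scalar state s.
s = rng.normal(size=n)
a = -2.0 * s                                  # expert action, caused by s only

# Disentangled observation: a noisy view of s and a confounder c that
# merely echoes the expert action (it does not cause it).
s_obs = s + 0.5 * rng.normal(size=n)
c = a + 0.1 * rng.normal(size=n)
Z = np.column_stack([s_obs, c])


def fit_masked_bc(Z, A, mask):
    """Least-squares linear policy that uses only features where mask is True."""
    mask = np.asarray(mask, dtype=bool)
    w = np.zeros(Z.shape[1])
    w[mask] = np.linalg.lstsq(Z[:, mask], A, rcond=None)[0]
    return w


def open_loop_mse(Z, A, w):
    """Supervised (open-loop) loss on the demonstration data."""
    return float(np.mean((Z @ w - A) ** 2))


w_full = fit_masked_bc(Z, a, [True, True])     # no masking
w_masked = fit_masked_bc(Z, a, [True, False])  # confounder masked by hand

mse_full = open_loop_mse(Z, a, w_full)
mse_masked = open_loop_mse(Z, a, w_masked)

print("weights without mask:", w_full, "open-loop MSE:", mse_full)
print("weights with mask:   ", w_masked, "open-loop MSE:", mse_masked)
```

In this toy setting the unmasked policy places most of its weight on the confounder and attains the lower open-loop loss, which is the symptom described above: a low supervised loss does not indicate that the policy relies on causally relevant features, since the confounder need not track the expert's action once the learned policy is deployed in closed loop.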