Imitation learning suffers from causal confusion. This phenomenon occurs when learned policies attend to features that do not causally influence the expert actions but are instead spuriously correlated with them. Causally confused agents produce low open-loop supervised loss but poor closed-loop performance upon deployment. We consider the problem of masking observed confounders in a disentangled representation of the observation space. Our novel masking algorithm leverages the usual ability to intervene in the initial system state, avoiding any requirement involving expert querying, expert reward functions, or causal graph specification. Under certain assumptions, we theoretically prove that this algorithm is conservative in the sense that it does not incorrectly mask observations that causally influence the expert; furthermore, intervening on the initial state serves to strictly reduce excess conservatism. The masking algorithm is applied to behavior cloning for two illustrative control systems: CartPole and Reacher.