Adversarial imitation learning (AIL) achieves high-quality imitation by mitigating compounding errors inherent to behavioral cloning (BC), yet its adversarial optimization frequently leads to training instability. A class of non-adversarial Q-based imitation learning (IL) methods, exemplified by IQ-Learn, has emerged to address this instability and is widely believed to outperform BC by leveraging online environment interactions. In this paper, we revisit IQ-Learn and prove that it in fact reduces to BC: it admits an imitation gap lower bound with quadratic dependence on the horizon and therefore remains susceptible to compounding errors. Our theoretical analysis reveals why online interactions fail to help: IQ-Learn uniformly suppresses Q-values for all actions at states not covered by demonstrations, preventing generalization beyond demonstrations. To address this fundamental limitation, we introduce Dual Q-DM, a new Q-based IL method built on Bellman constraints. Crucially, Bellman constraints drive value flow: Q-values propagate from demonstrated to unvisited states through environment dynamics, enabling generalization beyond demonstrations. We prove that Dual Q-DM is equivalent to AIL and can recover expert actions at unvisited states, thereby mitigating compounding errors. To the best of our knowledge, Dual Q-DM is the first non-adversarial IL method that is theoretically guaranteed to eliminate compounding errors. Experimental results further corroborate our theoretical findings.
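For orientation, the horizon dependence at stake can be sketched in standard notation. This is an illustrative summary rather than the paper's own theorem statement: it assumes an episodic task of horizon $H$ with rewards in $[0,1]$, writes $J(\pi)$ for expected return and $\pi_E$ for the expert, and uses $\epsilon$ as a generic per-step imitation error (its precise meaning differs between the two settings). All of these symbols are assumptions introduced here.

```latex
% Worst-case scaling of the imitation gap J(pi_E) - J(pi) with the horizon H.
% Left: the quadratic (compounding-error) regime associated with BC.
% Right: the linear regime associated with distribution-matching methods such as AIL.
\[
  J(\pi_E) - J(\pi_{\mathrm{BC}}) = O\!\left(\epsilon H^{2}\right)
  \qquad \text{versus} \qquad
  J(\pi_E) - J(\pi_{\mathrm{AIL}}) = O\!\left(\epsilon H\right)
\]
```

Read this way, the abstract's claim is that IQ-Learn falls in the left-hand regime despite its online interactions, while Dual Q-DM is argued to fall in the right-hand one.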