We consider the problem of learning-based man-in-the-middle (MITM) attacks in cyber-physical systems (CPS), and extend our previously proposed Bellman Deviation Detection (BDD) framework for model-free reinforcement learning (RL). We refine the standard MDP attack model by allowing the reward function to depend on both the current and subsequent states, thereby capturing reward variations induced by errors in the adversary's transition estimate. We also derive an optimal system-identification strategy for the adversary that minimizes detectable value deviations. Further, we prove that the asymptotic learning time the agent requires to secure the system scales linearly with the adversary's learning time, and that this scaling matches the optimal lower bound. Hence, the proposed detection scheme is order-optimal in detection efficiency. Finally, we extend the framework to asynchronous and intermittent attack scenarios and show that reliable detection is preserved.
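To make the refined reward model concrete, the following toy sketch (not the BDD detector itself) shows why a reward of the form r(s, a, s') exposes errors in a transition estimate, while a reward that ignores the next state does not. The two-state model, the probabilities, the reward values, and the helper name `expected_reward` are all hypothetical and chosen only for illustration.

```python
# Illustrative toy example; every number and name below is an assumption,
# not taken from the paper.

def expected_reward(p_row, r_row):
    """Expected one-step reward: sum over s' of P(s'|s,a) * r(s,a,s')."""
    return sum(p * r for p, r in zip(p_row, r_row))

# True transition probabilities P(s' | s=0, a=0) and an erroneous
# estimate of the same row, as an adversary might hold.
p_true = [0.9, 0.1]
p_hat = [0.7, 0.3]

# Reward that depends on the subsequent state s'.
r_next = [1.0, -1.0]

true_mean = expected_reward(p_true, r_next)
spoofed_mean = expected_reward(p_hat, r_next)
reward_gap = spoofed_mean - true_mean  # nonzero: the estimate error shows up

# Reward that ignores s': the same estimate error leaves no trace.
r_flat = [1.0, 1.0]
flat_gap = expected_reward(p_hat, r_flat) - expected_reward(p_true, r_flat)

print(f"gap with r(s,a,s'): {reward_gap:.3f}")
print(f"gap with r(s,a):    {flat_gap:.3f}")
```

In this sketch the gap equals the sum over s' of (P_hat - P) * r(s, a, s'), so it vanishes whenever the reward is constant in s'. That is the intuition behind letting the reward depend on the subsequent state.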