Game-Theoretic Modeling of Stealthy Intrusion Defense against MDP-Based Attackers

The rapid expansion of Internet use has increased the exposure of systems to cyber threats. Advanced persistent threats (APTs) are especially challenging because they are stealthy, long-lived, and proceed through multiple stages toward high-value assets. In this study, we model the evolution of an APT as a strategic interaction between an attacker and a defender on an attack graph. With limited information about the attacker's position and progress, the defender acts at random intervals by deploying intrusion detection sensors across the network. Once a compromise is detected, the affected components are immediately secured through measures such as backdoor removal, patching, or system reconfiguration. The attacker, in turn, begins with reconnaissance and then advances through the network, exploiting vulnerabilities and installing backdoors to maintain persistent access and to move adaptively. Because the attacker may take several steps between consecutive defensive operations, the temporal dynamics of the interaction are asymmetric. The defender's goal is to reduce the likelihood that the attacker gains access to a critical asset, whereas the attacker's goal is to increase it. We investigate this interaction under three informational regimes, which reflect different levels of attacker knowledge prior to action: (i) a Stackelberg scenario, in which the attacker has full knowledge of the defender's strategy and optimizes accordingly; (ii) a blind regime, in which the attacker has no information and assumes uniform beliefs about defensive deployments; and (iii) a belief-based framework, in which the attacker holds accurate probabilistic beliefs about the defender's actions. For each regime, we derive optimal defensive strategies by solving the corresponding optimization problem.

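As a purely illustrative sketch (not the model solved in this study), the attacker's planning problem and the effect of the informational regime can be mimicked on a toy attack graph. Everything below is an assumption made for illustration: the four-node graph, the node names, the detection probabilities, the simplification that every exploit succeeds and that a single detection ends the attack, and the helper names `plan`, `evaluate`, and `best_split`. The attacker plans by backward induction against a believed detection profile, and the resulting path is then scored against the defender's true profile.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy attack graph (a DAG): node -> nodes the attacker can move to next.
GRAPH: Dict[str, List[str]] = {
    "entry": ["web", "mail"],
    "web": ["db"],
    "mail": ["db"],
    "db": [],
}
START, TARGET = "entry", "db"


def plan(belief: Dict[str, float]) -> Dict[str, Optional[str]]:
    """Backward induction: pick, at each node, the successor that maximizes
    the believed probability of reaching TARGET undetected."""
    value: Dict[str, float] = {TARGET: 1.0}
    policy: Dict[str, Optional[str]] = {}

    def solve(node: str) -> float:
        if node in value:
            return value[node]
        best, best_next = 0.0, None
        for nxt in GRAPH[node]:
            # Assumption: entering nxt is detected with probability belief[nxt].
            v = (1.0 - belief[nxt]) * solve(nxt)
            if v > best:
                best, best_next = v, nxt
        value[node], policy[node] = best, best_next
        return best

    solve(START)
    return policy


def evaluate(policy: Dict[str, Optional[str]], detect: Dict[str, float]) -> float:
    """True probability that the planned path reaches TARGET undetected."""
    prob, node = 1.0, START
    while node != TARGET:
        nxt = policy.get(node)
        if nxt is None:
            return 0.0
        prob *= 1.0 - detect[nxt]
        node = nxt
    return prob


def best_split(steps: int = 10) -> Tuple[float, float]:
    """Grid search for a defender who splits one unit of detection
    probability between "web" and "mail" against an informed attacker."""
    candidates = []
    for k in range(steps + 1):
        d_web = k / steps
        detect = {"entry": 0.0, "web": d_web, "mail": 1.0 - d_web, "db": 0.0}
        candidates.append((evaluate(plan(detect), detect), d_web))
    return min(candidates)  # (attacker success probability, share placed on "web")


# Assumed true detection probabilities chosen by the defender.
true_detect = {"entry": 0.0, "web": 0.6, "mail": 0.2, "db": 0.1}
# Blind attacker: the same (uniform) belief for every node.
uniform_belief = {node: 0.3 for node in GRAPH}

informed_success = evaluate(plan(true_detect), true_detect)
blind_success = evaluate(plan(uniform_belief), true_detect)
print("informed attacker:", informed_success)
print("blind attacker:   ", blind_success)
print("defender grid search:", best_split())
```

In this toy instance the informed attacker routes around the heavily monitored node, while the blind attacker, indifferent between the two routes under a uniform belief, walks into it. An attacker holding some other belief vector can be scored the same way by passing that vector to `plan`. The grid search only hints at the defender's side of the problem; it is not a substitute for the optimization problems referred to above.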