We introduce an online learning algorithm for computing adaptive resource allocation policies against strategic ecological adversaries with unknown behavioral models and partial observability. Our setting addresses a fundamental limitation of security games: when adversary behavior cannot be modeled a priori, classical equilibrium-based approaches fail. We formulate the problem as regret minimization in a combinatorial action space with semi-bandit feedback, where payoffs are non-stationary and interdependent across targets. Our algorithm, named HERDS (Human-Elephant conflict mitigation through Resource Deployment for Strategic guarding), extends Follow-the-Perturbed-Leader (FPL) with three innovations: (1) simultaneous exploration and exploitation through dynamic budget partitioning driven by observed losses, (2) adaptive payoff estimation under confounded observations where attack entry points are unidentifiable, and (3) model-agnostic learning that provides regret guarantees without behavioral assumptions. We demonstrate our framework on Human-Elephant Conflict mitigation, a domain in which intelligent ecological adversaries exhibit strategic behavior (optimal foraging, spatial memory, adaptive evasion) yet lack tractable behavioral models. Experiments using an Agent-Based Model calibrated with elephant movement data show a 15--45% regret reduction versus FPL with Uniform Exploration (FPL-UE), a 40--50% reduction in crop damage against adaptive adversaries, and convergence in 40--50 rounds versus 60--80 for baselines.
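To make the learning setting concrete, the following is a minimal, illustrative sketch of a Follow-the-Perturbed-Leader loop with semi-bandit feedback over a combinatorial (cover-k-of-n-targets) action space. This is a generic FPL sketch, not the paper's HERDS algorithm: the dynamic budget partitioning and confounded-observation payoff estimation described above are omitted, and the loss estimator here is a simple empirical-frequency plug-in chosen for readability. All function and parameter names are hypothetical.

```python
import numpy as np

def fpl_semi_bandit(loss_matrix, k, eta=1.0, seed=0):
    """Generic FPL with semi-bandit feedback (illustrative sketch,
    not the paper's HERDS algorithm).

    loss_matrix: (T, n) array; row t holds the per-target losses at
                 round t. The learner only observes the losses of the
                 k targets it covers (semi-bandit feedback).
    k:           number of targets the defender can cover per round.
    eta:         scale of the exponential perturbation.
    Returns the total loss incurred over the T rounds.
    """
    rng = np.random.default_rng(seed)
    T, n = loss_matrix.shape
    loss_est = np.zeros(n)      # cumulative estimated per-target losses
    play_counts = np.zeros(n)   # how often each target has been covered
    total_loss = 0.0
    for t in range(T):
        # Perturb the cumulative estimates with exponential noise and
        # cover the k targets with the smallest perturbed loss.
        perturbed = loss_est - eta * rng.exponential(size=n)
        chosen = np.argsort(perturbed)[:k]
        # Semi-bandit feedback: losses are revealed only on covered targets.
        observed = loss_matrix[t, chosen]
        total_loss += observed.sum()
        play_counts[chosen] += 1
        # Crude importance weighting by empirical inclusion frequency
        # (a stand-in for proper unbiased estimation, e.g. geometric
        # resampling, which a real implementation would use).
        freq = np.maximum(play_counts[chosen] / (t + 1), 1e-6)
        loss_est[chosen] += observed / freq
    return total_loss
```

A usage example: with `n = 10` targets, a budget of `k = 3`, and losses drawn in `[0, 1)`, the total loss over `T` rounds is bounded by `T * k`; the point of the perturbation is that the selected cover set varies across rounds rather than locking onto one leader, which is what yields the no-regret behavior that HERDS builds on.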