Today's navigation applications (e.g., Waze and Google Maps) enable human users to learn and share the latest traffic observations, yet such information sharing merely helps selfish users predict and choose the shortest paths, jamming one another. Prior routing game studies focus on myopic users in oversimplified one-shot scenarios and regulate selfish routing via information-hiding or pricing mechanisms. For practical human-in-the-loop learning (HILL) in repeated routing games, we face non-myopic users with differential past observations and need new mechanisms (preferably non-monetary) to persuade users to adhere to the optimal path recommendations. We model the repeated routing game in a typical parallel transportation network, which generally contains one deterministic path and $N$ stochastic paths. We first prove that under both the information-sharing mechanism in use and the latest routing literature's hiding mechanism, the resulting price of anarchy (PoA), which measures the efficiency loss from the social optimum, can approach infinity, indicating an arbitrarily poor exploration-exploitation tradeoff over time. We then propose a novel user-differential probabilistic recommendation (UPR) mechanism to differentiate and randomize path recommendations for users with differential learning histories. We prove that our UPR mechanism ensures interim individual rationality for all users and significantly reduces $\text{PoA}=\infty$ to the close-to-optimal $\text{PoA}=1+\frac{1}{4N+3}$, which cannot be further reduced by any other non-monetary mechanism. In addition to the theoretical analysis, we conduct extensive experiments on real-world datasets to generalize our routing graphs and validate the close-to-optimal performance of the UPR mechanism.