Learning Selective Merge Policies for Deadline-Constrained Coded Caching via Deep Reinforcement Learning

In coded caching, the server exploits content cached at the users to serve several of them in parallel with a single coded multicast packet (a merged packet), which mitigates peak network congestion. In deadline-driven applications such as video streaming, every request carries a time limit, so the server must decide online which messages to merge for delivery. Merging helps the current coded multicast packet but can harm future deliveries. We formulate coded multicast delivery as a masked-action, discrete-state control problem and train a policy network with proximal policy optimization (PPO). On the uniform-demand benchmark, the policy reduces the broadcast-packet expiration ratio $\rho$ by $40.9\%$ ($0.208$ vs.\ $0.352$) relative to the best coded multicast baseline (SACM++), while also attaining the best broadcast-efficiency score $\sigma$ among coded multicast methods across the Track~A battery. Notably, under stricter deadlines the learned merging becomes selective rather than aggressive: the policy merges at only about $31.8\%$ of the opportunities, and the same behavior holds across variations within the same simulator family. Our design focuses on efficient pairwise XOR merging; higher-order ($K{\ge}3$) coding is a natural generalization left for future work.
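To make the pairwise XOR merge concrete, the following minimal Python sketch shows how one coded packet can serve two users whose caches each hold the file the other requested. It is an illustration under stated assumptions, not the authors' implementation: the names `xor_bytes`, `can_merge`, `library`, and the request dictionaries are hypothetical, payloads are assumed to have equal length, and deadlines and the learned policy are not modeled.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length payloads byte by byte."""
    if len(a) != len(b):
        raise ValueError("payloads must have equal length in this sketch")
    return bytes(x ^ y for x, y in zip(a, b))


def can_merge(req_a: dict, req_b: dict) -> bool:
    """A pair is mergeable when each user caches the file the other one wants."""
    return req_b["file"] in req_a["cache"] and req_a["file"] in req_b["cache"]


# Hypothetical two-file library and two pending requests.
library = {"A": b"alpha", "B": b"bravo"}
r1 = {"user": 1, "file": "A", "cache": {"B"}}
r2 = {"user": 2, "file": "B", "cache": {"A"}}
r3 = {"user": 3, "file": "B", "cache": set()}  # caches nothing, so cannot pair with r1

if can_merge(r1, r2):
    # One broadcast packet replaces two unicast transmissions.
    coded = xor_bytes(library["A"], library["B"])
    # Each user cancels the file it already caches to recover its own request.
    decoded_1 = xor_bytes(coded, library["B"])
    decoded_2 = xor_bytes(coded, library["A"])
```

A deadline-aware policy would additionally have to decide whether to send such a merged packet now or to hold a request for a later pairing, which is the trade-off between current and future deliveries described above.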
