We consider the problem of content caching at the wireless edge, where a set of end users is served over unreliable wireless channels, with the goal of minimizing the average latency that end users experience because of the limited capacity of the wireless edge cache. We formulate this problem as a Markov decision process, or more specifically a restless multi-armed bandit problem, which is provably hard to solve. We begin by investigating a discounted counterpart and prove that it admits an optimal policy of the threshold type. We then show that this result also holds for the average-latency problem. Using this structural result, we establish the indexability of our problem and employ the Whittle index policy to minimize average latency. Since system parameters such as content request rates and wireless channel conditions are often unknown and time-varying, we further develop a model-free reinforcement learning algorithm, dubbed Q^{+}-Whittle, that relies on the Whittle index policy. However, Q^{+}-Whittle requires storing the Q-function values for all state-action pairs, the number of which can be extremely large for wireless edge caching. To address this, we approximate the Q-function by a parameterized function class of much smaller dimension and design a Q^{+}-Whittle algorithm with linear function approximation, called Q^{+}-Whittle-LFA. We provide a finite-time bound on the mean-square error of Q^{+}-Whittle-LFA. Simulation results using real traces demonstrate that Q^{+}-Whittle-LFA yields excellent empirical performance.
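To make two of the ingredients named above concrete, the following is a minimal Python sketch, not the authors' Q^{+}-Whittle-LFA algorithm. It assumes the Whittle indices are already available as plain numbers, and it illustrates (i) the generic index rule of activating the arms with the largest indices subject to a capacity limit, and (ii) a textbook temporal-difference step for a Q-function approximated as a linear combination of features. All names (`whittle_index_policy`, `linear_q`, `td_update`) and all parameters are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def whittle_index_policy(indices: Sequence[float], capacity: int) -> List[int]:
    """Pick which arms (contents) to activate (cache).

    Generic index rule: activate the `capacity` arms with the largest
    index values. Returns the chosen positions in ascending order.
    """
    if capacity < 0:
        raise ValueError("capacity must be non-negative")
    # Rank arm positions by index value, largest first.
    ranked = sorted(range(len(indices)), key=lambda i: indices[i], reverse=True)
    return sorted(ranked[:capacity])


def linear_q(theta: Sequence[float], features: Sequence[float]) -> float:
    """Linear approximation of a Q-value: inner product of theta and features."""
    return sum(t * f for t, f in zip(theta, features))


def td_update(
    theta: Sequence[float],
    features: Sequence[float],
    target: float,
    step_size: float,
) -> List[float]:
    """One semi-gradient step moving the linear Q-estimate toward `target`.

    This is a standard textbook update, used here only as an assumption
    about what a linear-function-approximation step can look like.
    """
    error = target - linear_q(theta, features)
    return [t + step_size * error * f for t, f in zip(theta, features)]
```

For example, with four contents and a cache of size two, `whittle_index_policy([0.2, 1.5, -0.3, 0.9], 2)` selects the contents at positions 1 and 3. The point of the linear approximation is that only the parameter vector `theta` is stored, rather than one Q-value per state-action pair.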