Deep Reinforcement Learning (DRL), a subset of machine learning focused on sequential decision-making, has emerged as a powerful approach for tackling financial trading problems. In finance, DRL is commonly used either to generate discrete trade signals or to determine continuous portfolio allocations. In this work, we propose a novel reinforcement learning framework for portfolio optimization that incorporates Physics-Informed Kolmogorov-Arnold Networks (PIKANs) into several DRL algorithms. The approach replaces conventional multilayer perceptrons with Kolmogorov-Arnold Networks (KANs) in both the actor and critic components, using learnable univariate B-spline functions to achieve parameter-efficient and more interpretable function approximation. During actor updates, we introduce a physics-informed regularization loss that promotes second-order temporal consistency between observed return dynamics and action-induced portfolio adjustments. The proposed framework is evaluated on three equity markets (China, Vietnam, and the United States), covering both emerging and developed economies. Across all three markets, PIKAN-based agents consistently deliver higher cumulative and annualized returns, superior Sharpe and Calmar ratios, more favorable drawdown characteristics, and more stable training than both standard DRL baselines and classical online portfolio-selection methods. The approach is particularly valuable in highly dynamic and noisy financial markets, where conventional DRL often suffers from instability and poor generalization.
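To make the KAN component concrete: the idea of replacing an MLP layer is that each input-output edge carries its own learnable univariate function, parameterized as a linear combination of B-spline basis functions. The following is a minimal numpy sketch of one such layer (the grid range, number of basis functions, and initialization are illustrative assumptions, not the paper's exact configuration):

```python
import numpy as np

def bspline_basis(x, grid, k=3):
    """Cox-de Boor recursion: evaluate the degree-k B-spline basis at x.

    x: (n,) points; grid: (G,) uniform knot vector.
    Returns an (n, G - k - 1) array of basis values."""
    x = x[:, None]
    # degree-0 basis: indicator of each knot interval
    B = ((x >= grid[:-1]) & (x < grid[1:])).astype(float)
    for d in range(1, k + 1):
        left = (x - grid[:-(d + 1)]) / (grid[d:-1] - grid[:-(d + 1)]) * B[:, :-1]
        right = (grid[d + 1:] - x) / (grid[d + 1:] - grid[1:-d]) * B[:, 1:]
        B = left + right
    return B

class KANLayer:
    """One KAN layer: every input-output edge (i, j) carries its own
    learnable univariate spline phi_ij, and y_j = sum_i phi_ij(x_i)."""

    def __init__(self, in_dim, out_dim, n_basis=8, k=3, seed=0):
        self.k = k
        # uniform knot grid extended beyond [-1, 1] so inputs there are covered
        h = 2.0 / (n_basis - k)
        self.grid = np.linspace(-1.0 - k * h, 1.0 + k * h, n_basis + k + 1)
        rng = np.random.default_rng(seed)
        # spline coefficients: one length-n_basis vector per edge (assumed init)
        self.coef = rng.normal(0.0, 0.1, size=(in_dim, out_dim, n_basis))

    def __call__(self, x):
        # x: (batch, in_dim) -> (batch, out_dim)
        out = np.zeros((x.shape[0], self.coef.shape[1]))
        for i in range(self.coef.shape[0]):
            B = bspline_basis(x[:, i], self.grid, self.k)  # (batch, n_basis)
            out += B @ self.coef[i].T                      # accumulate phi_ij(x_i)
        return out

layer = KANLayer(in_dim=4, out_dim=2)
x = np.random.default_rng(1).uniform(-0.9, 0.9, size=(5, 4))
y = layer(x)  # shape (5, 2)
```

Because the learned functions are univariate splines, each edge's contribution can be plotted directly, which is the source of the interpretability claim; the spline coefficients are the trainable parameters that a gradient-based DRL update would adjust.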
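The abstract does not give the exact form of the physics-informed regularization term, so as an illustration only, one plausible discretization of "second-order temporal consistency" compares second finite differences of the action-induced weight path against those of the observed returns (the function below is a hypothetical formulation, not the paper's stated loss):

```python
import numpy as np

def second_order_consistency_loss(weights, returns):
    """Penalize mismatch between the second finite differences of the
    portfolio-weight path and of the observed returns (hypothetical form).

    weights, returns: (T, n_assets) arrays over T consecutive steps."""
    d2w = weights[2:] - 2.0 * weights[1:-1] + weights[:-2]
    d2r = returns[2:] - 2.0 * returns[1:-1] + returns[:-2]
    return float(np.mean((d2w - d2r) ** 2))

# Paths that are linear in time have zero second difference, so the
# penalty vanishes: the loss reacts only to curvature mismatches.
t = np.arange(6, dtype=float)[:, None]      # (6, 1) time index
w = 0.5 * t + 0.25                          # linear weight path
r = 0.5 * t - 0.25                          # linear return path
loss = second_order_consistency_loss(w, r)  # -> 0.0
```

In the described framework such a term would be added, with some weighting coefficient, to the actor's policy-gradient objective during actor updates.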