Deep Reinforcement Learning (DRL), a branch of machine learning focused on sequential decision-making, has emerged as a powerful approach to financial trading problems. In finance, DRL is commonly used either to generate discrete trade signals or to determine continuous portfolio allocations. In this work, we propose a novel reinforcement learning framework for portfolio optimization that incorporates Physics-Informed Kolmogorov-Arnold Networks (PIKANs) into several DRL algorithms. The approach replaces conventional multilayer perceptrons with Kolmogorov-Arnold Networks (KANs) in both the actor and critic components, using learnable B-spline univariate functions to achieve parameter-efficient and more interpretable function approximation. During actor updates, we introduce a physics-informed regularization loss that promotes second-order temporal consistency between observed return dynamics and action-induced portfolio adjustments. The proposed framework is evaluated on three equity markets (China, Vietnam, and the United States), covering both emerging and developed economies. Across all three markets, PIKAN-based agents consistently deliver higher cumulative and annualized returns, superior Sharpe and Calmar ratios, and more favorable drawdown characteristics than both standard DRL baselines and classical online portfolio-selection methods, along with more stable training. The approach is particularly valuable in highly dynamic and noisy financial markets, where conventional DRL often suffers from instability and poor generalization.
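To make the architectural substitution concrete, the following is a minimal sketch of a KAN layer, assuming a PyTorch implementation; the class and function names (`KANLayer`, `bspline_basis`) and all hyperparameters are illustrative choices, not the paper's code. Each input-output edge carries a learnable univariate function parameterized by B-spline coefficients on a fixed knot grid, plus a residual SiLU-linear path, replacing the fixed nonlinearity of an MLP.

```python
# Hypothetical sketch of a KAN layer with learnable B-spline edge functions.
import torch
import torch.nn as nn


def bspline_basis(x, grid, k):
    """Cox-de Boor recursion: order-k B-spline bases on a fixed knot grid.

    x:    (batch, in_dim) inputs
    grid: (in_dim, n_knots) knot positions per input dimension
    Returns (batch, in_dim, n_knots - k - 1) basis values.
    """
    x = x.unsqueeze(-1)                                   # (batch, in_dim, 1)
    bases = ((x >= grid[..., :-1]) & (x < grid[..., 1:])).to(x.dtype)
    for d in range(1, k + 1):
        left = (x - grid[..., : -(d + 1)]) / (grid[..., d:-1] - grid[..., : -(d + 1)])
        right = (grid[..., d + 1 :] - x) / (grid[..., d + 1 :] - grid[..., 1:-d])
        bases = left * bases[..., :-1] + right * bases[..., 1:]
    return bases


class KANLayer(nn.Module):
    """Each edge (i, j) carries a learnable spline; outputs sum over inputs."""

    def __init__(self, in_dim, out_dim, n_grid=8, k=3, x_range=(-2.0, 2.0)):
        super().__init__()
        self.k = k
        # Uniform knot grid, extended by k knots per side for the recursion.
        h = (x_range[1] - x_range[0]) / n_grid
        knots = torch.arange(-k, n_grid + k + 1) * h + x_range[0]
        self.register_buffer("grid", knots.expand(in_dim, -1).contiguous())
        n_basis = n_grid + k
        # Learnable spline coefficients c_{ijn} and a residual base path.
        self.coef = nn.Parameter(torch.randn(in_dim, out_dim, n_basis) * 0.1)
        self.base = nn.Linear(in_dim, out_dim)

    def forward(self, x):
        b = bspline_basis(x, self.grid, self.k)           # (batch, in_dim, n_basis)
        spline = torch.einsum("bin,ion->bo", b, self.coef)
        return self.base(torch.nn.functional.silu(x)) + spline
```

Stacking such layers in place of the MLPs of the actor and critic is what the abstract refers to as the PIKAN substitution; the spline coefficients are the source of the parameter efficiency and per-edge interpretability claimed above.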
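The abstract does not specify the exact form of the physics-informed regularizer, so the sketch below is one plausible reading of "second-order temporal consistency": a discrete second-difference (curvature) penalty aligning action-induced portfolio adjustments with observed return dynamics, added to the standard actor loss. The function names and the weight `lam` are hypothetical.

```python
# Hypothetical sketch of the physics-informed actor regularizer: penalize
# mismatch between the second-order finite difference of portfolio weights
# (actions) and that of observed asset returns along a sampled trajectory.
import torch


def second_difference(x):
    """Discrete second derivative along time: x_{t+1} - 2 x_t + x_{t-1}."""
    return x[2:] - 2.0 * x[1:-1] + x[:-2]


def physics_informed_penalty(weights, returns, lam=0.1):
    """weights: (T, n_assets) actions over a trajectory
    returns: (T, n_assets) observed per-asset returns
    lam:     regularization strength (hypothetical hyperparameter)
    """
    d2_w = second_difference(weights)   # curvature of portfolio adjustments
    d2_r = second_difference(returns)   # curvature of return dynamics
    return lam * torch.mean((d2_w - d2_r) ** 2)


# Usage inside an actor update (placeholder names for actor/critic):
# actor_loss = -critic(states, actor(states)).mean() \
#              + physics_informed_penalty(actor(states), batch_returns)
```

Under this reading, the penalty discourages the policy from rebalancing faster or slower than the underlying return dynamics warrant, which is consistent with the stability gains the abstract reports for noisy markets.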