Time series (TS) reasoning models (TSRMs) have shown promising capabilities in general domains, yet they consistently fail in the financial domain, which exhibits unique characteristics. We propose a general 2 × 2 capability taxonomy for TSRMs by crossing 1) single-entity vs. multi-entity analysis with 2) assessment of the current state vs. prediction of future behavior. We instantiate this taxonomy in the financial domain, where the distinction between deterministic assessment and stochastic prediction is particularly critical, as ten financial reasoning tasks that together form the FinTSR-Bench benchmark, built on S&P stocks. To address the failure of existing TSRMs on such tasks, we propose FinSTaR (Financial Time Series Thinking and Reasoning), a model trained on FinTSR-Bench with distinct chain-of-thought (CoT) strategies tailored to each category. For assessment, which is deterministic (i.e., computable from observable data), we employ Compute-in-CoT, a programmatic CoT that enables the model to derive answers directly from raw prices. For prediction, which is inherently stochastic (i.e., subject to unobservable factors), we adopt Scenario-Aware CoT, which generates diverse scenarios before making a judgment, mirroring how financial analysts reason under uncertainty. FinSTaR achieves 78.9% average accuracy on FinTSR-Bench, substantially outperforming LLM and TSRM baselines. Furthermore, we show that the four capability categories are complementary and mutually reinforcing under joint training, and that Scenario-Aware CoT consistently improves prediction accuracy over standard CoT. Code is available at https://github.com/seunghan96/FinSTaR.
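
To make the notion of a deterministic assessment concrete, the following minimal Python sketch shows quantities that are fully computable from an observed price window. It is an illustration only, not the FinSTaR implementation: the function names (`period_return`, `max_drawdown`, `assess_trend`), the choice of metrics, and the `threshold` parameter are all assumptions introduced here and do not come from FinTSR-Bench.

```python
from typing import Sequence


def period_return(prices: Sequence[float]) -> float:
    """Simple return from the first to the last observed price."""
    if len(prices) < 2:
        raise ValueError("need at least two prices")
    return prices[-1] / prices[0] - 1.0


def max_drawdown(prices: Sequence[float]) -> float:
    """Largest peak-to-trough decline, returned as a non-positive fraction."""
    if not prices:
        raise ValueError("need at least one price")
    peak = prices[0]
    worst = 0.0
    for p in prices:
        peak = max(peak, p)                 # running maximum so far
        worst = min(worst, p / peak - 1.0)  # decline relative to that maximum
    return worst


def assess_trend(prices: Sequence[float], threshold: float = 0.0) -> str:
    """Hypothetical assessment label derived only from observed prices."""
    r = period_return(prices)
    if r > threshold:
        return "up"
    if r < -threshold:
        return "down"
    return "flat"


if __name__ == "__main__":
    window = [100.0, 110.0, 99.0, 120.0]  # made-up prices for illustration
    print(period_return(window), max_drawdown(window), assess_trend(window))
```

Because every step depends only on the observed window, the same input always yields the same answer. Prediction tasks lack this property, which is the motivation the abstract gives for treating them with a separate, scenario-based strategy.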