CrossAlpha: An Annual-Report Benchmark for Cross-Market Factor Research (with LLM Agents)

Cross-market factor research studies whether firm-level signals from one or more markets can predict returns in a target market, but existing public benchmarks do not support cross-market disclosure-to-return evaluation. Building such a benchmark is challenging for three reasons: filings differ across languages and regulatory systems, disclosure-derived similarity can be biased by common reporting components, and cross-market signals must be evaluated under feasible trading-time alignment. We introduce \textbf{CrossAlpha}, a public annual-report benchmark for cross-market factor research. CrossAlpha addresses these challenges through three corresponding components: \emph{Disclosure Distillation}, which standardises heterogeneous filings into ten-category English business descriptions; \emph{Residual Schema Graph Construction}, which builds PCA-whitened cross-market firm-pair scores from schema-level disclosures; and \emph{Timing-Aligned Evaluation}, which pairs the graph with 11 years of daily OHLCV data to construct forward-return labels under feasible cross-market execution protocols. CrossAlpha covers about 3,600 firms and 10,700 firm-year reports from the United States, Japan, Taiwan, South Korea, and Hong Kong, and releases about 19M directed firm-pair scores. In experiments, disclosure-derived cross-market peers outperform domestic text, industry-code, and return-correlation peers in the US-to-Japan setting (ICIR 0.39 versus 0.07--0.18), and cross-market sources beat the domestic text baseline in most target markets. CrossAlpha offers an open-source, reusable, return-grounded benchmark for cross-market financial NLP.
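
For readers unfamiliar with the metric, the ICIR figures above can be read under the conventional definition, sketched here as an assumption: the abstract does not state the exact variant (for example, Pearson versus rank correlation, or any annualisation), and the symbols $s_{i,t}$, $r_{i,t+1}$, and $T$ are introduced only for illustration. Let $s_{i,t}$ be the peer-derived signal for target-market firm $i$ on day $t$ and $r_{i,t+1}$ its forward return under the chosen execution protocol. Then
\[
\mathrm{IC}_t = \operatorname{corr}_i\!\left(s_{i,t},\, r_{i,t+1}\right), \qquad
\overline{\mathrm{IC}} = \frac{1}{T}\sum_{t=1}^{T} \mathrm{IC}_t, \qquad
\mathrm{ICIR} = \frac{\overline{\mathrm{IC}}}{\sqrt{\frac{1}{T-1}\sum_{t=1}^{T}\left(\mathrm{IC}_t - \overline{\mathrm{IC}}\right)^2}} .
\]
Under this reading, a higher ICIR indicates a signal whose cross-sectional predictive correlation is both positive on average and stable over time.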
