Cross-market factor research studies whether firm-level signals from one or more markets can predict returns in a target market, but existing public benchmarks do not support cross-market disclosure-to-return evaluation. Building such a benchmark is challenging because filings differ across languages and regulatory systems, disclosure-derived similarity can be biased by common reporting components, and cross-market signals must be evaluated under feasible trading-time alignment. We introduce \textbf{CrossAlpha}, a public annual-report benchmark for cross-market factor research. CrossAlpha addresses these challenges through three corresponding components: \emph{Disclosure Distillation}, which standardises heterogeneous filings into ten-category English business descriptions; \emph{Residual Schema Graph Construction}, which builds PCA-whitened cross-market firm-pair scores from schema-level disclosures; and \emph{Timing-Aligned Evaluation}, which pairs the graph with 11 years of daily OHLCV data to construct forward-return labels under feasible cross-market execution protocols. CrossAlpha covers about 3,600 firms and 10,700 firm-year reports from the United States, Japan, Taiwan, South Korea, and Hong Kong, and releases about 19M directed firm-pair scores. In experiments, disclosure-derived cross-market peers outperform domestic text, industry-code, and return-correlation peers in the US-to-Japan setting (ICIR 0.39 versus 0.07--0.18), and cross-market sources outperform the domestic text baseline in most target markets. CrossAlpha offers an open-source, reusable, return-grounded benchmark for cross-market financial NLP.