Renmin University of China, Future Finance Innovation Engineering Center (CEIF)

2026-03

Constructing an Industry Classification Dataset for Chinese A-Share Listed Companies: A Large Language Model Approach

Authors: Ke Wu (吴轲), Zhenkun Ying (应镇焜), Zongxin Qian (钱宗鑫), Dexin Zhou (周德馨)

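Abstract: Industry classification is a foundational tool for empirical research in financial economics, yet the several classification standards currently used for China's A-share market commonly suffer from delayed updates and insufficient discriminating power. Using the “Management Discussion and Analysis” (MD&A) text of 52,702 annual reports filed by A-share listed companies from 2007 to 2023, this paper combines the text-embedding capability of large language models (LLMs) with hierarchical agglomerative clustering to construct an industry classification dataset for Chinese A-share listed companies that covers 26 first-level, 102 second-level, and 271 third-level industries. Empirical results show that this classification significantly outperforms mainstream standards such as those of the China Association for Public Companies (中上协), Shenwan (申万), and Wind (万得) on both between-industry dissimilarity and within-industry similarity, indicating that it more effectively achieves the classification goal of “similar within classes, different across classes”. Extended analysis shows that lead-lag hedged portfolios built on the LLM classification earn statistically significant average monthly returns, which remain significant after adjustment for the Fama-French five-factor model and the Chinese four-factor model. Fama-MacBeth regressions further confirm that the LLM classification has the strongest predictive power in capturing the same-industry momentum effect of high-priced stocks, providing asset-pricing-based evidence for the accuracy of the classification. This paper offers China's A-share market an accurate, data-driven, and dynamically updatable industry classification framework, and a new analytical tool for empirical research in corporate finance, asset pricing, and related fields.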
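The pipeline described in this abstract (embed each firm's text, then merge firms bottom-up into nested industry levels) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the firm names, the three-dimensional "embeddings", the cosine distance, the average-linkage rule, and the function names are hypothetical and are not taken from the paper, which does not state these details in the abstract.

```python
import math

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms

def average_linkage(vectors, left, right):
    """Mean pairwise cosine distance between two clusters of row indices."""
    total = sum(cosine_distance(vectors[i], vectors[j]) for i in left for j in right)
    return total / (len(left) * len(right))

def cluster_levels(vectors, level_sizes):
    """Merge bottom-up and snapshot the partition at each requested cluster count."""
    clusters = [[i] for i in range(len(vectors))]
    snapshots = {}
    while True:
        if len(clusters) in level_sizes:
            snapshots[len(clusters)] = [list(c) for c in clusters]
        if len(clusters) <= min(level_sizes):
            return snapshots
        # Find and merge the closest pair of clusters.
        _, a, b = min(
            (average_linkage(vectors, clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]

# Hypothetical firms and toy embeddings (real LLM embeddings have far more dimensions).
firms = ["bank_a", "bank_b", "insurer", "chipmaker_a", "chipmaker_b", "software"]
embeddings = [
    [0.90, 0.10, 0.00],
    [0.85, 0.15, 0.05],
    [0.70, 0.30, 0.10],
    [0.05, 0.90, 0.20],
    [0.10, 0.88, 0.25],
    [0.10, 0.60, 0.70],
]

levels = cluster_levels(embeddings, {2, 4})  # 4 fine-grained and 2 coarse industries
fine = [[firms[i] for i in c] for c in levels[4]]
coarse = [[firms[i] for i in c] for c in levels[2]]
```

Because every coarse cluster is produced by merging finer ones, the levels are nested by construction, which is the property a first-level / second-level / third-level scheme needs. On data of realistic size, a library routine would replace the quadratic search for the closest pair.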

2026-05

The “RUC-Xinhua” (人大-新华) Industry Classification Dataset for Chinese A-Share Listed Companies

Authors: Ke Wu (吴轲), Zhenkun Ying (应镇焜), Zongxin Qian (钱宗鑫), Dexin Zhou (周德馨)

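Abstract: The “RUC-Xinhua” (人大-新华) industry classification dataset for A-share listed companies was officially released in March 2026 and is available on the Xinhua Finance (新华财经) data terminal. The dataset may be used only for academic research and non-commercial purposes. Any research output that uses the dataset (including but not limited to academic papers, research reports, and working papers) must cite this paper in its references.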

2026-04

A New Approach to Connecting the Dividend-Price Ratio and Stock Returns

Author: Wenting Liao

Abstract: There is a long-running debate over how well the dividend-price ratio predicts stock returns. Most of the literature argues that predictability has declined since the 1990s. Because the Campbell-Shiller decomposition shows that the dividend-price ratio contains information about both future returns and future dividend growth, a linear predictive regression of stock returns on the dividend-price ratio may produce biased results owing to measurement error or omitted variables. This paper therefore proposes a new approach to studying the nonlinear Granger causality from the dividend-price ratio to stock returns. We specify an unobserved component model and connect stock returns and the dividend-price ratio through their innovations. We show that our model can be represented by a reduced-form ARMAX process and that it increases in-sample predictability in all sample periods.
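For context, the Campbell-Shiller log-linear present-value identity that this abstract refers to is commonly written as follows. The notation is the textbook convention and is assumed here, since the abstract does not give the paper's own symbols: $d_t$ and $p_t$ are the log dividend and log price, $r_t$ is the log return, $\Delta d_t$ is log dividend growth, and $\kappa$ and $\rho \in (0,1)$ are linearization constants.

```latex
d_t - p_t \;=\; -\frac{\kappa}{1-\rho}
  \;+\; \mathbb{E}_t \sum_{j=0}^{\infty} \rho^{j}
  \left( r_{t+1+j} - \Delta d_{t+1+j} \right)
```

The right-hand side combines expected future returns with expected future dividend growth, so the dividend-price ratio is at best a noisy proxy for expected returns alone. That is the sense in which a linear regression of returns on the ratio can suffer from measurement error or omitted variables.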

2026-03

One-Shot Traversal for Low-Latency SSD-based Tree Index

Authors: Yiheng Tong, Minhui Xie, Yuanhui Luo, Jing Liu, Ran Shu, Yongqiang Xiong, Yunpeng Chai

Abstract: SSD-based tree indexes (e.g., B+tree) have been widely adopted in storage and database systems. Driven by modern high-performance SSDs, much prior work has focused on improving their overall throughput. However, when it comes to latency, such approaches still suffer from the inherent bottleneck of pointer chasing during index traversals, so query latency does not benefit directly from faster SSDs. In this paper, we propose Shortcut, a lightweight solution that transforms the conventional wisdom of inherent access dependency into one-shot traversal, thereby achieving low query latency.
Our key idea is to exploit intra-query parallelism by training per-level learned indexes as B+tree companions that predict the traversal path in advance, and by issuing multiple concurrent I/O requests to prefetch all target nodes in a one-shot manner. The main challenge we deal with is the excessive memory overhead of the additional learned index, which originates not from the machine learning (ML) models but from two essential structures in it: the key array (for last-mile search) and the value array (termed “mapping array”, for mapping model predictions to arbitrary node locations in the file). Shortcut eliminates both arrays through two techniques, the Keyless Learning Method and the Phantom Mapping Mechanism, and achieves near-zero memory overhead (∼0.6%). Evaluations on YCSB and TPC-C show that Shortcut can reduce end-to-end query latency by 26.2% to 64.8%.
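The contrast between pointer chasing and one-shot traversal can be sketched as follows. This is a toy under loud assumptions, not Shortcut itself: the "SSD" is an in-memory dictionary, the per-level model is an exact linear map that only works because the toy keys are dense and uniform, the fallback on a misprediction is assumed, and neither the Keyless Learning Method nor the Phantom Mapping Mechanism is modeled. All names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

FANOUT = 4
NUM_KEYS = 64  # toy key space 0..63 gives a 3-level tree

def level_spans():
    """Key-range width covered by one node at each level, root first."""
    spans, span = [], NUM_KEYS
    while span >= FANOUT:
        spans.append(span)
        span //= FANOUT
    return spans

SPANS = level_spans()  # [64, 16, 4]

# Simulated on-SSD pages: (level, node index) -> (lowest key, highest key).
PAGES = {
    (lvl, lo // span): (lo, lo + span - 1)
    for lvl, span in enumerate(SPANS)
    for lo in range(0, NUM_KEYS, span)
}

def read_page(page_id):
    """Stand-in for one SSD read."""
    return PAGES[page_id]

def chase_pointers(key):
    """Conventional traversal: each read depends on the result of the previous one."""
    idx, rounds = 0, 0
    for lvl in range(len(SPANS)):
        lo, _ = read_page((lvl, idx))
        rounds += 1
        if lvl + 1 < len(SPANS):
            idx = idx * FANOUT + (key - lo) // SPANS[lvl + 1]
    return (len(SPANS) - 1, idx), rounds

def predict_path(key):
    """Per-level model: a linear map from key to node index (exact on this toy data)."""
    return [(lvl, key // span) for lvl, span in enumerate(SPANS)]

def one_shot_lookup(key):
    path = predict_path(key)
    with ThreadPoolExecutor(max_workers=len(path)) as pool:
        nodes = list(pool.map(read_page, path))  # all levels requested concurrently
    if all(lo <= key <= hi for lo, hi in nodes):
        return path[-1], 1  # one round of parallel I/O
    return chase_pointers(key)  # assumed fallback when a prediction misses

leaf_dependent, rounds_dependent = chase_pointers(37)
leaf_one_shot, rounds_one_shot = one_shot_lookup(37)
```

Both lookups reach the same leaf, but the dependent traversal needs one I/O round per level while the one-shot version needs a single round, which is where the latency saving comes from when SSD reads can overlap.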

2026-03

Constructing Parameter-Optimal Graph ANNS Index with GArena

Authors: Puqing Wu, Minhui Xie, Hao Guo, Jie Yin, Sen Yang, Youyou Lu, Yunpeng Chai

Abstract: Graph-based Approximate Nearest Neighbor Search (GANNS) delivers exceptional query performance compared with other algorithms, but this efficiency comes at the cost of significantly longer construction time. Because index construction parameters can alter final query throughput by up to an order of magnitude, identifying the parameter-optimal configuration is essential for real-world deployments. Yet this process is notoriously challenging: traditional tuning methods can take days or weeks because they require repeated, full graph rebuilds.
This paper introduces GArena, a GANNS constructor that accelerates the search for optimal construction parameters. Our key observation is that while different parameter settings yield distinct graph topologies, their construction processes exhibit over 70% redundancy in distance calculations. Though a naive distance cache can eliminate this redundancy, it incurs prohibitive memory overhead (dozens of TB). Instead, GArena redesigns the tuning paradigm by launching multiple parallel graphs with different configurations and aligning their distance computation trajectories, thereby greatly enhancing the distance cache’s temporal locality. GArena can bound the cache size to only a few MB (fully resident in CPU cache). GArena also introduces a performance-potential-guided pruner to discard unpromising configurations early on small subsets. Evaluation shows that GArena can dramatically shrink the tuning process from a day down to minutes, while consistently identifying the optimal configuration.

2026-03

Accelerating Ephemeral Approximate Nearest Neighbor Search by Progressive Index Construction

Authors: Minhui Xie, Enrui Zhao, Yaxin Ma, Puqing Wu, Baotong Lu, Yuanhui Luo, Yongqiang Xiong, Jing Wang, Yunpeng Chai

Abstract: Emerging applications like AI chatbots, code assistants, and agentic workflows have created a growing need for ephemeral Approximate Nearest Neighbor Search (ANNS), where an ANN index must be constructed online over previously unknown, ad-hoc, short-lived datasets. Traditional ANNS methods, designed for offline index construction on datasets known in advance, are ill-suited for such scenarios: the monolithic, upfront index construction process imposes substantial latency on the user’s critical path, degrading the interactive experience.

This paper presents FleetANN, a system that accelerates ephemeral ANNS by pioneering a progressive index construction paradigm. FleetANN logically partitions the dataset into an already-indexed component (I-component) and an unindexed brute-force component (BF-component), separated by a conceptual cursor. In the background, FleetANN continuously advances this cursor by migrating vectors from the BF-component into the I-component, incrementally building the index. In the foreground, FleetANN can serve user queries immediately via a hybrid retrieval strategy, ensuring theoretically guaranteed recall even with a partially constructed index. To mitigate the initially high cost of brute-force search, FleetANN introduces a history-guided pruning technique that exploits distance information from past queries to avoid unnecessary computations. Evaluation shows that FleetANN can avoid the costly initial construction stall (up to hundreds of seconds) while ultimately achieving query performance equal to or better than that of a full ANNS index.
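The cursor-based split between an indexed component and a brute-force component can be sketched as below. This is a minimal illustration under stated assumptions, not FleetANN's design: the "index" is a hypothetical 2-D grid that probes the query's cell and its eight neighbors, the class and method names are invented, history-guided pruning is not modeled, and the sketch carries no recall guarantee of its own.

```python
import heapq
import math

class ProgressiveIndex:
    """Toy progressive index: ids below the cursor are indexed, the rest are scanned."""

    def __init__(self, vectors, cell=1.0):
        self.vectors = vectors
        self.cell = cell
        self.cursor = 0   # vectors[:cursor] form the I-component
        self.grid = {}    # stand-in index: grid cell -> list of vector ids

    def _cell_of(self, v):
        return (math.floor(v[0] / self.cell), math.floor(v[1] / self.cell))

    def advance(self, batch=1):
        """Background step: migrate up to `batch` vectors from BF- to I-component."""
        stop = min(self.cursor + batch, len(self.vectors))
        for i in range(self.cursor, stop):
            self.grid.setdefault(self._cell_of(self.vectors[i]), []).append(i)
        self.cursor = stop

    def query(self, q, k=1):
        """Foreground step: probe the index, scan the unindexed tail, merge top-k."""
        cx, cy = self._cell_of(q)
        candidates = [
            i
            for dx in (-1, 0, 1)
            for dy in (-1, 0, 1)
            for i in self.grid.get((cx + dx, cy + dy), [])
        ]
        candidates.extend(range(self.cursor, len(self.vectors)))
        return heapq.nsmallest(k, candidates, key=lambda i: math.dist(q, self.vectors[i]))

points = [(0.2, 0.1), (5.0, 5.0), (0.4, 0.3), (9.0, 1.0), (0.3, 0.2)]
index = ProgressiveIndex(points)
query_point = (0.3, 0.25)

before = index.query(query_point, k=2)   # cursor at 0: pure brute force
index.advance(batch=3)
during = index.query(query_point, k=2)   # partly indexed: hybrid retrieval
index.advance(batch=10)
after = index.query(query_point, k=2)    # fully indexed: index probe only
```

Queries are answerable from the first moment because the unindexed tail is always scanned; as the cursor advances, the scanned tail shrinks and more of the work shifts to the index.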

2026-03

Joint Auction in the Online Advertising Market

Authors: Zhen Zhang, Weian Li, Yahui Lei, Bingzhe Wang, Zhicheng Zhang, Qi Qi, Qiang Liu, Xingxing Wang

Abstract: Online advertising is a primary source of income for e-commerce platforms. In the current advertising pattern, the targets are online store owners who are willing to pay extra fees to improve the position of their stores. Brand suppliers, for their part, also wish to advertise their products in stores to boost brand sales. However, the current advertising mode cannot satisfy the demands of stores and brand suppliers simultaneously. To address this, we propose a joint advertising model termed “Joint Auction”, which allows brand suppliers and stores to bid collaboratively for advertising slots and so caters to both their needs. Conventional advertising auction mechanisms, however, are not suitable for this novel scenario. In this paper, we propose JRegNet, a neural network architecture for optimal joint auction design, to generate mechanisms that achieve optimal revenue and guarantee (near-)dominant strategy incentive compatibility and individual rationality. Finally, multiple experiments on synthetic and real data demonstrate that our proposed joint auction significantly improves the platform’s revenue compared with known baselines.
