Paper on machine learning approach to nominative record linkage in Chinese historical sources

Yue YU is the lead author of a new paper, now online at Historical Methods, that introduces a machine-learning approach to nominative record linkage in Chinese historical sources. The co-authors are Cameron Campbell, Yueran Hou and Yibei Wu. The paper, titled “A machine learning approach for nominative record linkage in Chinese historical databases,” is a revised version of the working paper that we previously uploaded to SocArXiv. It demonstrates that a machine-learning approach yields substantial improvements over the probabilistic approach introduced in an earlier paper by Cameron Campbell and Bijia Chen.

Reference:

YU Yue, Yueran Hou, Yibei Wu, Cameron Campbell. 2026. A Machine Learning Approach for Nominative Record Linkage in Chinese Historical Databases. Historical Methods. Online access, 26 March 2026: https://www.tandfonline.com/doi/full/10.1080/01615440.2026.2641454

Abstract

We introduce a generic machine learning-based pipeline for nominative linkage of records within and across Chinese historical datasets. The pipeline addresses key challenges, including character variations, incomplete data, and scalability issues specific to historical datasets in which names and other attributes are recorded with Chinese characters, not just for China, but potentially for Korea, Japan and Vietnam. Techniques developed for attributes recorded in phonetic alphabets are of limited use for Chinese characters not only because homonyms are common, but characters that are similar enough in appearance to be mistaken for each other may sound different. Our approach integrates stroke-based character embeddings for efficient blocking, supervised classification with active learning for record matching, and graph-based clustering for final linkage. We demonstrate the effectiveness of this pipeline using the career records of officials in the China Government Employee Database-Qing Jinshenlu (CGED-Q JSL). We achieve improved linkage quality compared to standard probabilistic methods, with longer linked sequences of career records and fewer aberrant transitions. To validate the generalizability, we also successfully apply the pipeline to another database and a cross-database linkage task. By minimizing the need for manual tuning, our pipeline offers a more accessible and effective solution for Chinese historical data linkage.
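
For readers unfamiliar with the last stage named in the abstract, the following is a minimal sketch of one simple form of graph-based clustering: treating records as nodes, pairs judged to be matches as edges, and taking connected components as the linked sets. It is an illustration only, not the paper's implementation; the paper may use a different clustering method, and the function name `cluster_matches`, the record IDs, and the matched pairs are all hypothetical.

```python
from collections import defaultdict


def cluster_matches(record_ids, matched_pairs):
    """Group records into linked sets via connected components (union-find).

    Assumption: a pair appears in matched_pairs only if an upstream
    classifier judged the two records to refer to the same person.
    """
    parent = {r: r for r in record_ids}

    def find(x):
        # Walk up to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two components

    clusters = defaultdict(list)
    for r in record_ids:
        clusters[find(r)].append(r)
    # Sort members and clusters so the result is deterministic.
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical record IDs and classifier output, for illustration only.
records = ["r1", "r2", "r3", "r4", "r5"]
pairs = [("r1", "r2"), ("r2", "r3"), ("r4", "r5")]
linked = cluster_matches(records, pairs)
```

In this toy input, r1 and r3 end up in the same linked set even though they were never directly compared as a match, because both are linked to r2. That transitivity is what turns pairwise decisions into sequences of records, and it is also why a single false match can merge two individuals, which is one reason more conservative clustering rules are sometimes preferred.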