
The promise of automated historical data linkage

A number of vital questions in the social sciences – relating, for example, to intergenerational mobility or the assimilation of migrants – require data that follow individuals over time. The recent digitisation of historical population censuses for the US and other countries has increased the availability of such data, but linking historical records is challenging. This column compares the performance of various linking methods and concludes that automated methods perform no worse on key dimensions than (more expensive) hand linking using standard linking variables.

A number of vital questions in the social sciences require data that follow individuals over time – from childhood to adulthood, for example, or over the course of a worker’s career. Such questions include: How has the rate of intergenerational mobility – i.e. the degree to which children’s adult outcomes are linked to their parents’ – changed over time? How quickly do immigrants assimilate, economically and socially, as they spend more time in their destination countries, both in the past and today? Do social safety net programmes affect covered children in the very long run, shaping their outcomes as adults or even their longevity? Data that follow individuals over time (often referred to as panel or longitudinal data) are expensive to collect and, as a result, relatively rare on a large scale, especially for historical periods.

The recent digitisation of historical population censuses for the US and other countries has allowed researchers to create exactly this type of data at a large scale. Because the individual-level records (including a person’s name) are released to the public after 72 years (in the case of the US), it is possible to leverage this information to follow individuals across different datasets. Recent examples of papers that investigate these questions using linked data include Long and Ferrie (2013), Collins and Wanamaker (2014), Aizer et al. (2016), Eriksson (2018), Feigenbaum (2018), and Abramitzky et al. (2019b).

However, linking historical data introduces particular challenges. In a recent paper (Abramitzky et al. 2019a), we evaluate the performance of various linking methods, suggest best practices for linking historical records, and provide user-friendly implementation code.

One challenge we address is the prevalence in historical data of common names, along with transcription and enumeration errors, age misreporting, mortality, under-enumeration, and international migration between census years. Because historical data lack unique identifiers such as Social Security numbers, it is often impossible to find the correct match with certainty. Finding the same individual in two datasets instead requires using characteristics such as first and last name, reported age, and birthplace.

Consider the following (real-world) linking problem: you want to match a man who is listed in the 1915 Iowa census as ‘Paul Coulter, three years old, born in Kansas’ to the 1940 US federal census. After much searching, you find two possible candidates in the 1940 census: ‘Paul Coater, 28 years old, born in Kansas’, and ‘Paul Courter, 29 years old, born in Kansas’. Who, if anyone, would you choose as the correct match?
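
To make the mechanics concrete, the sketch below shows the kind of rules an automated matcher applies to this example: compare names with a string-similarity measure, allow a small band of age misreporting, require agreement on birthplace, and accept a link only if exactly one candidate survives. This is a minimal, illustrative sketch – the similarity measure (Python’s built-in difflib), the 0.85 threshold, and the ±2-year age band are assumptions for exposition, not the exact parameters of any method evaluated in the paper.

```python
from difflib import SequenceMatcher

# Toy records in the form (first name, last name, age, birthplace).
# The threshold and age band below are illustrative choices, not the
# parameters of any particular method evaluated in the paper.
RECORD_1915 = ("paul", "coulter", 3, "kansas")
CANDIDATES_1940 = [
    ("paul", "coater", 28, "kansas"),
    ("paul", "courter", 29, "kansas"),
]

YEARS_ELAPSED = 1940 - 1915   # expected ageing between the two censuses
AGE_BAND = 2                  # tolerate +/- 2 years of age misreporting
NAME_THRESHOLD = 0.85         # minimum string similarity for each name

def name_similarity(a: str, b: str) -> float:
    """Similarity between two name strings, in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()

def candidate_matches(record, pool):
    """Candidates agreeing on birthplace, within the age band, and
    above the name-similarity threshold on first and last name."""
    first, last, age, birthplace = record
    hits = []
    for f, l, a, b in pool:
        if b != birthplace:
            continue
        if abs(age + YEARS_ELAPSED - a) > AGE_BAND:
            continue
        if (name_similarity(first, f) < NAME_THRESHOLD
                or name_similarity(last, l) < NAME_THRESHOLD):
            continue
        hits.append((f, l, a, b))
    return hits

hits = candidate_matches(RECORD_1915, CANDIDATES_1940)
# A conservative rule accepts a link only when exactly one candidate
# survives; several survivors would make the link ambiguous.
match = hits[0] if len(hits) == 1 else None
print(match)
```

With these particular settings only ‘Paul Courter’ clears the name threshold, but a slightly looser threshold would leave two candidates and force the record to be discarded as ambiguous – which is exactly the accuracy-versus-efficiency tension discussed below.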

In the face of this challenge, a method for matching records should aim to satisfy four goals: 

  • First, it should be accurate, making as few false matches as possible.
  • Second, it should be efficient, creating as many of the true matches as possible.
  • Third, it should be representative, generating linked samples that resemble the population of interest as closely as possible.
  • Fourth, the method should be feasible for most scholars to implement given current limitations of computing power and resources. 

Another goal of our paper is to compare the performance of automated linking algorithms and manual linking. Linking by hand has the advantage that we instinctively trust other humans more than we trust computer algorithms, but hand linking is expensive, non-replicable, and impractical at scale (consider linking the entire US population across two censuses by hand). In contrast, automated linking is rules-based, cheap, and replicable, but the algorithms may not match human performance. Ultimately, when computers and humans use the same information, it is an empirical question whether the two approaches produce similar links, and which method performs better relative to some benchmark.

One way to test the accuracy of automated methods is to compare the links created by various algorithms with genealogical hand links. While these hand links are of high quality, we hesitate to call them ‘ground truth’ because there is no way to know for sure what the true links are in this case. In a first exercise, we asked the Record Linking Lab at Brigham Young University (BYU) to check the quality of the links made via a few common automated methods by comparing them to high-quality hand links made by genealogists and users of the website FamilySearch.org. The Record Linking Lab found that links created by standard automated algorithms agree with the users of the Family Tree data in more than 95% of cases. This implies a discrepancy rate between the genealogical links made by humans and the links made by the automated methods of less than 5%.

In a second exercise, we compared the links created by automated algorithms to a dataset linking the Union Army records to the 1900 US census, which was carefully (and expensively) created by hand by trained research assistants who had access to extra information not typically available for linking, such as spouses’ and children’s names (Costa et al. 2017). Treating these data as a benchmark, we compare the relative performance of automated and hand linking methods that use only the standard linking variables (that is, names, year of birth, and state or country of birth) in creating linked samples.

Figure 1 illustrates the trade-off between matching a large fraction of records and matching the right pairs of individuals. Researchers can choose algorithms that generate very low ‘discrepancy rates’ relative to these high-quality benchmark links – which we refer to here as the ‘false positive rate’ – as low as 5-10% in the context of the Union Army records. However, achieving a low false positive rate comes at the cost of accepting a relatively low (true) match rate (10-30%). Alternatively, researchers can choose algorithms with higher (true) match rates (50-60%) at the cost of higher discrepancy rates (15-30%).
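
The two rates in Figure 1 are straightforward to compute once a benchmark is available. The sketch below is a minimal illustration under assumed data structures: each linked sample is represented as a dictionary mapping record IDs in the first dataset to the IDs they are linked to in the second, and the toy dictionaries stand in for real linked samples.

```python
# Evaluating a linking algorithm against high-quality benchmark links.
# Links are represented as {id_in_dataset_A: id_in_dataset_B}; the toy
# dictionaries below are illustrative stand-ins for real samples.

def evaluate(algorithm_links: dict, benchmark: dict) -> tuple[float, float]:
    """Return (match rate, discrepancy rate) relative to a benchmark.

    Match rate: share of benchmark links that the algorithm also makes.
    Discrepancy ('false positive') rate: share of the algorithm's links
    that disagree with the benchmark.
    """
    agree = sum(1 for a, b in algorithm_links.items() if benchmark.get(a) == b)
    match_rate = agree / len(benchmark)
    discrepancy_rate = 1 - agree / len(algorithm_links)
    return match_rate, discrepancy_rate

benchmark = {"r1": "s1", "r2": "s2", "r3": "s3", "r4": "s4"}
conservative = {"r1": "s1"}                                    # few links, all correct
aggressive = {"r1": "s1", "r2": "s2", "r3": "s9", "r4": "s8"}  # more links, some wrong

print(evaluate(conservative, benchmark))  # (0.25, 0.0): accurate but inefficient
print(evaluate(aggressive, benchmark))    # (0.50, 0.5): efficient but error-prone
```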

Figure 1 Accuracy versus efficiency: Comparing linking algorithms (Union Army 1900 census) 

Strikingly, hand linking that relies on the typically used linking variables (name, age, and place of birth) results in trade-offs of similar magnitude, producing ‘false positive’ rates of around 25% and (true) match rates of around 63%. Humans using these variables tend to match more observations than the more conservative automated methods, but at the cost of higher false positive rates. When humans and automated methods use the same information for linking, automated methods have very low discrepancy rates relative to hand links.

Our next test uses data from two different transcriptions of the 1940 federal census, one by FamilySearch and one by Ancestry.com. In this case, we can establish a real ‘ground truth’, because records listed on the same census manuscript page and line number are known to refer to the same individual; however, we can only observe certain kinds of linking failures (e.g. those due to transcription errors, but not to mortality between census waves). We find that transcription discrepancies are common for names (but not for ages), particularly for the foreign born from non-English-speaking countries: between 7% and 14% of first names (and 17–32% of last names) differ by at least one character across the two transcribed versions. Despite these transcription differences, automated methods produce links between the two versions of the 1940 census that are almost 100% correct. We note that, even when linking a census to itself, we can only link around 50% of observations. This suggests an upper bound on the match rate of any linking method – automated or by hand – due to transcription quality and to common names such as James Smith, for which it is harder to find a unique match across the two datasets.
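
As an illustration of this transcription comparison, the sketch below flags name pairs that differ by at least one character, using a from-scratch Levenshtein (edit) distance; the example names are invented, and the pairing by manuscript page and line number is assumed to have been done already.

```python
# Flagging transcription discrepancies between two versions of the same
# census records, already paired by manuscript page and line number.
# The example names are invented for illustration.

def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

transcription_a = ["coulter", "smith", "johansen"]   # e.g. one transcriber
transcription_b = ["coulter", "smith", "johanson"]   # e.g. the other

differs = [levenshtein(x, y) >= 1 for x, y in zip(transcription_a, transcription_b)]
print(f"{sum(differs) / len(differs):.0%} of names differ by at least one character")
```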

Ultimately, the goal of constructing linked samples is to conduct economic analyses. The final section of our paper studies how automated linking methods affect inference using two samples. In one exercise, we examine the sensitivity of regression estimates to the choice of linking algorithm using data linking the 1915 Iowa census to the 1940 federal census. These data allow us to study a set of typical regressions documenting intergenerational mobility between fathers and their sons. Here we do not have a proxy for ground truth, but we can compare the results obtained in samples linked by hand to those obtained in samples linked using automated methods. Across a wide set of outcome and explanatory variables, we find that parameter estimates are stable across linking methods, with the estimates using automated and hand links similar in magnitude and well within each other’s 95% confidence intervals. This stability is not surprising, since we also find that human linkers and automated methods agree in over 90% of cases. In the few cases of disagreement, it is not clear from inspection whether the computer algorithms or the hand linkers are correct.
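
The stability check itself can be illustrated with a short simulation. The sketch below is purely illustrative: it draws two fake father-son samples (standing in for the hand-linked and automatically linked data), estimates the same mobility regression on each with statsmodels, and checks that each point estimate falls inside the other’s 95% confidence interval.

```python
# Illustrative stability check: the same regression on two linked
# samples. The data are simulated; in practice the outcomes would come
# from the hand-linked and automatically linked census samples.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

def mobility_estimate(n: int, slope: float = 0.4):
    """Regress sons' outcomes on fathers'; return (beta, 95% CI)."""
    father = rng.normal(size=n)
    son = slope * father + rng.normal(size=n)
    fit = sm.OLS(son, sm.add_constant(father)).fit()
    low, high = fit.conf_int()[1]   # interval for the father coefficient
    return fit.params[1], (low, high)

beta_hand, ci_hand = mobility_estimate(n=2000)   # stand-in: hand links
beta_auto, ci_auto = mobility_estimate(n=1800)   # stand-in: automated links

# 'Stable' here means each estimate lies inside the other's 95% CI.
stable = (ci_auto[0] <= beta_hand <= ci_auto[1]
          and ci_hand[0] <= beta_auto <= ci_hand[1])
print(round(beta_hand, 3), round(beta_auto, 3), stable)
```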

Conclusion

Overall, we conclude that automated methods perform well: it is possible to use automated methods to generate samples with very low rates of false positives, and estimates using different automated methods are stable in most cases. Automated methods are also no worse on these dimensions than hand links created with standard linking variables. Throughout the paper we also provide general guidance for researchers and offer practical tips. Our overarching advice is to create alternative samples using the various automated methods and test the robustness of the results across samples.

References

Abramitzky, R, L P Boustan, K Eriksson, J J Feigenbaum, and S Pérez (2019a), "Automated linking of historical data", NBER Working Paper No. 25825.

Abramitzky, R, L P Boustan, E Jácome, and S Pérez (2019b), "Intergenerational mobility of immigrants in the US over two centuries", NBER Working Paper No. 26408.

Aizer, A, S Eli, J Ferrie, and A Lleras-Muney (2016), "The long-run impact of cash transfers to poor families", American Economic Review 106(4): 935-71.

Collins, W J, and M H Wanamaker (2014), "Selection and economic gains in the great migration of African Americans: new evidence from linked census data", American Economic Journal: Applied Economics 6(1): 220-52.

Costa, D L, H DeSomer, E Hanss, C Roudiez, S E Wilson, and N Yetter (2017), "Union Army veterans, all grown up", Historical Methods: A Journal of Quantitative and Interdisciplinary History 50(2): 79-95.

Eriksson, K (2018), "Education and incarceration in the Jim Crow South: Evidence from Rosenwald schools", Journal of Human Resources 0816-8142R.

Feigenbaum, J J (2018), "Multiple measures of historical intergenerational mobility: Iowa 1915 to 1940", The Economic Journal 128(612): F446-F481.

Long, J, and J Ferrie (2013), "Intergenerational occupational mobility in Great Britain and the United States since 1850", American Economic Review 103(4): 1109-37.
