Using historical newspaper data to deal with measurement error

Andreas Ferrara, Joung Yeob Ha, Randall Walsh 18 May 2022

Data that were not collected in the past are often unavailable to researchers and policymakers in the present. A recent and fast-growing literature attempts to get around this limitation by collecting information buried in old newspapers. Digitised newspaper articles contain information on wages, prices, and events such as strikes, episodes of violence, and natural disasters, and they offer fine-grained geographic and time variation. To date, historical newspapers have been used to extract information on, among other topics, anti-Black sentiment (Ottinger and Winkler 2022), fertility restrictions (Beach and Hanlon 2019), and the prices and types of cotton seeds (Rhode 2021). As with data extracted from maps via machine-learning methods (Combes et al. 2020), the increasing availability and scope of digital newspaper archives (e.g. newspapers.com or Chronicling America) open avenues to generate long-run data previously unavailable to researchers. This development follows a recent trend of integrating historical data and methods into mainstream economics (Margo 2017).

Typically, researchers collect newspaper-based data for the purpose of using them as outcome, treatment, or control variables in statistical analysis. In a recent paper (Ferrara et al. 2022), we show how data generated from historical newspaper articles can be used for another important purpose: to resolve measurement error in statistical analysis. We build on the framework by Chalfin and McCrary (2018), who argue that when a researcher has two mis-measured variables for the same quantity of interest, one measure can serve as an instrument for the other and recover the true coefficient of interest as long as the errors in the two variables are uncorrelated. However, this is rarely an option, as collecting a second independent measure for the same quantity of interest tends to be prohibitively expensive. We show that a second such measure can be cheaply generated from digitised newspapers, and outline the conditions under which such a strategy is likely to succeed in practical and empirical terms. 
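The logic of using one noisy measure as an instrument for another can be illustrated with a small simulation. This is a minimal sketch under assumed values, not the estimation in Ferrara et al. (2022): the true coefficient, the noise levels, the sample size, and all variable names are hypothetical, and the regressor is continuous for simplicity.

```python
import random


def cov(a, b):
    """Sample covariance of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (n - 1)


def simulate(n=20000, beta=2.0, seed=0):
    """Compare OLS on one noisy measure with IV using a second noisy measure."""
    rng = random.Random(seed)
    # Unobserved true regressor and the outcome it drives (assumed linear model).
    x_true = [rng.gauss(0, 1) for _ in range(n)]
    y = [beta * x + rng.gauss(0, 1) for x in x_true]
    # Two mis-measured versions of x_true with independent errors.
    x_map = [x + rng.gauss(0, 1.0) for x in x_true]   # stands in for the map
    x_news = [x + rng.gauss(0, 1.5) for x in x_true]  # noisier, like newspapers
    # OLS slope of y on the first noisy measure (attenuated toward zero).
    ols = cov(x_map, y) / cov(x_map, x_map)
    # IV slope: the second measure instruments the first.
    iv = cov(x_news, y) / cov(x_news, x_map)
    return ols, iv


ols_estimate, iv_estimate = simulate()
print(f"OLS on noisy measure:      {ols_estimate:.2f}")
print(f"IV with second measure:    {iv_estimate:.2f}")
```

With these assumed parameters the error variance in `x_map` equals the variance of the true regressor, so the OLS slope should land near half of `beta`, while the IV slope should land near `beta` itself even though the instrument is the noisier of the two measures. The correction relies on the two errors being drawn independently, which mirrors the uncorrelated-errors condition described above.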

To illustrate our framework, we replicate two recent studies on the effect of the spread of the boll weevil on economic outcomes in the US South between 1892 and 1922 by Ager et al. (2017) and Clay et al. (2019). The spread of the boll weevil is commonly measured from a map that was published by the US Department of Agriculture (USDA). While this map is generally of high quality, it contains errors such as the crossing of date lines shown in Figure 1. Each of these lines marks the furthest spread of the cotton-boll consuming beetle in a given year, so in theory the date lines cannot cross. If such errors occur at random, regression estimates of the boll weevil’s impact are attenuated, that is, biased toward zero. An accurate estimate of the pest’s effect is important because it informs policymakers who wish to use it as a baseline comparison for the spread of insects in other contexts today.

Figure 1 Errors in the USDA map for the spread of the boll weevil

Notes: The USDA map for the spread of the boll weevil shows the arrival date of the beetle in each year across Southern counties between 1892 and 1922. Each line marks a different arrival year, meaning in principle that these lines should not cross. In practice, such crossings occur; examples are marked by the red rectangles.

To resolve this measurement error problem in the Chalfin and McCrary (2018) framework, one needs a second measure of the boll weevil’s spread over time and space that can be collected at a reasonable cost. We generate such a measure from digitised newspapers by searching newspapers.com for articles that include the words “boll weevil” and the name of the county for which we seek to measure the pest’s arrival. While not all counties have digitised newspapers of their own, newspapers elsewhere in the same state reported appearances of the weevil in other parts of the state, as in the examples provided in Figure 2.

Figure 2 Boll weevil reporting in local newspapers

Notes: Newspaper reports of the arrival of the boll weevil in Marion County, Mississippi, by the Jackson Daily News in Hinds County (left) and the Star Ledger in Attala County (right).

We construct a noisy newspaper-based measure of the boll weevil’s arrival date: the year in which the five-year moving average of the share of papers mentioning the pest together with a county’s name reaches its maximum. Figure 3 compares this newspaper-based arrival measure with the USDA map date for Marion County. The raw newspaper data are noisy, which is why we smooth them with the five-year moving average. For our framework to hold, it does not matter whether the newspaper data provide a more or less noisy measure of the boll weevil’s arrival date than the USDA map. The only required assumption is that errors in the map are uncorrelated with errors in the newspaper coverage of the pest.
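The construction of the arrival year from a smoothed mention share can be sketched as follows. This is an illustration only: the yearly shares are made up, the function names are hypothetical, and the choice of a centred window (with shorter windows at the edges of the series) is an assumption, since the text does not specify how the moving average is aligned.

```python
def moving_average(values, window=5):
    """Centred moving average; near the edges the window shrinks to the data available."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def predicted_arrival_year(years, shares, window=5):
    """Year at which the smoothed mention share reaches its maximum."""
    smoothed = moving_average(shares, window)
    peak_index = max(range(len(smoothed)), key=smoothed.__getitem__)
    return years[peak_index]


# Hypothetical shares of county articles that also mention "boll weevil".
years = list(range(1900, 1916))
shares = [0.00, 0.01, 0.00, 0.12, 0.00, 0.01, 0.02, 0.05,
          0.09, 0.11, 0.10, 0.08, 0.04, 0.02, 0.01, 0.00]

raw_peak_year = years[shares.index(max(shares))]
smoothed_peak_year = predicted_arrival_year(years, shares)
print(raw_peak_year, smoothed_peak_year)
```

In this made-up series, a single early spike is the raw maximum, but the smoothed series peaks during the later sustained rise in coverage. That is the kind of noise the five-year moving average is meant to dampen.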

Figure 3 Example of the newspaper-based and USDA arrival dates for Marion County, MS

Notes: The dashed line is the share of articles mentioning “boll weevil” and “Marion County” among all articles mentioning “Marion County” in every available newspaper outlet in Mississippi between 1882 and 1932. The solid line represents five-year moving averages of this share. The red horizontal line shows the boll weevil’s arrival in Marion County from the USDA map. The blue horizontal line indicates the predicted arrival at the maximum of the five-year moving average.

We outline three ways in which this second measure can be used to address measurement error in the USDA map arrival date variable: set identification, sample restrictions, and a parametric bias correction. The most intuitive of these is the sample restriction, where we use only observations for which the arrival date in the USDA map and in the newspaper-based measure coincide. We call this the ‘agreement sample’. While either of the two measures may be wrong, the probability that both are wrong and still agree with each other is considerably lower.
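The intuition behind the agreement sample can be checked with a small simulation. This is a sketch under assumed error rates, not the procedure or data of the paper: each hypothetical measure reports the true year 60% of the time and is off by one year in either direction otherwise, with errors drawn independently across the two measures.

```python
import random


def simulate_agreement(n_counties=5000, seed=1):
    """Share of correct map dates overall and within the agreement sample."""
    rng = random.Random(seed)
    # Assumed error distribution: correct with probability 0.6, else -1 or +1 year.
    errors = [-1, 0, 0, 0, 1]
    rows = []
    for _ in range(n_counties):
        true_year = rng.randint(1892, 1922)
        map_year = true_year + rng.choice(errors)   # stands in for the USDA map
        news_year = true_year + rng.choice(errors)  # stands in for newspapers
        rows.append((true_year, map_year, news_year))
    # Agreement sample: counties where both measures report the same year.
    agreement = [r for r in rows if r[1] == r[2]]
    share_map_correct = sum(r[0] == r[1] for r in rows) / len(rows)
    share_agree_correct = sum(r[0] == r[1] for r in agreement) / len(agreement)
    return share_map_correct, share_agree_correct, len(agreement)


full, restricted, n_agree = simulate_agreement()
print(f"Map date correct, all counties:     {full:.2f}")
print(f"Map date correct, agreement sample: {restricted:.2f}")
print(f"Counties kept: {n_agree} of 5000")
```

Under these assumptions the two measures agree in roughly 44% of counties, and the map date is correct in about 82% of those, compared with 60% in the full sample. The gain in accuracy comes at the cost of a smaller sample, which is the trade-off inherent in any sample restriction.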

The main takeaways from our replications of Ager et al. (2017) and Clay et al. (2019) are the following. First, even when using the newspaper-based arrival date instead of the USDA date, we can still replicate the findings in both papers. Had the USDA map not existed, their studies could have been conducted with information gathered from digitised newspapers. This highlights the usefulness of digitised historical newspapers for creating novel data. Second, combining the newspaper-based arrival date with the USDA date through our three measurement error correction methods increases the effect sizes and strengthens the results in both papers. Even though the newspaper-based measure is coarse and was generated in a fast and affordable way, it provides substantial value in reducing the influence of measurement error in the original USDA map.

Researchers often ignore measurement error in applied settings as long as some conventional level of statistical significance is achieved. When it is not, potentially promising research projects tend to be abandoned. We hope to provide a low-cost alternative for dealing with measurement error, especially in settings that use historical data, where measurement error is a pervasive problem. Our strategy using newspaper-based information works best for quantities that can be easily extracted from newspapers, such as events that can be readily identified with a single search term or a small set of search terms. Other quantities, such as prices, are substantially harder to extract; our approach may not be feasible in those cases.

References

Ager, P and B Herz (2019), “From the farm to the factory floor: How the structural transformation triggered the fertility transition”, VoxEU.org, 16 May.

Ager, P, M Brueckner and B Herz (2017), “The boll weevil plague and its effect on the southern agricultural sector, 1889–1929”, Explorations in Economic History 65: 94–105.

Beach, B and W Hanlon (2019), “Censorship, family planning, and the historical fertility transition”, VoxEU.org, 4 August.

Chalfin, A and J McCrary (2018), “Are U.S. Cities Underpoliced? Theory and Evidence”, Review of Economics and Statistics 100(1): 167–186.

Clay, K, E Schmick and W Troesken (2019), “The Rise and Fall of Pellagra in the American South”, Journal of Economic History 79(1): 32–62.

Combes, P-P, G Duranton, L Gobillon, C Gorin and Y Zylberberg (2020), “(Decision) trees and (random) forests: Urban economics, historical data, and machine learning”, VoxEU.org, 17 November.

Ferrara, A, J Y Ha and R Walsh (2022), “Using Digitized Newspapers to Refine Historical Measures: The Case of the Boll Weevil”, NBER Working Paper No. 29808, February.

Margo, R (2017), “The integration of economic history into economics”, VoxEU.org, 3 September.

Ottinger, S and M Winkler (2022), “The Political Economy of Propaganda: Evidence from US Newspapers”, IZA Working Paper No. 15078, February.

Rhode, P (2021), “Biological Innovation without Intellectual Property Rights: Cottonseed Markets in the Antebellum American South”, Journal of Economic History 81(1): 198–238.

Topics:  Economic history Frontiers of economic research

Tags:  measurement error, historical data, newspaper archives
