High-stakes school testing: New evidence

Victor Lavy, Avraham Ebenstein, Sefi Roth

20 November 2014



Although many countries use high-stakes testing to rank students for college admission, the consequences of this policy are largely unknown. Does a particularly good or bad performance on a high-stakes examination have long-term consequences for test takers, after accounting for a student’s cognitive ability? If completely random shocks to student performance have permanent wage consequences, this suggests that the use of high-stakes testing as a primary method for ranking students may be inefficient. Aggregate welfare may also be reduced by relying too heavily on examinations that provide noisy measures of student quality, since this may lead to poor matching between students and occupations, and an inefficient allocation of labour. Recent debate over the planned redesign of the SATs has been in part motivated by concerns that the current version is highly random and does not represent a fair measure of student quality (New York Times 2014).1 Despite a dearth of evidence regarding the consequences of these tests, they are used extensively around the world to rank students and allocate opportunity by acting as a gatekeeper in admissions.

Assessing the consequences of using high-stakes examinations for ranking students is challenging. First, large data samples with both standardised test scores and adult wages are generally not available for a representative population.2 Second, since higher-ability students presumably perform better on high-stakes tests, it is difficult to distinguish the return to cognitive ability from the return to doing well on the examination. One possible solution is to examine the consequences of fluctuations in a random component affecting performance on these tests. One candidate is fluctuation in air pollution, which may affect cognitive acuity and test scores and thereby generate plausibly random variation in a given student’s outcome. Air pollution has been demonstrated to adversely affect human productivity across a variety of tasks (Graff Zivin and Neidell 2012, Chang et al. 2014). Since students are assigned to test sites without prior knowledge of pollution or the ability to reschedule, pollution represents an exogenous factor affecting performance. This may enable direct measurement of the return to the component of a student’s score that is related entirely to luck, and provide evidence regarding whether these tests do or do not have long-term consequences.

Pollution and test results in Israeli high schools

In two papers, we tackle this question directly using a dataset of Israeli high-school students for whom we observe scores on the Bagrut, the pollution they experienced during their exams, and their income and education during adulthood. The Bagrut is a series of high-stakes exams that students take following each year of high school. It is tremendously important for the admissions prospects of Israeli students at top schools, and also determines the range of occupations available to a student. Our two papers respectively attempt to assess (1) the sensitivity of Bagrut scores to pollution exposure and (2) the influence of Bagrut scores on adult economic outcomes.

In the first paper, we examine the universe of Israeli test takers during 2000–2002, for which we observe pollution and outcomes for over 400,000 subject examinations. Since we observe the same student at multiple test administrations following each year of high school, we can control for time-invariant features of both a school and a particular student. The rigorous nature of the Bagrut tests and the precise scoring of the exams provide a context to analyse a potential link between cognition and air pollution, even if there are only modest declines in cognitive performance due to pollution. Furthermore, Israel’s small size and well-developed monitoring system imply that most of its testing locations are near a station where we observe precise levels of pollution concentration. Lastly, Israel’s ethnic heterogeneity provides a context to examine the responsiveness of different groups to pollution, and potentially distinguish between different mechanisms by which pollution may affect cognitive performance. We present considerable evidence in the paper that scores are significantly affected by pollution, and the heterogeneity in the measured effects points to a physiological mechanism for this effect. Note also that the effect is very temporary. As shown in Figure 1, the impact of pollution is pronounced on the day of the exam, with much smaller effects observed the day before or after.

Figure 1. Impact of PM2.5 on test scores in the days pre- and post-examination

Notes: The figure plots the coefficients from a regression of Bagrut test scores on PM2.5 readings in the days prior to, the day of (Day=0), and the days following the examination. Standard errors are clustered by school.
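The student fixed-effects logic of the first paper can be illustrated with a minimal sketch. It assumes purely synthetic data: the sample sizes, the effect size of -0.05 points per pollution unit, and the function name `within_student_slope` are all hypothetical and are not taken from the papers. The sketch shows why observing the same student at several exams helps: subtracting each student's own averages removes fixed traits such as ability, so only exam-to-exam variation in pollution identifies the slope.

```python
import numpy as np


def within_student_slope(student_ids, pollution, scores):
    """Slope of score on pollution after subtracting each student's own means.

    Demeaning within student is equivalent to including student fixed effects
    in a regression with a single regressor and no other controls.
    """
    x = np.asarray(pollution, dtype=float)
    y = np.asarray(scores, dtype=float)
    # Map arbitrary student identifiers to integer group indices.
    _, idx = np.unique(np.asarray(student_ids), return_inverse=True)
    counts = np.bincount(idx)
    x_dm = x - (np.bincount(idx, weights=x) / counts)[idx]
    y_dm = y - (np.bincount(idx, weights=y) / counts)[idx]
    return float(x_dm @ y_dm / (x_dm @ x_dm))


# --- Synthetic illustration (all numbers are assumptions, not paper data) ---
rng = np.random.default_rng(0)
n_students, n_exams = 2000, 5
true_effect = -0.05  # hypothetical score points per unit of pollution

ability = rng.normal(70, 10, n_students)
ids = np.repeat(np.arange(n_students), n_exams)

# Make the typical local pollution level correlate with ability, so that a
# pooled regression confounds pollution with student quality.
local_level = 40 - 0.5 * (ability - 70)
pollution = local_level[ids] + rng.normal(0, 8, ids.size)
scores = ability[ids] + true_effect * pollution + rng.normal(0, 5, ids.size)

pooled_slope = float(np.polyfit(pollution, scores, 1)[0])
fe_slope = within_student_slope(ids, pollution, scores)

print(f"pooled OLS slope:     {pooled_slope:.3f}")
print(f"within-student slope: {fe_slope:.3f} (true effect {true_effect})")
```

In this synthetic setup the pooled slope is badly biased by the built-in correlation between ability and local pollution, while the within-student slope stays close to the assumed effect. The papers' actual specifications include further controls and clustered standard errors that this sketch omits.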

In the second paper, we examine student-level data and ask a more compelling economic question: What are the long-run implications for a student of having a ‘bad day’ on a high-stakes exam? We evaluate this question using the same sample of students, and their average exposure to pollution across their examinations. Since students at the same school take exams on different days, there is considerable variation in pollution exposure even after including school fixed effects. Our results are highlighted in Table 1. We estimate that an additional 10 units of PM2.5 (suspended particulates smaller than 2.5 μm) in average exposure is associated with a 0.023% decline in a student’s Bagrut composite score, a 0.03% decline in Bagrut matriculation certification, a 0.019% decline in enrolment in post-secondary education, and a 0.15 decline in years of education at university. The wage consequences are also significant, with an additional 10 units of PM2.5 in average pollution exposure lowering monthly income by 109 Israeli shekels ($29) – roughly a 2% decline. This suggests that students who take an exam during a severe pollution episode experience non-trivial long-run consequences, both academically and economically.

Table 1. Reduced-form effect of particulate matter on post-secondary education and adult earnings

Notes: Each cell in the table represents a separate regression. The table reports the relationship between average PM2.5 (AQI) during the Bagrut and the listed outcome, estimated using the student-level sample described in Table 1. All regressions include suppressed controls for average temperature and humidity during the Bagrut, mother's and father's years of schooling, sex, and age in 2010. The coefficients are reported per 100 units of AQI. Standard errors are clustered at the school level, are heteroskedasticity-consistent, and are reported below the coefficients in parentheses.

In further analysis, we exploit the strong first-stage relationship between average pollution exposure during Bagrut exams and Bagrut composite scores to estimate the economic return to each additional point on the Bagrut. Using two-stage least squares (2SLS), we estimate that each point is worth (in 2010) 66 shekels ($18) in additional monthly earnings. Since the standard deviation of Bagrut composite scores is roughly 24 points, this implies that there are significant wage consequences to the exam, even for relatively small deviations in one’s score. In light of the strong relationship between Bagrut pollution exposure and post-secondary education, we are able to use 2SLS to estimate the implied return to higher education. We estimate that an additional post-secondary year of schooling is worth 707 shekels ($191) per month, an implied return to college education of 14%. This rate is only marginally higher than existing estimates found in Israel and elsewhere for the return to post-secondary education (Frish 2009, Oreopoulos and Petronijevic 2013, Angrist and Chen 2011, Angrist and Lavy 2009). While the exclusion restriction may be violated in this context, the similarity of our estimates to existing estimates of the return to post-secondary education is supporting evidence that the magnitudes of our estimated effects of PM2.5 on schooling and economic outcomes are reasonable.3
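The instrumental-variables idea behind these estimates can be sketched on synthetic data. This is a minimal illustration under stated assumptions, not the papers' specification: it uses one instrument and no covariates, in which case 2SLS reduces to the ratio of the reduced-form slope to the first-stage slope. Every number below (a return of 50 per score point, the pollution distribution, the noise levels) and the function name `wald_iv` are hypothetical.

```python
import numpy as np


def wald_iv(z, x, y):
    """IV estimate with a single instrument z, regressor x and outcome y.

    With one instrument and no other covariates, 2SLS equals the
    reduced-form slope (y on z) divided by the first-stage slope (x on z).
    Returns (iv_estimate, first_stage_slope, reduced_form_slope).
    """
    z = np.asarray(z, dtype=float)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    zc = z - z.mean()
    first_stage = float(zc @ (x - x.mean()) / (zc @ zc))
    reduced_form = float(zc @ (y - y.mean()) / (zc @ zc))
    return reduced_form / first_stage, first_stage, reduced_form


# --- Synthetic illustration (all numbers are assumptions, not paper data) ---
rng = np.random.default_rng(1)
n = 50_000
true_return = 50.0  # hypothetical monthly earnings per score point

ability = rng.normal(0, 1, n)      # unobserved by the analyst
pollution = rng.normal(40, 15, n)  # as good as random on exam day
score = 70 + 10 * ability - 0.2 * pollution + rng.normal(0, 5, n)
# Ability raises earnings directly, so OLS of earnings on score is biased up.
earnings = 5000 + true_return * score + 400 * ability + rng.normal(0, 500, n)

ols_slope = float(np.polyfit(score, earnings, 1)[0])
iv_estimate, first_stage, reduced_form = wald_iv(pollution, score, earnings)

print(f"OLS slope:         {ols_slope:.1f}")
print(f"first stage:       {first_stage:.3f}")
print(f"reduced form:      {reduced_form:.2f}")
print(f"IV (Wald) slope:   {iv_estimate:.1f} (true return {true_return})")
```

Here the OLS slope overstates the return because ability moves both scores and earnings, while the IV ratio recovers something close to the assumed return. The sketch builds in a valid exclusion restriction by construction (pollution affects earnings only through the score), which is the assumption the text flags as potentially violated in the real data.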

Concluding remarks

First, the findings of the study contribute to a growing body of evidence that cognition is affected by pollution, which suggests that a narrow focus on the health benefits of pollution reduction may understate the true benefit of reducing air pollution. Second, our analysis highlights a major drawback of using high-stakes examinations to rank students. If completely random variation in scores can still matter ten years after a student completes high school, this suggests that placing too much weight on high-stakes exams like the Bagrut may not be consistent with meritocratic principles. Also, by temporarily lowering the productivity of human capital, high pollution levels may lead to allocative inefficiency, as students with higher human capital may be assigned a lower rank than their less qualified peers. This may lead to inefficient allocation of workers across occupations, and possibly a less productive workforce. The results highlight the danger in assigning too much weight to a student’s performance on a high-stakes exam, rather than their overall academic record.


1 In a recent discussion of the planned revisions to the SATs, the president of the College Board stated that “only 20 percent…see the college-admission tests as a fair measure of the work their students have done.” (New York Times 2014)

2 Note that in the US, the Educational Testing Service (ETS) is notoriously private and no scholarship (to our knowledge) has been carried out linking SAT scores to adult outcomes for even small subsets of the population. For military recruits, the ASVAB has been made available, but it is unclear how relevant this is for other sub-populations (Cawley et al. 2001).

3 Since the Bagrut composite score directly affects the post-secondary education options available to a student, 2SLS models of the return to post-secondary education using pollution as an instrument will be biased by the omission of the Bagrut composite score.



