Development, Security, and Cooperation. Policy and Global Affairs. The National Academies Press, 500 Fifth Street, N.W., Washington, DC 20001.
government training project for disadvantaged workers. Evaluators collected longitudinal data on individuals who went through the JTPA and on those who did not. Since the individuals who received the services were not chosen randomly, the evaluators constructed a nontreated comparison group based on a number of criteria that matched the “in group” on many characteristics, such as location of residence, eligibility for the program, income, and education. Using this matching design, the evaluators were able to estimate the effect of the project by comparing data gathered before and after it started.
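The matching logic can be sketched as follows. The records, covariates, and distance measure below are invented for illustration only; they are not drawn from the JTPA evaluation, which used far richer data.

```python
# Construct a matched comparison group for a nonrandomized program.
# Records and field names are hypothetical.

def distance(a, b, keys):
    """Simple absolute-difference distance across matching characteristics."""
    return sum(abs(a[k] - b[k]) for k in keys)

def build_matched_controls(treated, pool, keys):
    """For each treated unit, pick the closest untreated unit (no replacement)."""
    controls = []
    available = list(pool)
    for t in treated:
        best = min(available, key=lambda c: distance(t, c, keys))
        controls.append(best)
        available.remove(best)
    return controls

treated = [{"income": 12, "education": 10}, {"income": 9, "education": 8}]
pool = [{"income": 30, "education": 16},
        {"income": 11, "education": 10},
        {"income": 10, "education": 8}]

matches = build_matched_controls(treated, pool, ["income", "education"])
# Each treated unit is paired with the most similar untreated unit,
# and outcomes are then compared across the matched pairs.
```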
Another technique to use in a large N situation is the regression discontinuity design (Shadish et al 2002:Chap. 7; Hahn et al 2001). Regression discontinuity is used in situations where assignment to the treatment is based on the characteristics of the group that a policy is designed to affect, and the before and after outcomes of interest are measured for both groups. For example, in a reading program, assignment to a remedial reading project is based on the preproject tests of the readers. At some cutoff point, students are assigned to the project or not. The expectation is that project success would produce a more positive trend after the intervention for those below the cutoff point. The trend before and after the intervention is estimated, and the differences are compared to see if the intervention had any discernible effect.

4 See Heckman (1997) for a more extensive discussion of the implicit behavioral assumptions.
Angrist and Lavy (1999), for example, used the regression discontinuity design to evaluate the effect of classroom size on student test scores in Israel. They compared classes with greater than and less than 40 students and found that class size was, in fact, linked to test performance.
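A minimal sketch of the regression discontinuity estimate follows. The test scores are invented so that the remedial project raises outcomes below the cutoff by exactly ten points; a real analysis would use local regression near the cutoff and proper standard errors.

```python
# Regression discontinuity sketch: students below a pretest cutoff receive
# the remedial program; the gap between the two fitted lines at the cutoff
# estimates the program's effect. All data are invented.

def ols(xs, ys):
    """Least-squares slope and intercept for one predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def rd_effect(scores, outcomes, cutoff):
    below = [(s, o) for s, o in zip(scores, outcomes) if s < cutoff]
    above = [(s, o) for s, o in zip(scores, outcomes) if s >= cutoff]
    b_slope, b_int = ols([s for s, _ in below], [o for _, o in below])
    a_slope, a_int = ols([s for s, _ in above], [o for _, o in above])
    # Predicted outcome at the cutoff from each side; the jump between
    # them is the estimated treatment effect.
    return (b_slope * cutoff + b_int) - (a_slope * cutoff + a_int)

scores = [20, 25, 30, 35, 45, 50, 55, 60]
outcomes = [20.0, 22.5, 25.0, 27.5, 22.5, 25.0, 27.5, 30.0]
effect = rd_effect(scores, outcomes, cutoff=40)
```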
Yet another design useful for large N samples is the difference-in-differences (DD) approach. A DD design compares two cases, one that received the project and one that did not, and compares the before-and-after change in the treated case with the before-and-after change in the untreated case on the relevant outcome variable.
DD estimation has become a widespread method for estimating causal relationships (Bertrand et al 2004). For example, if the DG project provides assistance to one judge and not another, before and after measures of a particular outcome variable should be taken for both and compared. In a regression that follows this design, the differences within each judge’s behavior over time and between the two judges’ behavior are both estimated. The appeal of DD comes from its simplicity as well as its potential to circumvent many of the endogeneity problems that typically arise when making comparisons between heterogeneous individuals (Meyer et al 1995).
In an example of this approach, Duflo (2000) used a DD design to evaluate the effect of school construction on education and wages in Indonesia. She compared regions that had received school construction with regions that had not yet had theirs. As always, baseline data were critical to discovering any effect from the program.
This design is useful when there is only one or a few treated units and is better than just a before-and-after analysis of a single unit since it offers a controlled comparison.
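The two-judge comparison reduces to a simple calculation. The outcome numbers below are hypothetical, chosen only to show the arithmetic of the estimator.

```python
# Difference-in-differences sketch for the two-judge example.
# Outcome values (say, case-clearance rates) are invented.

def did(treated_pre, treated_post, control_pre, control_post):
    """DD estimate: change for the treated unit minus change for the control."""
    return (treated_post - treated_pre) - (control_post - control_pre)

# The treated judge improves by 15 points, the control judge by 5,
# so the estimated effect of the assistance is 10.
effect = did(treated_pre=50, treated_post=65,
             control_pre=48, control_post=53)
```

Subtracting the control judge's change nets out whatever common trend affected both judges, which is how DD circumvents simple before-and-after bias.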
Efforts to use statistical methods to approximate randomized designs are only as effective as the evaluator’s ability to model the selection process that led some units to be given the treatment while others were not.
Attention to gathering a battery of pretreatment measures across cases is critical to an effective large N comparison. With sufficient cases and systematic efforts to measure pre- and posttreatment outcomes, large N comparisons can provide meaningful insights into project impacts even when the treatment cannot be manipulated through randomization by USAID.
3. Small N Randomization
In some instances it is possible to manipulate the policy of interest (the treatment) but only across a very small set of cases. In this case it is not possible to use probability tests derived from statistical theory to gauge the causal impact of an experiment across groups where the treatment and control groups each have only one or several members or where there is no control whatsoever. However, in other respects the challenges posed by, and advantages accrued from, this sort of analysis are quite similar to those of the large N randomized design.
Where cross-unit variance is minimal (by reason of the limited number of units at one’s disposal), the emphasis of the analysis necessarily shifts from spatial evidence (across units) to evidence garnered from temporal variation (i.e., to a comparison of pre- and posttests in the treated units). Naturally, one wants to maximize the number of treated units and the number of untreated controls. This can be achieved by a modified “rollout” protocol. Note that in a large N randomized setting (as described above), the purpose of rollout procedures is usually (1) to test a complex treatment (e.g., where multiple treatments or combinations of treatments are being tested in a single research design) or (2) for purposes of distributing a valued good among the population while preserving a control group. The most crucial issue is to maximize useful variation on the available units. This can be achieved by testing each unit in a serial fashion, regarding the remaining (untreated) units as controls.
Consider a treatment that is to be administered across six regions of a country. There are only six regions, so cross-unit variation is extremely limited. To make the most of this evidence-constrained setting, the researcher may choose to implement five iterations of the same manipulated treatment, separated by some period of time (e.g., one year). During all stages of analysis, there remains at least one unit that can be regarded as a control. This style of rollout provides five pre- and posttests and a continual (albeit shrinking) set of controls. As long as contamination effects are not severe, the results from this sort of design may be more easily interpreted than the results from a simple split-sample research design (i.e., treating three regions and retaining the others as a control group). In the latter any observed variation across treatment and control groups may be due to a confounding factor that coincides temporally and correlates spatially with the intervention.
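The six-region rollout can be sketched as a schedule. The region names are placeholders, and the choice of which region to hold back as a permanent control is an assumption for illustration.

```python
# Serial rollout sketch: with six regions, treat one region per period and
# regard the still-untreated regions as controls. One region is held back
# and never treated, so a control remains at every stage.

def rollout_schedule(units, hold_last=None):
    """Return (period, treated_unit, remaining_controls) tuples."""
    order = [u for u in units if u != hold_last]
    if hold_last is not None:
        order.append(hold_last)
    schedule = []
    for period in range(1, len(order)):        # len - 1 treatment waves
        treated = order[period - 1]
        controls = order[period:]              # still-untreated units
        schedule.append((period, treated, controls))
    return schedule

regions = ["North", "South", "East", "West", "Coast", "Capital"]
plan = rollout_schedule(regions, hold_last="Capital")
# Five waves of treatment; the capital remains an untreated control
# throughout, and the control set shrinks by one each period.
```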
Despite the randomized nature of this intervention, it is still quite possible that other matters beyond the control of the investigator may intercede. It is not always possible to tell whether or not confounding factors are present in one or more of the cases. In a large N setting, we can be more confident that such confounding factors, if present, will be equally distributed across treatment and control groups. Not so for the small N setting. This is all the more reason to try to maximize experimental leverage by setting in motion a rollout procedure that treats each unit
separately through time. Any treatment effects that are fairly consistent across the six cases are unlikely to be the result of confounding factors and are therefore interpretable as causal rather than spurious.
Note that in a small population where all units are being treated, it is likely that there will be significant problems of contamination across units.
In the scenario discussed above, for example, it is likely that untreated regions in a country will be aware of interventions implemented in other regions. Thus it is advisable to devise case selection and implementation procedures that minimize potential contamination effects. For example, in the rollout protocol discussed above, one might begin by treating regions that are most isolated, leaving the capital region for last.
Regardless of the procedure for case selection, it will be important for researchers to pay close attention to potential changes before and after the treatment is administered. That is, in small N randomization designs, it is highly advisable to collect baseline data since the comparison groups are less likely to be similar enough to compare directly.
In an example of a small N randomized evaluation, Glewwe et al (2007) used a very modest sample of 25 randomly chosen schools to evaluate the effect of the provision of textbooks on student test scores.
A Dutch nonprofit organization provided textbooks to 25 rural Kenyan primary schools chosen randomly from a group of 100 candidate schools.
The authors found no evidence that the project increased average scores, reduced grade repetition, or affected dropout rates (although they did find that the project increased the scores of the top two quintiles of those with the highest preintervention academic achievement). Evidently, simply providing the textbooks only helped those who were already the most motivated or accomplished; in the absence of other changes (e.g., better attendance, more prepared or involved teachers), the books alone produced little or no change in average students’ achievement. It is important to note that, like other forms of impact evaluation, this study required good baseline data to conduct its evaluation.
4. Small N Comparison
In small N designs USAID may be unable to manipulate the temporal or spatial distribution of the treatment. In this context the evaluator faces the additional hurdle of not having sufficient cases to employ statistical procedures to correct for the biases that make identifying causal effects difficult when treatments cannot be manipulated.
Nonetheless, there are still advantages to identifying units that will not be treated and gathering pre- and posttreatment measures of outcomes in both the treatment and control groups. A control group is useful here for (1) ruling out the possibility that the intervention coincided with a temporal change or trend that might account for observed changes in the treatment group and (2) ensuring that application of the treatment was not correlated with other characteristics of the treated units that could explain observed differences between the treatment and control groups.
Ideally, the control group in a small N comparison should be matched to the treatment group as precisely as possible. With large amounts of data, propensity score matching techniques can be used to identify a control group that approximates the treated units across a range of observables.
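A propensity score can be estimated and used for matching roughly as follows. Everything here is an illustrative simplification: the single covariate, the hand-rolled gradient-descent logistic fit, and the data are all invented, and real applications use many covariates and standard statistical software.

```python
# Propensity score matching sketch: estimate each unit's probability of
# treatment from an observable with a tiny logistic regression, then match
# treated units to untreated units with the closest scores.
import math

def fit_logit(xs, ys, steps=5000, lr=0.1):
    """One-covariate logistic regression fit by gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            p = 1 / (1 + math.exp(-(w * x + b)))
            gw += (p - y) * x
            gb += (p - y)
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b

def propensity(x, w, b):
    return 1 / (1 + math.exp(-(w * x + b)))

# x: a single observable (say, a poverty index); y: 1 if the unit was treated.
xs = [0.2, 0.4, 0.5, 0.7, 0.8, 0.9]
ys = [0, 0, 1, 0, 1, 1]
w, b = fit_logit(xs, ys)
scores = [propensity(x, w, b) for x in xs]

# Match each treated unit to the untreated unit with the nearest score.
treated = [i for i, y in enumerate(ys) if y == 1]
control = [i for i, y in enumerate(ys) if y == 0]
matches = {t: min(control, key=lambda c: abs(scores[t] - scores[c]))
           for t in treated}
```

Outcomes are then compared only within matched pairs, so treated units are judged against controls that were about equally likely to receive the treatment.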
When data are not widely available, a control group can be generated qualitatively by identifying untreated units that are similar to those in the treatment group on key dimensions (other than the treatment) that might affect the outcomes of interest.
5. N = 1 Study with USAID Control over Timing and Location of Treatment

Sometimes, there is no possibility of spatial comparison. This is often the case where the unit of concern exists only at a national level (e.g., an electoral administration body), and nearby nation-states do not offer the promise of pre- or posttreatment equivalence. In this case the researcher is forced to reach causal inferences on the basis of a single case. Even so, the possibility of a manipulated treatment offers distinct advantages over the unmanipulated (observed) treatment. The ability to choose the timing of the intervention and plan observations to maximize the likelihood of accurate inferences can provide considerable leverage for credible conclusions. However, these advantages accrue only if very careful attention is paid to the timing of the intervention, the nature of the intervention, its anticipated causal effect, and the pre- and posttreatment evidence that might be available. The challenge here is to overcome the problems that are already highlighted here with regard to simple before and after comparisons.
First, with respect to timing, it is essential that the intervention occur during a period in which no other potentially contaminating factors are at work and in which the outcome factors being observed would be expected to be relatively stable; that is, a constant trend is expected, so that any changes in that trend are easily interpreted. Naturally, these matters lie partly in the future and therefore cannot always be anticipated. Nonetheless, the delicacy of this research design—its extreme sensitivity to any violation of ceteris paribus assumptions—requires the researcher to anticipate what may occur, at least through the duration of the experiment.
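The before-and-after logic for a single unit can be sketched as an interrupted time series. The outcome values below are invented and deliberately simple: a stable trend, then a level shift at the intervention.

```python
# Single-case (N = 1) sketch: compare the outcome trend before and after a
# manipulated intervention. Under the stable-trend assumption, a jump in
# level or a change in slope at the intervention is the evidence of interest.

def trend(series):
    """Average period-to-period change."""
    return sum(b - a for a, b in zip(series, series[1:])) / (len(series) - 1)

pre = [10.0, 10.5, 11.0, 11.5, 12.0]    # stable trend before the treatment
post = [14.0, 14.5, 15.0, 15.5, 16.0]   # same slope, shifted level

level_jump = post[0] - pre[-1]           # immediate change at the intervention
slope_change = trend(post) - trend(pre)  # change in the underlying trend
```

If other factors are genuinely constant over the study period, the level jump is attributable to the intervention; if the pretreatment trend is noisy or interrupted by other events, this inference collapses, which is why the design is so sensitive to its ceteris paribus assumptions.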