Nor is it true that the large N randomized intervention is the only viable evaluation tool available to USAID.3 If this were the case, many projects, and the millions of dollars used to fund them, could not be the subject of impact evaluations. It is for this reason that the committee offers a longer list of options than is recognized by many current texts on project evaluation (e.g., Bloom 2005; Duflo et al. 2006b). But the results of the committee’s visits to USAID offices in the field, its review of USAID documents, and its discussions with USAID DG officials and implementers suggest that randomization is feasible, at least in theory, in many instances, which would greatly enhance the ability to evaluate the impacts of a project.
Of course, no simple classification of types can hope to address all the research design issues raised by the multifaceted programs supported by USAID’s DG portfolio of projects. Arguably, every policy intervention is in some respects unique and thus poses different research design issues. Measuring impact is not easy. The committee offers the foregoing typology as a point of departure, a set of categories that capture the most salient features of different policies now supported by the USAID DG office, and the ways in which the causal impact of these policies might be feasibly evaluated. Citations in the text to existing work on these subjects should provide further guidance for project officers and implementers, although the literature on large N randomized treatment research designs is much more developed than the literature on other subjects.
1. Large N Randomized Evaluation
The ideal research design is the randomized impact evaluation.
Because of its technical demands, this approach should be employed where USAID DG officials have a strong interest in finding out the impact of an important project, especially those that are implemented in a reasonably similar form across countries (e.g., decentralization initiatives, civic education projects, election monitoring efforts). Here, a large number of units are divided by random selection into treatment and control groups, a treatment is administered, and any differences in outcomes across the two groups are examined for their significance. Randomizing the treatment attempts to break a pool of possible treated units into two groups that are similar, indeed indistinguishable, before the treatment. Then, after the treatment, measurement on the desired outcome is taken for both groups.
If there is a difference in outcomes between the groups, it can reasonably be inferred that the difference was attributable to the policy.
Randomization creates the best comparisons because the two groups—treated and untreated—are more alike than in any other design.
Because randomization, with sufficiently large numbers of units, creates two groups in which all characteristics can be assumed to be equally distributed, there is technically no need for preintervention baseline measures: they can be assumed to be the same in each group because of random assignment. The ability to do without baseline measures in large N randomized assignment designs can actually reduce the cost of this type of evaluation compared with designs that require gathering data on baseline indicators. As discussed above, however, in the context of the many projects in a country, gathering baseline data would still be valuable for evaluating the intervention in other ways and for measuring other efforts, including activities and outputs.

3 For further discussion of these issues, see Gerring and McDermott (2007).
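The logic of random assignment can be sketched in a few lines of code. The simulation below is illustrative only; the outcome model, effect size, and sample sizes are invented for the example. It randomly splits a pool of units into treatment and control groups, applies a simulated treatment, and estimates the effect as the difference in mean outcomes, with no baseline measurement entering the comparison:

```python
import random
import statistics

def randomized_evaluation(baselines, treat_effect, seed=0):
    """Randomly split units into treatment and control groups,
    simulate posttreatment outcomes, and estimate the effect as
    the difference in mean outcomes between the two groups."""
    rng = random.Random(seed)
    shuffled = baselines[:]
    rng.shuffle(shuffled)  # random assignment balances the groups
    half = len(shuffled) // 2
    treated, control = shuffled[:half], shuffled[half:]
    # Outcome = each unit's (unobserved) baseline level, plus noise,
    # plus the treatment effect for treated units only.
    treated_out = [u + treat_effect + rng.gauss(0, 1) for u in treated]
    control_out = [u + rng.gauss(0, 1) for u in control]
    return statistics.mean(treated_out) - statistics.mean(control_out)

rng = random.Random(1)
baselines = [rng.gauss(50, 10) for _ in range(2000)]
estimate = randomized_evaluation(baselines, treat_effect=2.0)
# With 1,000 units per arm, the estimate lands close to the true
# effect of 2.0 even though baselines were never measured.
```

With only a handful of units per arm, by contrast, chance imbalances in the baselines would swamp an effect of this size, which is why the design is described here as a large N method.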
Another advantage of randomized assignment in large N studies is that it often is perceived as the fairest method of distributing assistance in cases where the ability of USAID to provide DG assistance is limited and cannot cover all available units. Thus, for example, if only a certain fraction of judges or legislators in a country, or a certain fraction of villages in a district, can be served by a given assistance program, having a lottery to determine who gets assistance first is often judged even by participants as the fairest way to allocate resources. Since this method also creates the best impact evaluation design, it is a situation in which the ethics of assigning assistance and the goals of evaluation design are mutually reinforcing.
Common variations on the randomized treatment include “rollout” and “waiting list” protocols. With rollout protocols the treatment is given sequentially to different groups, with the order in which groups receive the treatment determined by random assignment. This solves the problem of how to distribute valued resources in a way that eventually makes them available to all but without destroying the potential for randomized control. It also offers the possibility of varying the treatment across each cohort, contingent on findings from previous cohorts. With waiting list protocols, the control group consists of groups that are otherwise qualified, and hence similar to the groups receiving treatment, but were placed on a waiting list because of limits on funding. Evaluation is then undertaken on random samples from both the treatment and waiting list (control) groups. These latter groups may (or may not) be treated in subsequent iterations.
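A rollout protocol can be sketched as a randomly ordered schedule of treatment waves. The helper below is a hypothetical illustration (the group names and wave count are invented): every group is eventually treated, but until a group’s own wave begins it serves as a control for groups treated earlier:

```python
import random

def rollout_schedule(groups, n_waves, seed=0):
    """Randomly order groups into sequential treatment waves.
    Groups in later waves act as controls for earlier waves
    until their own treatment begins."""
    rng = random.Random(seed)
    order = groups[:]
    rng.shuffle(order)  # randomization happens here, once, up front
    wave_size = -(-len(order) // n_waves)  # ceiling division
    return [order[i * wave_size:(i + 1) * wave_size]
            for i in range(n_waves)]

villages = [f"village_{i}" for i in range(12)]
waves = rollout_schedule(villages, n_waves=3)
# waves[0] is treated first; waves[1] and waves[2] are controls
# for wave 0, and so on, until all villages have been treated.
```

Because the full schedule is fixed by a single random draw at the outset, later waves cannot be reshuffled toward favored groups without breaking the design.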
There are a number of well-known problems that can undermine the effectiveness of this research design; many are discussed in standard methodology texts (e.g., see Box 9.2 in Trochim and Donnelly 2007), and some are discussed here. Perhaps most noteworthy in the case of many USAID projects is the risk of contamination, in which the treatment of some individuals or groups (e.g., training some judges or legislators) also affects the behavior of those not enrolled in training. In addition, randomized designs may encounter other problems, such as units refusing to participate or dropping out in the middle of the intervention. If large numbers of cases are available, however, most of these issues can be reasonably dealt with by amending the research design, so that, if recognized and managed, they will not fatally undermine the validity of the evaluation.
The committee recognizes that political pressures to work with certain groups or locations, or to “just get the project rolling,” can work against the careful design and implementation of randomized assignments. These and other problems are addressed in a more detailed discussion of how to apply randomization to actual USAID DG projects in Chapter 6. The present chapter focuses mainly on the methodological reasons why the efforts needed to carry out randomized assignments for project evaluations can be worthwhile in terms of the increased confidence they provide that genuine causal relationships are being discovered and hence real project impact.
Unfortunately, from the standpoint of making the most credible impact evaluations, the units chosen to receive interventions from USAID are seldom selected at random. For example, nongovernmental organizations (NGOs) chosen for funding are often selected through a competition that results in atypical NGOs receiving treatments. Or judges and administrators chosen to attend training workshops are selected based mainly on their willingness to participate. The problem is that the criteria used for selecting NGOs and judges or administrators for participation in the project are almost certainly associated with a higher propensity to succeed in the project than would be the case for the “typical” NGO or judge, and this makes it impossible to assess project efficacy. If funded NGOs are found to do well, or judges and administrators who attended workshops perform better, there is no way to rule out the possibility that the success observed is simply a function of having chosen groups or people who would have succeeded anyway, or whose success was much greater than could generally be expected. The only way to avoid this pitfall, and to be in a position to know whether or not the project has had a positive impact, is to choose project participants randomly and then compare their performance with that of participants who were not selected to take part in the activity in question.
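This selection problem can be made concrete with a small simulation (all numbers are invented for illustration). Here the project has no effect at all, yet a naive comparison of self-selected participants with nonparticipants suggests a large positive impact, because the units that opted in were already predisposed to succeed:

```python
import random
import statistics

rng = random.Random(42)

# Each unit has an underlying "capacity" that drives both its
# willingness to participate and its eventual outcome.
capacity = [rng.gauss(0, 1) for _ in range(10_000)]
true_effect = 0.0  # the project itself does nothing in this simulation

# Self-selection: only high-capacity units volunteer for the project.
participants = [c for c in capacity if c > 0.5]
others = [c for c in capacity if c <= 0.5]

outcome_participants = [c + true_effect + rng.gauss(0, 0.1)
                        for c in participants]
outcome_others = [c + rng.gauss(0, 0.1) for c in others]

# The naive comparison is large and positive even though the
# true effect of the project is exactly zero.
naive_estimate = (statistics.mean(outcome_participants)
                  - statistics.mean(outcome_others))
```

Random assignment breaks exactly this link between who gets treated and who was going to succeed anyway, which is why the naive comparison above cannot substitute for it.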
The bottom line is that if there is a strong commitment to answering the question—“Did the resources spent on a given project truly yield positive results?”—the best way to reach the most definitive answer is through an impact evaluation that involves the random selection of units for treatment and the collection of data in both treatment and control groups. As discussed in Chapter 6, many USAID DG projects that the committee encountered in the field were quite amenable in principle to randomization without significant changes in their design. And it bears repeating that in some cases of large N randomized treatment, USAID may be able to eliminate the costs of collecting baseline data, which might make this evaluation design more attractive.
Randomized evaluations are useful for determining not only whether or not a given project/activity has had an effect but also where it appears to be most effective. To see this, consider Figure 5-1, which displays hypothetical data collected on outcomes among treatment and control groups for a particular USAID project. In this example, higher scores represent more successful outcomes.
Based on these data, it can be concluded that the treatment was a success since, on average, units (people, municipalities, courts, NGOs, etc.) given the treatment scored better on the outcome of interest than units in the control group. (This would need to be confirmed with a statistical test, but for now assume the two distributions are indeed different.) It is important to point out, however, that not every unit in the treatment group did better than units in the control group. Some units in the control group did better than those in the treatment group, and some in the treatment group did worse than those in the control group. In fact, at least a handful of units in the treatment group did worse than the average unit in the control group. Also, there is quite a bit of variance in the performance of those in the treatment group. By exploring the factors associated with high and low scores among the treatment cohort, inferences can be made about which ones predispose recipients of the treatment to success or failure (or, put slightly differently, where the project works well and not so well). Thus the randomized design allows us to conclude not just whether the project was effective in achieving its goals but also where efforts should be directed in the next phase in order to maximize the impact.
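The kind of comparison just described can be checked with a standard two-sample test. The sketch below simulates outcome scores for hypothetical treatment and control groups (the means, spreads, and sample sizes are invented, not taken from Figure 5-1), computes a Welch t statistic for the difference in means, and counts the overlap the text describes, namely treated units that still score below the control-group average:

```python
import math
import random
import statistics

def welch_t(a, b):
    """Two-sample Welch t statistic for a difference in means
    (does not assume equal variances)."""
    va, vb = statistics.variance(a), statistics.variance(b)
    se = math.sqrt(va / len(a) + vb / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se

rng = random.Random(7)
control = [rng.gauss(50, 10) for _ in range(400)]
treated = [rng.gauss(55, 10) for _ in range(400)]  # true effect: +5

t = welch_t(treated, control)
# Overlap: despite a clearly significant average effect, some
# treated units still fall below the control-group mean.
below = sum(1 for x in treated if x < statistics.mean(control))
```

A large t statistic establishes the average effect; examining which treated units fall in the lower tail is the starting point for the heterogeneity analysis described in the text.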
2. Large N Comparison
Despite the utility of the large N randomized design, sometimes it is simply not possible to assign units randomly to the treatment group, even when the total number of units is large. The benefits of a large number of units for observing multiple iterations of the treatment, however, can still be exploited if one can overcome the following challenge: identifying and measuring those pretreatment differences between the treatment and control groups that might account for whatever posttreatment differences are observed. In these circumstances there are a variety of statistical procedures (e.g., propensity score matching, instrumental variables) for correcting the potential selection bias that complicates the analysis of causal effects.
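As a rough illustration of the matching idea, the sketch below uses a one-dimensional nearest-neighbor match on a single pretreatment covariate; this is a simplified stand-in for propensity score matching, and all data are invented. Because treated units start out ahead, the naive comparison overstates the effect, while the matched comparison recovers something close to the true effect of 2:

```python
import random
import statistics

def nearest_neighbor_match(treated, pool, key):
    """Pair each treated unit with the untreated unit in `pool`
    whose pretreatment covariate is closest."""
    return [(t, min(pool, key=lambda c: abs(key(c) - key(t))))
            for t in treated]

rng = random.Random(3)
# Units are (pretreatment score, outcome) pairs. Treatment adds 2 to
# the outcome, but treated units also start with higher scores.
treated = [(s, s + 2 + rng.gauss(0, 0.5))
           for s in (rng.gauss(1, 1) for _ in range(200))]
pool = [(s, s + rng.gauss(0, 0.5))
        for s in (rng.gauss(0, 1) for _ in range(2000))]

pairs = nearest_neighbor_match(treated, pool, key=lambda u: u[0])
matched_effect = statistics.mean(t[1] - c[1] for t, c in pairs)

# Naive comparison of group means, ignoring pretreatment differences.
naive_effect = (statistics.mean(t[1] for t in treated)
                - statistics.mean(c[1] for c in pool))
```

Full propensity score matching generalizes this idea to many covariates by matching on the estimated probability of treatment rather than a single score, but the logic, comparing each treated unit with its most similar untreated counterpart, is the same.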
The “matching” research design seeks to identify units that are similar to those receiving treatment and then compare outcomes.4 For example, Heckman et al. (1997) sought to evaluate a job training program in the United States, the Job Training Partnership Act (JTPA). The JTPA provided on-the-job training, job search assistance, and classroom training to youth and adults who qualified (see Devine and Heckman for a more detailed analysis of the program). The U.S. Department of Labor commissioned an evaluation of the project to assess the impact of the main U.S.