There are three fundamental elements of sound and credible impact evaluations. First, such evaluations require measures relevant to desired project outcomes, not merely of project activity or outputs. Second, they require good baseline, in-process, and endpoint measures of those outcomes to track the effects of interventions over time. Finally, they require comparison of those who receive assistance with appropriate nontreatment groups to determine whether any observed changes in outcomes are, in fact, due to the intervention.
The committee’s discussions with USAID staff, contractors for USAID, and our own field study of USAID missions have shown that, even within the current structure of project monitoring, USAID is already pursuing the first and second requirements. While in some cases progress remains to be made in devising appropriate outcome measures and in ensuring that time and resources are allocated to collect baseline data, USAID has generally recognized the importance of these tasks. These efforts vary from mission to mission according to available resources and priorities, so considerable variation remains among missions and projects in these regards.
However, the committee found that there is little or no evidence in current or past USAID evaluation practices that indicates the agency is making regular efforts to meet the third requirement—comparisons.
With rare exceptions, USAID evaluations and missions generally do not allocate resources to baseline and follow-up measurements on nonintervention groups. Virtually all of the USAID evaluations of which the committee is aware focus on studies of groups that received USAID DG assistance, and estimates of what would have happened in the absence of such interventions are based on assumptions and subjective judgments, rather than explicit comparisons with groups that did not receive DG assistance. It is this almost total absence of comparisons with nontreated groups, more than any other single factor, that should be addressed in order to draw more credible and powerful conclusions about the impact of USAID DG projects in the future.
To briefly illustrate the importance of conducting baseline and follow-up measurements for both treated and nontreated comparison groups, consider the following two simple examples:
1. A consulting firm claims to have a training program that will make legislators more effective. To demonstrate the program’s effectiveness, the firm recruits a dozen legislators and gives them all a year of training. The firm then measures the number of bills those legislators introduced in parliament in the year prior to the training and the number introduced in the year following the training, and finds that each legislator increased the number of bills he or she introduced by 30 to 100 percent! On this basis, the consultants claim to have demonstrated the program’s efficacy.
Yet to know whether or not the training really was effective, we would need to know how much each legislator’s performance would have changed if he or she had not taken the training program. One way of answering this question is to compare the performance of the legislators who were trained with the performance of a comparable set of legislators who were not. When someone points this out to the consultants and they go back and measure the legislative activity of all the legislators for the prior year, they find that the legislators who were not in the training group introduced, on average, exactly the same number of bills as those who were trained.
What has happened? It is possible that the increase in the number of bills presented by all legislators resulted from greater experience in office, so that everyone introduces more bills in his or her third year in office than in the first year. Or there may have been a rule change, or policy pressures, that resulted in a general increase in legislative activity. Thus it is entirely possible that the observed increase in legislative activity by those trained had nothing to do with the training program at all, and the program’s effect might have been zero.
Or it is possible that those legislators who signed up for the program were an unusual group. They might have been those legislators who were already the most active and who wanted to increase their skills. Thus the program might have worked for them but would not have worked for others. Another possibility is that the legislators who signed up were those who were the least active and who wanted the training to enable them to “catch up” with their more active colleagues. In this case the results do show merit to the training program, but again it is not clear how much such a program would help the average legislator improve.
The only way to resolve these various possibilities would be to have taken measures of legislative activity before and after the training program for both those legislators in the program and those not in the program. While it would be most desirable to have randomly assigned legislators to take the training or not, that is not necessary for the before and after comparison measures to still yield valuable and credible information. For example, even if legislators themselves chose who would receive the training, we would want to know whether the trained group had previously been more active, or less active, than their colleagues not receiving training. We could also then make statistical adjustments to the comparison, reflecting differences in prior legislative activity and experience between those who were trained and those who were not, to help determine what the true impact of the training program was, net of other factors that the training could not affect.
In short, simply knowing that a training program increased the legislative activity of those trained does not allow one to choose among the many hypotheses regarding the program’s true impact, which could range from zero to a highly effective provision of “catch-up” skills to the legislators who need them. The only way to obtain sound and credible judgments of a program’s effect is with before-and-after measurements on both the treatment group and the relevant nontreatment groups.
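The reasoning in this first example can be reduced to a simple difference-in-differences calculation: compare the trained group’s before-to-after change with the untrained group’s change over the same period. The sketch below uses entirely invented numbers (they do not come from any real USAID program) to show how a seemingly large training effect can vanish once the comparison group is taken into account:

```python
# Hypothetical bill counts for small groups of trained and untrained
# legislators. All numbers are invented purely for illustration.

trained_before = [10, 14, 8, 12]    # bills introduced in the year before training
trained_after = [16, 20, 13, 18]    # bills introduced in the year after training

untrained_before = [6, 9, 11, 7]    # comparison group, same two years
untrained_after = [11, 15, 18, 12]

def mean(xs):
    return sum(xs) / len(xs)

# Naive before/after estimate: looks only at the trained group, and so
# mixes the training effect with any general trend (growing experience,
# rule changes, policy pressures) affecting all legislators.
naive_change = mean(trained_after) - mean(trained_before)

# Difference-in-differences estimate: subtracts the untrained group's
# change over the same period, removing trends common to both groups.
did = naive_change - (mean(untrained_after) - mean(untrained_before))

print(f"naive before/after estimate:        {naive_change:+.2f} bills")
print(f"difference-in-differences estimate: {did:+.2f} bills")
```

With these invented numbers the naive estimate shows a gain of several bills per legislator, while the difference-in-differences estimate is zero: the untrained legislators improved just as much, so nothing here can be attributed to the training.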
2. A donor hires a consulting firm whose training program is claimed to reduce judicial corruption. To demonstrate the program’s effectiveness, the firm recruits a dozen judges to receive the program’s training for a year. When the consultants examine the rate of perceived bribery and corruption, or count cases thrown out or settled in favor of the higher-status plaintiff or defendant, in those courts where the judges were trained, they find that there has been no reduction in those measures of corruption. On this basis the donor might decide that the program did not work. However, to really reach this conclusion, the donor would have to know whether, and by how much, corruption would have changed if those judges had not received the training. When the donor asks for data on perceived bribery and corruption, or counts of cases thrown out or settled in favor of higher-status plaintiffs or defendants, in other courts, corruption turns out to be much higher there than in the courts whose judges did receive the training.
Again, the new information forces us to ask: What really happened?
It is possible that opportunities for corruption increased in the country, so that most judges moved to higher levels of corruption. In this case the constant level of corruption observed in the courts whose judges received training indicates a substantially greater ability to resist those opportunities. So, when properly evaluated against a comparison group, it turns out that the program was, in fact, effective. To be sure, however, it would be valuable also to have baseline data on corruption levels in the courts whose judges were not trained; this would confirm the belief that corruption levels increased generally except in the courts whose judges received the program. Without such data it is not known for certain whether this is true or whether the judges who signed up for the training were already those who were struggling against corruption and who started with much lower rates of corruption than other courts.
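The same comparison logic, applied to the judges example with invented numbers, shows how a flat outcome in the treated courts can still indicate a real program effect once the comparison courts are considered:

```python
# Hypothetical corruption index (e.g., share of cases settled in favor of
# the higher-status party) for courts with trained and untrained judges.
# All values are invented for illustration only.

trained_courts = {"before": 0.30, "after": 0.30}    # no apparent change
untrained_courts = {"before": 0.30, "after": 0.45}  # corruption rose elsewhere

# Looking only at the trained courts suggests the program did nothing.
naive_change = trained_courts["after"] - trained_courts["before"]

# Comparing against the untrained courts reveals that the program held
# corruption flat while it rose elsewhere: a negative (beneficial) effect.
did = naive_change - (untrained_courts["after"] - untrained_courts["before"])

print(f"naive estimate of program effect:   {naive_change:+.2f}")
print(f"difference-in-differences estimate: {did:+.2f}")
```

As the text notes, this comparison is only credible if the trained and untrained courts started from similar baselines, which is exactly why baseline measurements on both groups matter.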
These examples underscore the vital importance of comparisons with groups not receiving the treatment in order to avoid misleading conclusions and to evaluate project impacts accurately. From a public policy standpoint, the cost of such errors can be high. In the examples given here, they might have led aid programs to waste money on training programs that were, in fact, ineffective, or to cut funding for anticorruption programs that were, in fact, highly valuable in preventing substantial increases in corruption.
This chapter discusses how best to obtain comparisons for evaluating USAID democracy assistance projects. Such comparisons range from the most rigorous possible—comparing randomly chosen treatment and nontreatment groups—to a variety of less exacting but still highly useful comparisons, including multiple and single cases, time series, and matched case designs. It bears repeating: The goal in all of these designs is to evaluate projects by using appropriate comparisons in order to increase confidence in drawing conclusions about cause and effect.
PLAN OF THIS CHAPTER

The chapter begins with a discussion of what methodologists term “internal” and “external” validity. Internal validity is defined as “the approximate truth of inferences regarding cause-effect or causal relationships” (Trochim and Donnelly 2007:G4). The greater the internal validity, the greater the confidence one can have in the conclusions that a given project evaluation reaches. The paramount goal of evaluation design is to maximize internal validity. External validity refers to whether the conclusions of a given evaluation are likely to be applicable to other projects and thereby contribute to understanding in a general sense what works and what does not. Given that USAID implements similar projects in multiple country settings, the external validity of the findings of a given project evaluation is particularly important. This section of the chapter also stresses the importance of what the committee terms “building knowledge.”

The second part of the chapter outlines a typology of evaluation methodologies that USAID missions might apply in various circumstances to maximize their ability to assess the efficacy of their programming in the DG area. Large N randomized designs permit the most credible inferences about whether a project worked or not (i.e., the greatest internal validity). By comparison, the post-hoc assessments that are the basis of many current and past USAID evaluations provide perhaps the least reliable basis for inferences about the actual causal impact of DG assistance.
Between these two ends of the spectrum lie a number of different evaluation designs that offer increasing levels of confidence in the inferences one can make.
In describing these various evaluation options, the approach taken in this chapter is largely theoretical and academic. Evaluation strategies are compared and contrasted based on their methodological strengths and weaknesses, not their feasibility in the field. While a first step is taken at the end of the chapter in the direction of exploring whether the most rigorous evaluation design—large N randomized evaluation—is feasible for many DG projects, a more extensive treatment of this key question is reserved for the chapters that follow, when the committee presents the findings of its field studies, in which the feasibility of various impact evaluation designs is explored for current USAID DG programs with mission directors and DG staff.
Two caveats are in order. First, the committee’s emphasis on impact evaluations is not intended to diminish the need for, or imply the unimportance of, other types of M&E activities.
The committee recognizes that monitoring is vital to ensure proper use of funds and that process evaluations are important management tools for investigating the implementation and reception of DG projects. This report focuses on how to develop impact evaluations because the committee believes that at present this is the most underutilized approach in DG program evaluations and that therefore USAID has the most to gain if it is feasible to add sound and credible impact evaluations to its portfolio of M&E activities.
Second, the committee recognizes that not all projects need be, or should be, chosen for the most rigorous forms of impact evaluation.