Development, Security, and Cooperation; Policy and Global Affairs. The National Academies Press, 500 Fifth Street, N.W., Washington, DC 20001.
Doing so would likely impose an unacceptably high cost on USAID’s DG programming. The committee is therefore recommending that such evaluations initially only be undertaken for a select few of USAID’s DG programs, a recommendation emphasized in Chapter 9. The committee does believe, however, that DG officers should be aware of the potential value of obtaining baseline and comparison group information for projects to which they attach great importance, so that they can better decide how to develop the mix of M&E efforts across the various projects that they oversee.
Third, before beginning the task of evaluating a project, precisely what is to be evaluated must be defined. Evaluating a project requires the identification of the specific intervention and a set of relevant and measurable outcomes thought to result from that policy intervention. Even this apparently simple task can pose challenges, since most DG programs are complex (compound) interventions, often combining several activities (e.g., advice, formal training, monetary incentives) and are often expected to produce several desired outcomes. A project focused on the judiciary, for example, may include a range of different activities intended to bolster the independence and efficiency of the judiciary in a country and might be expected to produce a variety of outcomes, including swifter processing of cases, greater impartiality among plaintiffs and defendants, greater conformity to statutes or precedents, and greater independence vis-à-vis the executive. The evaluator must therefore decide whether to test the whole project or parts of the project or whether it would make sense, as discussed further below, to reconfigure the project to allow for clearer impact evaluation of specific interventions.
As USAID’s primary focus will always be on program implementation, rather than evaluation per se, evaluators will need to respond to the challenges posed by often ambitious and multitasked programs.
At this point, a note on terminology is required. As noted above, an “activity” is defined as the most basic sort of action taken in the field, such as a training camp, a conference, advice rendered, money tendered, and so forth. A “project” is understood to be an aggregation of activities, including all those mentioned in specific USAID contracts with implementers, such as in requests for proposals and in subsequent documents produced in connection with these projects. A project can also be referred to as an “intervention” or “treatment.” The question of what constitutes an appropriate intervention is a critical issue faced by all attempts at project evaluation. A number of factors impinge on this decision.
Lumping activities within a given project together for evaluation often makes sense. If all parts of a program are expected to contribute to common outcomes, and especially if the bundled activities will have a stronger and more readily observed outcome than the separate parts, then treating the set of activities together as a single intervention may be the best way to proceed.
In other cases, trying to separate the various activities and measuring their impact may be preferable. The value of disaggregation seems clear from the standpoint of impact evaluation. After all, if only one part of a five-part program is in fact producing 90 percent of the observed results, this would be good to know, so that support can be focused on that one part. But whether such a separation seems worth testing really depends on whether it is viable to offer certain parts of a project and not others. Sometimes it is possible to test both aggregated and disaggregated components of a project in a single research design. This requires a sufficient number of cases to allow for multiple treatment groups. For example, Group A could receive one part of a program, Group B could receive two parts of a program, Group C could receive three parts of a program, and another group would be required as a control. In this example, three discrete interventions and their combination could be evaluated simultaneously.
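The multi-arm design described above can be sketched in a few lines of code. The following is an illustrative sketch only: the arm labels and village names are hypothetical, and real assignment protocols would involve stratification and other refinements not shown here.

```python
import random

def assign_arms(units, arms=("control", "A", "A+B", "A+B+C"), seed=42):
    """Randomly assign units to a control arm and three cumulative
    treatment arms, so that each added component's marginal effect
    can later be compared across groups. Hypothetical sketch."""
    rng = random.Random(seed)
    shuffled = units[:]
    rng.shuffle(shuffled)          # randomize order before assignment
    assignment = {}
    for i, unit in enumerate(shuffled):
        assignment[unit] = arms[i % len(arms)]  # round-robin over arms
    return assignment

# Hypothetical units of observation, e.g., villages in a program area
villages = [f"village_{i}" for i in range(20)]
groups = assign_arms(villages)     # 5 villages per arm
```

Round-robin assignment after shuffling guarantees equal arm sizes; with a large enough number of units, comparing outcomes across arms isolates the contribution of each program component.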
Many additional factors may impinge on the crafting of an appropriate design for impact evaluation of a particular intervention. These are reviewed in detail in the subsequent section. The committee understands that there is no magic formula for deciding when an impact evaluation might be desirable or which design is the best trade-off in terms of costs, need for information, and policy demands. What is clear, however, is that since impact evaluations are, in effect, tests of the hypothesis that a given intervention will create different outcomes than would be observed in the absence of that intervention, how well one specifies that hypothesis greatly influences what one will find at the end of the day. The question asked determines the sort of answers that can be received. The committee wants to flag this as a critical issue for USAID policymakers and project implementers to consider; further suggestions are given in Chapters 8 and 9 for how this could be addressed as part of an overall Strategic and
Operational Research Agenda project for learning about DG program effectiveness to guide policy programming.
Internal Validity

A sound and credible impact evaluation has one primary goal: to determine the impact of a particular project in a particular place at a particular time. This is usually understood as a question of internal validity.
In a given instance, what causal effect did a specific policy intervention, X, have on a specific outcome, Y? This question may be rephrased as: If X were removed or altered, would Y have changed?
Note that the only way to answer this question with complete certainty is to go back in time to replay history without the project (called “the counterfactual”). Since that cannot be done, we try to come as close as possible to the “time machine” by holding constant any background features that might affect Y (the ceteris paribus conditions) while altering X, the intervention of interest. We thus replay the scenario under slightly different circumstances, observing the result (Y).
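The logic of substituting a control group for the unobservable counterfactual can be made concrete with a toy simulation. All numbers below are illustrative assumptions, not drawn from any actual DG program.

```python
import random

def simulate(n=1000, true_effect=5.0, seed=0):
    """Toy potential-outcomes simulation. Each unit has an unobserved
    baseline; we can observe its outcome Y either with OR without the
    intervention X, never both (the missing counterfactual). Randomly
    splitting units lets the control group's average stand in for the
    treated group's counterfactual. Illustrative parameters only."""
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        baseline = rng.gauss(50, 10)   # background features (ceteris paribus)
        if rng.random() < 0.5:
            treated.append(baseline + true_effect)  # Y with intervention X
        else:
            control.append(baseline)                # Y without intervention X
    # Difference in group means estimates the causal effect of X on Y
    return sum(treated) / len(treated) - sum(control) / len(control)

effect = simulate()  # close to the assumed true effect for large n
```

Because assignment is random, the two groups differ, on average, only in whether they received X, so the difference in means recovers the effect that the impossible "time machine" comparison would have revealed.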
It is in determining how best to simulate this counterfactual situation of replaying history without the intervention that the craft of evaluation design comes into play. Indeed, a large literature within the social sciences is devoted to this question—often characterized as a question of causal assessment or research design (e.g., Shadish et al. 2002, Bloom 2005, Duflo et al. 2006b). The following section attempts to reduce this complicated set of issues to a few key ingredients, recognizing that many issues can be treated only superficially.
Consider that certain persistent features of research design may assist us in reaching conclusions about whether X really did cause Y: (1) interventions that are simple, strong, discrete, and measurable; (2) outcomes that are measurable, precise, determinate, immediate, and multiple; (3) a large sample of cases; (4) spatial equivalence between treatment and control groups; and (5) temporal equivalence between pre- and posttests.
Each of these is discussed in turn.
1. The intervention: discrete, with immediate causal effects, measurable. A discrete intervention that registers immediate causal effects is easier to test because only one pre- and posttest is necessary (perhaps only a posttest if there is a control group and trends are stable or easily neutralized by the control). That is, information about the desired outcome is collected before and after the intervention. By contrast, an intervention that takes place gradually, or has only long-term effects, is more difficult to test. A measurable intervention is, of course, easier to test than one that is resistant to operationalization (i.e., must be studied through proxies or impressionistic qualitative analysis).
2. The outcome(s): measurable, precise, determinate, and multiple.
The best research designs feature outcomes that are easily observed, that can be readily measured, where the predictions of the hypotheses guiding the intervention are precise and determinate (rather than ambiguous), and where there are multiple outcomes that the theory predicts, some of which may pertain to causal processes rather than final policy outcomes. The latter is important because it provides researchers with further evidence by which to test (confirm or disconfirm) the underlying hypothesis linking the intervention to the outcome and to elucidate its causal mechanisms.
3. Large sample size. N refers here to the number of cases that are available for study in a given setting (i.e., the sample size). A larger N means that one can glean more accurate knowledge about the effectiveness of the intervention, all other things being equal. Of course, the cases within the sample must be similar enough to one another to be compared;
that is, the posited causal relationship must exist in roughly the same form for all cases in the sample, or any dissimilarities must be amenable to post hoc modeling. Among the questions to be addressed are: How large is the N? How similar are the units (cases) in respects that might affect the posited causal relationships? If dissimilar, can these heterogeneous elements be neutralized by some feature of the research design (see below)?
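The value of a larger N can be illustrated by simulating how the error of an estimated treatment effect shrinks as the number of cases grows (roughly in proportion to one over the square root of N). The effect size, noise level, and case counts below are assumptions chosen for illustration.

```python
import random

def average_error(n, true_effect=2.0, trials=200, seed=1):
    """Repeatedly estimate a treatment effect from samples of size n
    per group and report the average absolute estimation error.
    Illustrative parameters; outcomes are noisy draws around the
    assumed true effect."""
    rng = random.Random(seed)
    errors = []
    for _ in range(trials):
        treated = [rng.gauss(true_effect, 5) for _ in range(n)]
        control = [rng.gauss(0.0, 5) for _ in range(n)]
        estimate = sum(treated) / n - sum(control) / n
        errors.append(abs(estimate - true_effect))
    return sum(errors) / trials

small_n_error = average_error(n=25)
large_n_error = average_error(n=400)  # roughly 4x smaller, since sqrt(400/25) = 4
```

The simulation makes the "all other things being equal" caveat visible as well: both groups are drawn from the same noise distribution, which is precisely the comparability condition the text describes.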
4. Spatial equivalence (between treatment and control groups). Pure spatial comparisons rely on controls that mirror the treatment group in all ways that might affect the posited causal relationship. The easiest way to achieve equivalence between these two groups is to choose cases randomly from the population. Sometimes nonrandomized selection procedures can be devised, or exist naturally, that provide equivalence, but this is relatively rare. The key question to ask is always: How similar are the treatment and control groups in ways that might affect the intended outcome? This is often referred to as “pretreatment equivalence.” Other important questions include: Can the treatment cases be chosen randomly, or through some process that approximates random selection? Can the equivalence initially present at the point of intervention between treatment and control groups be maintained over the life of the study (i.e., over whatever time is relevant to observe the putative causal effects)? This may be referred to as “posttreatment equivalence.”
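Randomization followed by a balance check on an observed background characteristic is a standard way to assess pretreatment equivalence. The sketch below is hypothetical: the cases and the covariate are stand-ins for whatever background measure is thought to affect the outcome.

```python
import random

def randomize_and_check(cases, covariate, seed=7):
    """Randomly split cases into treatment and control groups, then
    check pretreatment equivalence by comparing group means on an
    observed background covariate. Hypothetical sketch: 'covariate'
    maps each case to any pretreatment measure of interest."""
    rng = random.Random(seed)
    shuffled = cases[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    treatment, control = shuffled[:half], shuffled[half:]
    t_mean = sum(covariate[c] for c in treatment) / len(treatment)
    c_mean = sum(covariate[c] for c in control) / len(control)
    return treatment, control, t_mean - c_mean   # imbalance on the covariate

cases = list(range(100))
# Hypothetical pretreatment covariate values for each case
covariate = {c: random.Random(c).gauss(0, 1) for c in cases}
treatment, control, imbalance = randomize_and_check(cases, covariate)
```

A small imbalance suggests the groups are comparable at the point of intervention; maintaining that comparability over the life of the study (posttreatment equivalence) requires tracking attrition and contamination, which this sketch does not model.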
5. Temporal equivalence (between pre- and posttests). Causal attribution works by comparing spatially and/or temporally. This is usually done through pre- and posttreatment tests (i.e., measurements of the outcome before and after the intervention), creating two groups, the preintervention group and the postintervention group. Of course, it is the same case, or set of cases, observed at two points in time. However, such comparisons (in isolation from spatial controls) are useful only when the case(s) are equivalent in all respects that might affect the outcome (except, of course, for the treatment itself). More specifically, this means that (1) the effects of the intervention on the case(s) are not obscured by confounders, which are other factors occurring at roughly the same time as the intervention that might affect the outcome, and (2) the outcome under investigation either is stable or has a stable trend (so that the effect of the intervention, if any, can be observed). Note that when there is a good spatial control these issues are less important. By contrast, when there is no spatial control, they become absolutely essential to the task of causal attribution. For temporal control the key questions to ask are: Are comparable pre- and posttests possible? Is it possible to collect data for a longer period of time so that, rather than just two data points, one can construct a longer time series? Are there trends in the outcome that must be taken into account? If trends are present, are they fairly stable? Can we anticipate that this stability will be retained over the course of the research (in the absence of any intervention)? Is the intervention correlated (temporally) with other changes that might obscure causal attribution?
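When a spatial control is available alongside pre- and posttests, a stable background trend can be subtracted out directly. The following difference-in-differences sketch uses illustrative numbers and assumes, as the method requires, that the two groups would have followed parallel trends in the absence of the intervention.

```python
def diff_in_diff(treat_pre, treat_post, control_pre, control_post):
    """Difference-in-differences: the control group's pre/post change
    estimates the secular trend, which is subtracted from the treated
    group's change to isolate the intervention's effect. Valid only
    under the parallel-trends assumption noted above."""
    trend = control_post - control_pre    # background change (confounders, drift)
    raw_change = treat_post - treat_pre   # trend plus treatment effect
    return raw_change - trend

# Illustrative numbers: both groups drift upward by 3; treatment adds 4 more.
effect = diff_in_diff(treat_pre=10, treat_post=17,
                      control_pre=12, control_post=15)
# effect == (17 - 10) - (15 - 12) == 4
```

This makes concrete why the questions about trends matter: a simple pre/post comparison of the treated group alone would report 7 and misattribute the background drift to the intervention.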