Development, Security, and Cooperation; Policy and Global Affairs. The National Academies Press, 500 Fifth Street, N.W., Washington, DC 20001.
External validity

External validity is the generalizability of the project beyond a single case. To provide policymakers at USAID with relevant information, the results of a project evaluation should be generalizable; that is, they must be true (or plausibly true) beyond the case under study. Recall that we understand that impact evaluation (as opposed to project monitoring) will most likely be an occasional event applied to a set of the most important and most frequently used projects, not one routinely undertaken for all projects. This means that the value of the evaluation is to be found in the guidance it may offer policymakers in designing projects and allocating funds over the long term and across the whole spectrum of countries in which USAID works.
There will always be questions about how much one can generalize about the impact of a project. The fact that a project worked in one place, at one time, may or may not indicate its possible success in other places and at other times. The committee recognizes that the design of USAID projects and the allocation of funds are a learning process and the political situation and opportunities for intervention in any given country are a moving target. Even so, project officers must build on what they know, and this knowledge is largely based on the experiences of projects that are currently in operation around the world. Some projects are perceived to work well while others are perceived to work poorly or not at all. It is these general perceptions of “workability” that are the concern here.
With a number of sound impact evaluations of a specific type of project in several different settings, USAID would be able to learn more from its interventions, rather than rely solely on the experiences of individuals.
To maximize the utility of such impact evaluations, each aspect of the research design must be carefully considered. Two factors are paramount: realism in evaluation design and careful case selection.
Realism means that the evaluation of a project should conform as closely as possible to existing realities on the ground; otherwise, it is likely to be dismissed as an exercise with little utility for USAID officers in the field. “Realities” refers to the political facts at home and abroad, the structure of USAID programming, and any other contextual features that might be encountered when a project is put into operation. The committee recognizes that some factors on the ground may need to be altered in order to enhance the internal validity of a research design, a matter addressed below. Yet for purposes of external validity in the policymaking world of USAID, these factors should be kept to a minimum.
Case selection refers to how cases—activities or interventions—are chosen for evaluation. Several strategies are available, each with a slightly different purpose. However, all relate to the achievement of external validity.
The most obvious strategy is to choose a typical case, a context that is, so far as one can tell, typical of that project’s usual implementation and also one that embodies a typical instance of posited cause-and-effect relationships. Otherwise, it may be difficult to generalize from that project’s experience.
A second strategy is known as the least likely (or most difficult) case. If one is fairly confident of a project’s effectiveness, perhaps because other studies have already been conducted on that subject, confidence can be enhanced by choosing a case that would not ordinarily be considered a strong candidate for project success. If the project is successful there, it is likely to be successful anywhere (i.e., in “easier” circumstances). Alternatively, if the project fails in a least likely setting, then one has established a boundary for the population of cases to which the project may plausibly apply.
A third strategy is known as the most likely case. As implied, this kind of case is the inverse of the previous: It is one where a given intervention is believed most likely to succeed. This kind of case is generally useful only when the intervention, against all odds, is shown by a careful impact evaluation to have little or no effect (otherwise, common wisdom is confirmed). Failure in this setting may be devastating to the received wisdom, for it would have shown that even when conditions are favorable the project still does not attain its expected result.
Other strategies of case selection are available; further strategies and a more extended discussion can be found in Chapter 5 of Gerring (2007). For the purposes of project evaluation at USAID, however, these three appear likely to be the most useful.
Because of the varied contexts in which even “typical” USAID projects are implemented, it would be best to conduct impact evaluations to determine the effects of such projects in several different places. Ideally, USAID could choose a “typical” case, a least likely case, and a most likely case for evaluation to determine whether a project is having its desired impact. Even if this spread is not readily available, choosing two or three different sites to evaluate widely used projects would help address concerns about generalizability more effectively than using only a single site for an impact evaluation.
Building knowledge

It is important to keep in mind that no single evaluation is likely to be regarded as complete evidence for or against a project, nor should it be.
Regardless of how carefully an evaluation is designed, there is always the possibility of random error—factors at work in a country or some sector of a country that cannot be controlled by carefully constructed evaluation designs. More important, there is always the possibility that an intervention may work differently in one setting than it does in others. Thus the process of evaluating projects should always involve multiple evaluations of the same basic intervention. This means that strategies of evaluation must take into account the body of extant knowledge on a subject and the knowledge that may arise from future studies (supported by USAID, other agencies, or the academic community). This is the process of building knowledge. The most successful companies in the private sphere tend to be “learning organizations” that constantly build knowledge about their own activities (Senge 2006). This process may be disaggregated into four generic goals: building internal validity, building external validity, building better project design, and building new knowledge.
The first three issues may be understood as various approaches to “replication.” If USAID is concerned about the internal validity of an impact evaluation, subsequent evaluations should replicate the original research design as closely as possible. If USAID is concerned about the external validity of an evaluation, then replications should take place in different sites. If USAID is concerned with the specific features of a project, replications should alter those features while keeping other factors constant. The fourth issue departs from the goal of replication; here the goal is to unearth new insights into the process of development and the degree to which it may be affected by USAID policies. In this instance it is no longer so important to replicate features of previous evaluations.
Even so, the committee emphasizes that the important features of a research design—the treatment, the outcomes anticipated to result from the treatment, and the setting—should be standardized as much as possible across each evaluation. Doing so helps ensure that the results of the evaluation will be comparable to evaluations of similar projects, so that knowledge accumulates about that subject. If the treatments and evaluation designs change too much from evaluation to evaluation, less can be learned.
Using impact evaluations in no way reduces the need for sound judgment from experienced DG staff; detailed knowledge of the country and specific conditions is essential for creating a good impact design. More generally, there are often external events that can have consequences for an ongoing project or its evaluation. In such cases an experienced DG officer will need to appraise the effect of these events on the project’s process and outcomes. However, an appropriate mix of evaluations offers better information about projects, on the basis of which DG staff can craft new, more effective policy.
A TYPOLOGY OF IMPACT EVALUATION DESIGNS

A major goal of this chapter is to identify a reasonably comprehensive, yet also concise, typology of research designs that might be used to test the causal impact of projects supported by USAID’s DG office. Six basic research designs seem potentially applicable: (1) large N with randomized assignment of the project[1]; (2) large N comparison without randomized assignment of the project; (3) small N with randomized assignment of the project; (4) small N without randomized assignment of the project; (5) N = 1, where USAID has control of where or when the project is put in place; and (6) N = 1, where USAID has little control over where or when the project is placed. Each option is summarized in Table 5-1.
All of the research designs shown in the table share a dedicated effort to collect pre- and posttreatment measures of the policy outcomes of interest.
Hitherto, baseline measurements have been an inconsistent part of USAID evaluations (Bollen et al. 2005); although baseline data are generally supposed to be collected as part of current program monitoring, the quality may vary substantially. The absence of good baseline data makes it much more difficult to demonstrate a causal effect. No project can be adequately tested without a good measurement of the outcome of interest prior to the policy intervention. Naturally, such a measurement should be paired with a corresponding measurement of the outcome after the policy intervention. (See Chapters 6 and 7 for further discussion of appropriate measures of outcomes, with examples from the committee’s field visits.) Together, these provide pre- and posttests of the policy intervention.

[1] Randomized assignment of a treatment is often called an experiment in texts on research design.
In the large N randomized assignment design—but only in that case—it is possible to evaluate project outcomes even in the absence of baseline data, as shown, for example, in Hyde (2006), which evaluated the impact of election monitors from observed differences in the votes received by opposition parties in precincts with and without the randomly assigned monitors. However, this procedure always assumes that the intervention and control groups would show similar outcomes in the absence of any intervention. It is better, wherever possible, to check this assumption with baseline data. This is particularly important when the number of cases is modest and full randomization is not possible, since many other factors besides the intervention can affect outcomes. Even in the case of large N randomization, baseline data are often useful for checking the assumptions on which programming is based or for planning or evaluating other projects later.[2]

The six research design options are distinguishable from one another along two key dimensions: (1) the number of units (N) available for analysis and (2) USAID’s capacity to manipulate key features of the project’s design and implementation. Usually, the capacity to evaluate projects is enhanced when N is large (i.e., when there are a large number of individuals, organizations, or governments that can be compared to one another) and when the project can be implemented in a randomized way. The large N randomized intervention is thus regarded as the “gold standard” of project evaluation methods (Wholey et al. 2004). Each step away from the large N randomized design generally involves a loss in inferential power or, in other words, less confidence in the ability to make inferences about causal impact based on the results of the evaluation.
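The role of baseline data can be illustrated with a toy calculation. The sketch below uses invented numbers and hypothetical site groupings, not data from any USAID evaluation: a post-only difference in means (valid under randomization) is compared with a difference-in-differences that uses baseline measurements to net out a preexisting gap between treated and untreated sites.

```python
# Toy illustration (all numbers invented): why baseline measurements matter
# when treated and control units are not fully comparable at the outset.

def mean(xs):
    return sum(xs) / len(xs)

def difference_in_means(treated_post, control_post):
    """Post-only comparison: a sound impact estimate only under randomization."""
    return mean(treated_post) - mean(control_post)

def difference_in_differences(treated_pre, treated_post,
                              control_pre, control_post):
    """Compare each group's change over time, netting out any fixed
    preexisting gap between the groups."""
    return (mean(treated_post) - mean(treated_pre)) - (
        mean(control_post) - mean(control_pre)
    )

# Hypothetical outcome scores at three treated and three control sites.
# Treated sites start 10 points higher; both groups drift up about
# 2 points over time; the intervention adds a further 5 points to the
# treated group only.
treated_pre, treated_post = [60, 62, 58], [67, 69, 65]
control_pre, control_post = [50, 52, 48], [52, 54, 50]

naive = difference_in_means(treated_post, control_post)
did = difference_in_differences(treated_pre, treated_post,
                                control_pre, control_post)

print(naive)  # 15.0 -- inflated by the preexisting 10-point gap
print(did)    # 5.0  -- the intervention's effect net of that gap
```

Without randomization, the post-only comparison attributes the groups' preexisting difference to the project; the baseline measurements are what allow the change-over-change comparison to remove it.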
Even so, this certainly does not imply, and the committee is not argu-

[2] For examples, see the research papers on the Poverty Action Lab of MIT webpage: http://