As stressed above, all sound and credible impact evaluation designs share three characteristics: (1) they collect reliable and valid measures of the outcome that the project is designed to affect, (2) they collect such outcome measures both before and after the project is implemented, and (3) they compare outcomes in both the units that are treated and an appropriately selected set of units that are not. As long as the number of units (N) to be treated is greater than one, all three of these attributes of impact evaluation are possible. The major difference between randomized evaluations and other methodologies lies in the degree to which project designers need to concern themselves with the number and selection of control units. In a randomized evaluation the law of large numbers does the job of ensuring that the treatment and control groups will be (within the limits of statistical significance) identical across all the factors that might affect the project impacts being measured. When random assignment is not possible, project designers must pay close attention to the factors that might be associated with inclusion in the control or treatment groups—what social scientists refer to as “selection bias”—and the effects of those factors on the differences found between the control and treatment units. These are the approaches referred to as large N and small N comparisons in Chapter 5.

Aside from the fact that the implementer does not select treated units at random, the examples described below are very similar to the randomized designs. In particular, they share the key characteristics that reliable and valid measures of project outcomes still must be collected both before and after project implementation and for treatment and comparison groups. As with randomized designs, the discussion proceeds by providing examples of best practices. All four examples highlight the importance of finding an appropriate way to identify a control group, while the latter two also emphasize creative ways to improve measurement.

National “Barometer” Surveys as a Means to Design Impact Evaluations for Localized USAID Project Interventions

For a variety of reasons, USAID often implements programs at a subnational level, applying its efforts in a selected set of municipalities or departments or regions. Often the selection of these regions is determined by programmatic considerations. For example, USAID might determine that it wants to focus its resources on the poorest areas of the country or on areas that have suffered the most from civil conflicts or have been hit with natural disasters. In other cases USAID decides to focus on municipalities or regions that look the most promising for the success of a particular intervention. In still other cases, USAID engages with other donors to “divide up the pie” with, for example, the European Union agreeing to work in the north while USAID works in the south. Finally, there may be entirely idiosyncratic reasons for the choice of where to work (and where not to work) related to the preferences of individual host governments or implementers.

In each of these subnational projects the principle of randomized selection is violated, and the possible confounding effect of “selection bias” becomes an important factor in designing an impact evaluation. The nonrandom selection may bias the measured impact so that, ceteris paribus, the results may be better or worse than they would have been had randomization been used to select the treatment area. It is impossible to know beforehand exactly what to expect. The point is that those who wish to study impact will worry that selection bias by itself, rather than the project, could be responsible for any measured “impact.”

Consider a project carried out in an exceptionally poor area. One possible outcome is that the area is so poor, and conditions so grim, that short of extraordinary investment, citizens will not really notice a difference. Similarly, in a post-civil war setting, feelings of hatred and distrust may be so deeply ingrained that project investments will be ignored entirely. In these cases, even though the project may have been designed well, any impact is imperceptible. On the other hand, in both cases, the very low starting point suggests (as noted in the Peru example below) that a “regression to the mean” is inevitable and therefore improvements will occur with or without the project intervention. In such a case a positive impact might mistakenly be attributed to the project when, in fact, the gains are occurring for reasons entirely unrelated to the inputs.
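The regression-to-the-mean concern can be illustrated with a small simulation. The sketch below is only an illustration under invented assumptions: the score scale, the noise levels, the number of units, and the function name `simulate_regression_to_mean` are all hypothetical and are not drawn from any USAID data. It generates two survey waves with no project effect at all, selects the units that scored lowest at baseline, and compares their averages across the two waves.

```python
import random
from statistics import mean


def simulate_regression_to_mean(n_units=1000, bottom_share=0.10, seed=0):
    """Simulate two survey waves with NO project effect at all.

    Each unit has a stable true score; each wave adds independent
    measurement noise. Units are "selected" because they scored lowest
    at baseline, mimicking a project targeted at the worst-off areas.
    All numeric settings here are hypothetical.
    """
    rng = random.Random(seed)
    true_scores = [rng.gauss(50, 5) for _ in range(n_units)]
    baseline = [t + rng.gauss(0, 10) for t in true_scores]
    followup = [t + rng.gauss(0, 10) for t in true_scores]

    # Pick the units with the lowest measured baseline scores.
    n_selected = int(n_units * bottom_share)
    selected = sorted(range(n_units), key=lambda i: baseline[i])[:n_selected]

    baseline_mean = mean(baseline[i] for i in selected)
    followup_mean = mean(followup[i] for i in selected)
    return baseline_mean, followup_mean


baseline_mean, followup_mean = simulate_regression_to_mean()
print(f"selected units at baseline:  {baseline_mean:.1f}")
print(f"selected units at follow-up: {followup_mean:.1f}")
print(f"apparent 'gain' with no project: {followup_mean - baseline_mean:.1f}")
```

Because the selected units were chosen partly on the basis of unusually low measurement noise at baseline, their follow-up average drifts back toward the overall mean even though nothing was done to them. A before-and-after comparison of the project area alone would read that drift as project impact.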

When randomization is not possible, but selection of multiple treatment and control areas is, conditions are ideal for the “second-best” method of large N nonrandomized designs. This sort of design is often referred to as “difference in difference” (DD; Bertrand et al. 2004). The objection to this approach, however, is that USAID would be spending its limited resources to study regions or groups in which it does not have projects and may not plan to have any. The committee believes that this entirely understandable (indeed compelling) objection alone keeps many DG programs and project implementers from considering a design that would be seen as “wasting” money and effort on studies in areas where USAID is not working.

The committee believes that USAID already possesses the ability to overcome this problem of “wasting money” on seemingly irrelevant controls without significant additional investment of resources. The agency’s Latin American and Caribbean bureau, for example, is already applying this methodology in a limited number of its projects.4 The approach to reduce (but not eliminate) the risks of potentially misleading conclusions is to utilize the increasingly prevalent public opinion surveys being carried out in Latin America, Africa, and Asia, collectively known as the Barometer surveys. High-quality nationally representative surveys are regularly being carried out by consortia of universities and research institutions, many with the assistance of USAID but also with the support of other donors, such as the Inter-American Development Bank, the United Nations Development Program, the European Union, and local universities in the United States and abroad. These surveys provide fairly precise and reliable estimates of the “state of democracy” at the grassroots level by producing a wide variety of indicators. For example, the surveys reveal the frequency and nature of corruption, victimization, and the level of citizen participation in local government, civil society, and the judicial process. They also produce measures of satisfaction with institutions such as town councils, regional administrations, the national legislature, courts, and political parties and the willingness of citizens to support key democratic principles such as majority rule and tolerance for minority rights.

These surveys also allow for disaggregation by factors such as gender, level of urbanization, region, and age cohort.

Given that investments are already being made in the Barometer surveys, they provide a “natural” and no-added-cost control group for studies of project impact. They provide, in effect, a picture of the “state of the nation” against which special project areas can be measured. In other words, USAID would continue to gather baseline and follow-up surveys in its project towns, municipalities, or regions and thus concentrate its limited funds on collecting detailed impact data for the places or institutions in which it is carrying out its projects. It would not need to carry out interviews of control groups in areas where it does not have ongoing projects. The national-level control group, however, could be used to show differences between the nation and the project areas in terms of not only poverty, degree of urbanization, and so forth but also many of the project impact measurements that USAID requires to determine project success or failure. For example, if a project goal is to increase participation of rural women in local government, comparisons could be made between the project-area baseline and the national averages, and then, following the DD logic, comparisons would be made over time as the project impact is supposed to be occurring.

4 The committee believes, but was unable to document, that this method has been utilized
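The DD logic described above reduces to simple arithmetic: the change observed in the project area minus the change observed in the national sample over the same period. The following is a minimal sketch of that calculation. The function name `difference_in_differences` and all of the percentages are hypothetical, chosen only to mirror the rural women's participation example; they are not survey results.

```python
def difference_in_differences(project_before, project_after,
                              national_before, national_after):
    """Change in the project area minus change in the national sample."""
    project_change = project_after - project_before
    national_change = national_after - national_before
    return project_change - national_change


# Hypothetical figures: percent of rural women who report attending
# a local government meeting, at baseline and at follow-up.
estimate = difference_in_differences(
    project_before=10.0, project_after=18.0,
    national_before=15.0, national_after=17.0,
)
print(f"project-area change net of national trend: {estimate:.1f} points")
```

In this invented case the project area improves by 8 points while the nation as a whole improves by 2, so only the remaining 6 points are a candidate for project impact. The estimate is credible only to the extent that the project area would otherwise have followed the national trend, which is exactly the assumption that nonrandom site selection can undermine.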

There are several recent examples to illustrate this. For many years USAID focused a considerable component of its DG projects in Guatemala on institution building at the national level, especially the legislature. Surveys carried out by the Latin American Public Opinion Project as part of its Americas Barometer studies found a deep distrust in those institutions, despite years of effort and investment. They also found special problems in the highland indigenous areas. In part as a result of those surveys, the DG programs in Guatemala began to shift, focusing more on citizens and less on institutions. As part of that strategy, national samples were drawn every two years, along with targeted special samples (what USAID calls “oversamples”) in the highland indigenous municipalities.

A finding from those surveys was the low level of political participation among some sectors of the population. In 2006 those surveys were used to focus the “get out the vote” campaign for the 2007 election, a critical one in which a former military officer was a leading candidate.

In Ecuador a series of specialized samples have been drawn in specific municipalities, with the results being systematically compared to national samples, drawn every two years since 2001. CARE, in cooperation with the International Migration Organization, has been working in a series of municipalities along the border with Colombia, a region in which the possible spread of narco-guerrilla activities could have an adverse impact on Ecuador. Thus the municipalities were not selected at random, but national-level survey data have allowed for comparison of starting levels, so that those implementing the project would have far more than anecdotal information about the level of citizen participation in and satisfaction with local government. The survey data also allow for comparisons over time to see if trends in the project areas are more favorable than in the nation as a whole. Similar efforts have taken and are taking place in Honduras, Nicaragua, Colombia, Peru, and Bolivia.

Surveys have also increasingly been used to measure the impact of anticorruption programs, in some cases by comparing “before” and “after” impacts on a specific sector (e.g., health in Colombia) and in other cases comparing the results for the nation as a whole before and after implementation of an anticorruption program (Seligson 2002, 2006; Seligson and Recanatini 2003). The most recent survey of citizen perceptions of and experience with corruption, supported by USAID/Albania, was released while the committee’s team was in Albania (Institute for Development Research Alternatives 2007).

For this approach to be successful, national surveys, as well as specific surveys carried out in project areas, need to be at least minimally coordinated so that the questions asked in both are identical. It is well known that small differences in question wording or scaling can substantially affect the pattern of responses. If, for example, local government participation is an impact objective of the mission, problems will arise if the national sample asks respondents whether they have attended a local government meeting and the project sample asks how many times in the past 12 months they attended a local government meeting, because the two sets of answers cannot be compared directly.

There are two potential objections to this approach. The first is cost.

