Second, with respect to detrending the data, it is helpful if the researcher can gather information on the outcome(s) of interest and any potential confounders for the periods before and after the intervention.

The longer the period of observation, the more confident one can be about any causal inference made (Campbell 1968/1988). Thus, if the outcome factor being studied has been stable for a long time before the intervention, and other factors likely to have an impact on the outcome have been ruled out, one can have more confidence that any observed change in the trend was due to the intervention.

Third, with respect to the intervention itself, it is essential that it be discrete and significant enough to be easily observed. While subtle project effects may be detected in a large N randomized design, usually only very large effects can be confidently observed in a single-case setting.

Fourth, it is helpful if the intervention has more than one observable (and policy-significant) effect. This goes some way toward resolving the ever-present threats of measurement error and confounding causes. If, for example, a given intervention is expected to produce changes in three measurable independent outcomes, and all three factors change in the aftermath of an intervention, it is less likely that the noted association is spurious.

When the unit of concern exists only at the national level and the treatment cannot be manipulated by USAID, discerning causal effects is extraordinarily difficult. Observed differences in outcome measures pre- and posttreatment can be interpreted as causal effects only if the evaluator can make the case that other factors were not important.

Some of the strategies described above are applicable in an N = 1 comparison if the treatment can be interpreted “as if” it was manipulated (e.g., Miron 1994). Any demonstration of a large discontinuous change in an outcome of interest following the treatment increases confidence in the causal interpretation of the effect. This requires an effort to measure the outcome(s) of interest prior to, and after, the intervention.
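The logic of an N = 1 before-and-after comparison can be illustrated with a minimal sketch. It assumes a single outcome series measured at regular intervals, fits a straight-line trend to the pre-intervention observations only, and then asks how far the post-intervention observations fall from that projected trend. The data, the function names, and the choice of a linear trend are all hypothetical and are not drawn from any USAID evaluation; a real analysis would also need to address confounders and serial correlation.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit of ys = a + b * xs; returns (a, b)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


def post_intervention_gap(series, intervention_index):
    """Average gap between observed post-intervention values and the
    values projected forward from the pre-intervention trend."""
    pre_x = list(range(intervention_index))
    pre_y = series[:intervention_index]
    a, b = fit_line(pre_x, pre_y)
    gaps = [
        series[t] - (a + b * t)
        for t in range(intervention_index, len(series))
    ]
    return mean(gaps)


# Hypothetical annual outcome scores: seven stable pre-intervention
# observations followed by three post-intervention observations.
outcome = [50.0, 51.0, 50.0, 52.0, 51.0, 50.0, 51.0, 40.0, 39.0, 41.0]
gap = post_intervention_gap(outcome, intervention_index=7)
print(f"Average departure from the pre-intervention trend: {gap:.2f}")
```

A large gap relative to the year-to-year variation in the pre-intervention period is the kind of discontinuous change described above; a short or unstable pre-intervention series would make the projected trend, and therefore the gap, much less trustworthy.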

In some cases it may be possible to identify units for comparison within the country or outside the country, in order to rule out obvious temporal confounds. Take the example of an anticorruption effort funded in a specific ministry. If it can be shown that corruption levels remained unchanged in untreated ministries while shifting dramatically in a treated ministry, we gain confidence that a government-wide anticorruption effort cannot account for the effects observed in the treated ministry. But the possibility cannot be ruled out that other developments in the treated ministry (such as good leadership) are more important than the intervention in accounting for the outcome. Or take the example of a national anticorruption effort that is rolled out in one country but not in adjacent countries or at different times in adjacent countries. Changes in outcome variables in the other countries could be tracked to seek the effects of the program; if reductions in corruption occur to a greater degree, or in a timed sequence that corresponds to the timing of roll-outs in different countries, one can have confidence that it is not regional or global trends that were driving the reductions in corruption. On the other hand, as in the previous example, the possibility could not be ruled out that other factors, such as freer media or stronger leadership, were the key causal factors in reducing corruption rather than the specific USAID project, unless there were also measures of those possible confounding factors.
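The ministry comparison can be expressed as a simple difference-in-differences calculation: the change in the treated unit minus the change in the comparison units over the same period. The sketch below is illustrative only. The corruption index values and the function name are hypothetical, and the calculation removes only those trends shared by treated and untreated units; it does nothing about developments specific to the treated ministry, such as a change in leadership.

```python
from statistics import mean


def diff_in_diff(treated_pre, treated_post, comparison_pre, comparison_post):
    """Change in the treated unit minus change in the comparison units."""
    treated_change = mean(treated_post) - mean(treated_pre)
    comparison_change = mean(comparison_post) - mean(comparison_pre)
    return treated_change - comparison_change


# Hypothetical corruption index (higher values mean more corruption),
# measured three times before and three times after the project.
treated_pre = [62.0, 60.0, 61.0]       # treated ministry, before
treated_post = [45.0, 44.0, 46.0]      # treated ministry, after
comparison_pre = [58.0, 59.0, 60.0]    # untreated ministries, before
comparison_post = [57.0, 58.0, 59.0]   # untreated ministries, after

estimate = diff_in_diff(treated_pre, treated_post,
                        comparison_pre, comparison_post)
print(f"Difference-in-differences estimate: {estimate:.1f}")
```

In this invented example the treated ministry improves by 16 points while the untreated ministries improve by 1 point, so 15 points of the change cannot be attributed to a government-wide trend.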

Not all USAID DG programs need to be subjected to rigorous impact evaluation. For example, if USAID is working to help a country pass a new constitution with certain human rights provisions, and several other NGOs and foreign countries are also working to that end, it may not matter how much USAID’s specific activities contributed to a successful outcome; success is what matters and credit can be shared among all who contributed. (On the other hand, a subsequent impact evaluation of whether the new constitution actually resulted in an improvement in human rights—an N = 1 comparison designed to plot changes in human rights violations over time and look for sharp reductions following adoption of the new constitution—may be worthwhile.) In particular, the random assignment mode of impact evaluation is probably best used only where the fair assignment of assistance naturally results in a randomized assignment of aid or where USAID uses a project in so many places, or invests so much in a project, that it is of great importance to be confident of that project’s effectiveness. In most settings, worthwhile insights into project impacts can be derived from designs that include small N comparisons, as long as good baseline, outcome, and comparison group data are collected.




Randomized designs have a high degree of internal validity. By permitting a comparison of outcomes in a treatment group and a control group that can be considered identical to one another, they do a better job than any other evaluation technique of permitting evaluators to identify the impact of a given intervention. It is no surprise, therefore, that randomized evaluation is the industry standard for the assessment of new medications. It is inconceivable that a pharmaceutical company would be permitted to introduce a new medication into the market unless evidence from a randomized evaluation proved its benefits. Yet as discussed in Chapter 2, for the assessment of DG assistance programs, impact evaluations have rarely been employed. This leaves USAID in the difficult position of spending hundreds of millions of dollars on assistance programs without proven effects.
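The core mechanics of a randomized design can be sketched in a few lines: units are assigned to treatment or control by chance, and the impact estimate is the difference in mean outcomes between the two groups. Everything in the sketch below is hypothetical, including the village names, the baseline scores, the assumed true effect of 5 points, and the random seed; it is meant only to show why chance assignment makes the two groups comparable on average.

```python
import random
from statistics import mean


def assign_randomly(units, seed):
    """Shuffle the units and split them into equal-sized groups."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def difference_in_means(outcomes, treatment, control):
    """Mean outcome among treated units minus mean among control units."""
    treated_mean = mean(outcomes[u] for u in treatment)
    control_mean = mean(outcomes[u] for u in control)
    return treated_mean - control_mean


# Hypothetical villages with differing baseline scores (50, 51, or 52).
villages = [f"village_{i}" for i in range(20)]
baseline = {v: 50.0 + (i % 3) for i, v in enumerate(villages)}

treatment, control = assign_randomly(villages, seed=2008)

# Assumed true effect of the project: 5 points for every treated village.
treated_set = set(treatment)
outcomes = {v: baseline[v] + (5.0 if v in treated_set else 0.0)
            for v in villages}

effect = difference_in_means(outcomes, treatment, control)
print(f"Estimated effect: {effect:.2f}")
```

Because assignment is unrelated to the baseline scores, the estimate differs from the assumed effect only by whatever chance imbalance in baselines remains between the two groups, and that imbalance shrinks as the number of units grows.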


A small but important number of large N randomized impact evaluations have been carried out to test the effects of assistance programs. Classic evaluations, such as the RAND health insurance study and the evaluation of the Job Training Partnership Act (JTPA), stand out as exemplars of large-scale assessments of social assistance programs (Wilson 1998, Gueron and Hamilton 2002, Newhouse 2004). A few have been done in developing countries; the evaluation of Mexico’s conditional cash transfer program, Progresa/Oportunidades, continues to shape the design of similar programs in other contexts (Morley and Coady 2003).

The number of such evaluations is growing. In fields as diverse as public health, education, microfinance, and agricultural development, randomized evaluations are increasingly employed to assess project effectiveness. Examples abound in the field of public health: Studies have assessed the efficacy of male circumcision in combating HIV (Auvert et al 2005), the impact of HIV prevention programs on sexual behavior (Dupas 2007), the effectiveness of bed nets for reducing the incidence of malaria (Nevill et al 1996), the impact of deworming drugs on health and educational achievement (Miguel and Kremer 2004), and the role of investments in clean water technologies on health outcomes (Kremer et al 2006). In education, randomized evaluations have been used to explore the efficacy of conditional cash transfers (Schultz 2004), school meals (Vermeersch and Kremer 2004), and school uniforms and textbooks (Kremer 2003) on school enrollment; the effectiveness of additional inputs, such as teacher aids, on school performance (Banerjee and Kremer 2002); and the impact of school reforms, such as voucher programs, on academic achievement (Angrist et al 2006). In microfinance, attention has focused on the impact of programs on household welfare (Murdoch 2005); randomized evaluations in agricultural development are exploring the benefits and impediments to the adoption of new technologies, such as hybrid seeds and fertilizer (Duflo et al 2006a).

Thus far, however, these approaches have not been applied to the evaluation of DG programs. A significant part of the explanation for this is that it is often more difficult to measure outcomes in the area of democratic governance. Most successful randomized evaluations have been conducted in areas such as health and education, where it is much more straightforward to measure outcomes. For example, the presence of intestinal parasites can be measured quite easily and accurately via stool samples (as in Miguel and Kremer 2004); water quality can be assessed via a test for E. coli content (as in Kremer et al 2006); nutritional improvements can be traced quite readily via height and weight measures; school performance or learning can be tracked easily via test scores (as in Banerjee et al 2007); and teacher absenteeism can be measured with attendance records (as in Banerjee and Duflo 2006). Developing valid and reliable measures of the outcomes targeted by DG programs is much more difficult and stands as an important challenge for project evaluation in this area. The challenge is not insurmountable; there have been tremendous improvements over the past decade in the measurement of political participation and attitudes (Bratton et al 2005), social capital and trust (Grootaert et al 2004), and corruption (Bertrand et al 2007, Olken 2007). And as discussed in Chapter 2, USAID has made significant efforts to develop outcome indicators to support its project M&E work.

This chapter closes with two examples of impact evaluations using randomized designs applied to DG subjects that tested commonly held programming assumptions. The first addresses the issue of corruption.

USAID invests significant resources every year in anticorruption initiatives, but questions remain about the efficacy of such investments. Which programs yield the biggest impact in terms of reducing corruption? Some have argued that corruption can be reduced with the right combination of monitoring and incentives provided from above (Becker and Stigler 1974). Of course, the challenge with top-down monitoring is that higher level officials may themselves be corruptible. An alternative approach has emphasized local-level monitoring (World Bank 2004). The argument is that community members have the strongest incentives to police the behavior of local officials, as they stand to benefit the most from local public goods provision. Yet this strategy also has its drawbacks: Individuals may not want to bear the costs of providing oversight, preferring to leave that to others, or community members may be easily bought off by those engaged in corrupt practices. Which strategy most effectively reduces corruption?

Olken (2007) set out to answer this question in Indonesia through a unique partnership with the World Bank. As a nationwide village-level road-building project was rolled out, Olken randomly selected one set of villages to be subject to an external audit by the central government, a second set in which extensive efforts were made to mobilize villagers to participate in oversight and accountability meetings, a third set in which the accountability meetings were complemented by an anonymous mechanism for raising complaints about corruption in the project, and a fourth set reserved as a control group. To measure the efficacy of these different strategies, Olken constructed a direct measure of corruption: He assembled a team of engineers and surveyors who, after the projects were completed, dug core samples in each road to estimate the quantity of materials used, interviewed villagers to determine the wages paid, and surveyed suppliers to estimate local prices to construct an independent estimate of the cost of the project. The difference between the reported expenditures by the village and this independent estimate provides a direct measure of corruption. His findings strongly suggest the efficacy of 


