# Robert Rohde1, Judith Curry2, Donald Groom3, Robert Jacobsen3,4, Richard A. Muller1,3,4, Saul Perlmutter3,4, Arthur Rosenfeld3,4, Charlotte Wickham5, ...

There are in general two kinds of evidence that can lead to an expectation of a discontinuity in the data. The first is “metadata”, such as documented station moves or instrumentation changes. For the current paper, the only “metadata” cut we use is based on gaps in the record; if a station failed to report temperature data for a year or more, then we consider that gap as evidence of a change in station conditions and break the time series into separate records on either side of the gap. In the future, we will extend the use of the scalpel to processes such as station moves and instrumentation changes; however, the analysis presented below is based on the GHCN dataset, which does not provide the necessary metadata to make those cuts.
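The gap-based scalpel can be sketched in a few lines; the record format and function name here are illustrative, as the paper does not specify an implementation:

```python
def scalpel_gap_cut(record, max_gap_months=12):
    """Split a station record wherever reporting lapses for a year or more.

    `record` is a chronologically sorted list of (year, month, value)
    tuples.  The record format and function name are illustrative; the
    paper does not specify an implementation.
    """
    if not record:
        return []

    def month_index(year, month):
        return year * 12 + (month - 1)

    fragments, current = [], [record[0]]
    for prev, cur in zip(record, record[1:]):
        gap = month_index(cur[0], cur[1]) - month_index(prev[0], prev[1])
        if gap >= max_gap_months:
            # A gap of a year or more: treat the two sides as separate records.
            fragments.append(current)
            current = []
        current.append(cur)
    fragments.append(current)
    return fragments
```

Each resulting fragment is then treated as an independent record in the subsequent averaging.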

The second kind of evidence requiring a breakpoint is an apparent shift in the statistical properties of the data itself (e.g. mean, variance) when compared to neighboring time series that are expected to be highly correlated. When such a shift is detected, we can divide the data at that time, making what we call an “empirical breakpoint”. The detection of empirical breakpoints is a well-developed field in statistics (Page 1955; Tsay 1991; Hinkley 1971; Davis 2006), though relatively little work has been done on the case where spatially correlated data are widely available. As a result, the existing groups have each developed their own approach to empirical change point detection (Menne and Williams 2009; Jones and Moberg 2003; Hansen et al. 1999).

In the present paper, we use a simple empirical criterion that is not intended to be a complete study of the issue. Like prior work, the present criterion must be applied prior to any averaging.

In principle, change point detection could be incorporated into an iterative averaging process that uses the immediately preceding average to help determine a set of breakpoints for the next iteration; however, no such work has been done to date. For the present paper, we follow NOAA in considering the neighborhood of each station and identifying the most highly correlated adjacent stations. A local reference series is then constructed as a weighted average of the neighboring stations. This reference is compared to the station’s record, and a breakpoint is introduced wherever there is an abrupt shift in the mean larger than 4 standard deviations.
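A minimal sketch of this empirical criterion, assuming the neighbor-based reference series has already been constructed elsewhere (the function name, window length, and noise estimate are illustrative choices, not the paper's):

```python
import numpy as np

def find_breakpoints(series, reference, threshold_sd=4.0, window=24):
    """Flag times where the mean of (station - reference) shifts abruptly
    by more than `threshold_sd` standard deviations.

    `reference` is assumed to be a weighted average of highly correlated
    neighboring stations.  This is a simplified sketch of the criterion
    described in the text, not the production algorithm.
    """
    diff = np.asarray(series, dtype=float) - np.asarray(reference, dtype=float)
    breaks = []
    for t in range(window, len(diff) - window + 1):
        left = diff[t - window:t]
        right = diff[t:t + window]
        # Pooled within-window standard deviation, so a step at t does
        # not inflate the noise estimate used for the threshold.
        sd = np.sqrt((left.var(ddof=1) + right.var(ddof=1)) / 2.0)
        if abs(right.mean() - left.mean()) > threshold_sd * sd:
            breaks.append(t)
    return breaks
```

In practice, nearby flagged indices would be merged into a single cut before applying the scalpel.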

This empirical technique results in approximately 1 cut for every 12.2 years of record, which is somewhat more frequent than the changepoint occurrence rate of one every 15-20 years reported by Menne et al. (2009). Future work will explore alternative cut criteria, but the present effort is meant merely to incorporate the most obvious change points and show how our averaging technique can incorporate the discontinuity adjustment process in a natural way.

5. Outlier Weighting

The next potential problem to consider is point outliers, i.e. single data points that vary greatly from the expected value as determined by the local average. Removal of outliers is done by defining the difference between a temperature station’s report and the expected value at that same site:

$$\Delta T_i(t) = T_i(t) - \hat T(\vec{x}_i, t)$$

$$[28]\qquad \omega_i(t) = \begin{cases} 1 & \left|\Delta T_i(t)\right| \le 2.5\,\sigma_i \\ \dfrac{2.5\,\sigma_i}{\left|\Delta T_i(t)\right|} & \left|\Delta T_i(t)\right| > 2.5\,\sigma_i \end{cases}$$

Equation [28] specifies a downweighting term to be applied for point outliers that are more than $2.5\,\sigma_i$ from the modeled expectation, where $\sigma_i$ is the standard deviation of $\Delta T_i(t)$. This outlier weighting is used to define a modified expression for the baseline $\hat b_i$, and is also incorporated into the site weighting discussed below.

This choice of target threshold, $2.5\,\sigma_i$, is partly arbitrary but was selected with the expectation that most of the measured data should be unaffected. If the underlying data fluctuations were normally distributed, we would expect this process to crop 1.25% of the data.

In practice, we observe that the data fluctuation distribution tends to be intermediate between a normal distribution and a Laplace distribution. In the Laplace limit, we would expect to crop 2.9% of the data, so the actual exclusion rate can be expected to be intermediate between 1.25% and 2.9% for the typical station record.
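Both quoted exclusion rates follow directly from a 2.5-standard-deviation threshold (the value implied by the 1.25% normal-distribution rate) and can be checked with the standard library:

```python
import math

# Two-sided tail probability P(|x| > k*sigma) for unit-variance
# distributions, evaluated at the k = 2.5 threshold implied by the
# quoted 1.25% normal-distribution exclusion rate.

def normal_tail(k):
    # P(|Z| > k) for a standard normal variable
    return math.erfc(k / math.sqrt(2))

def laplace_tail(k):
    # A unit-variance Laplace distribution has scale b = 1/sqrt(2),
    # so P(|X| > k) = exp(-k / b) = exp(-k * sqrt(2))
    return math.exp(-k * math.sqrt(2))

print(round(100 * normal_tail(2.5), 2))   # 1.24 (percent cropped, normal case)
print(round(100 * laplace_tail(2.5), 2))  # 2.91 (percent cropped, Laplace case)
```

The observed exclusion rate for a typical station should therefore fall between these two limits.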

Of course, the goal is not to remove legitimate data, but rather to limit the impact of erroneous outliers. In defining equation [28], we chose to reduce the weight of outliers toward a fixed target rather than simply downweight them to zero. This helps to ensure numerical stability.

6. Reliability Weighting

In addition to point outliers, climate records often vary for other reasons that can affect an individual record’s reliability at the level of long-term trends. For example, we also need to consider the possibility of gradual biases that lead to spurious trends. In this case we assess the overall “reliability” of the record by measuring each record’s average level of agreement with the expected field $\hat T(\vec{x}, t)$ at the same location.

For each station we compute a measure of the quality of fit:

$$[30]\qquad \mathrm{SSD}_i = \frac{\sum_t \min\!\left[\Delta T_i(t)^2,\ \left(2.5\,\sigma_i\right)^2\right]}{\sum_t 1}$$

The “min” is used to avoid giving too great a weight to the most extreme outliers when judging the reliability of the series. The station weight is then defined as:

$$[31]\qquad w_i = \frac{2}{1 + 4\,\mathrm{SSD}_i/\sigma_i^2}$$

Due to the limits on outliers from the previous section, the station weight has a range between 1/13 and 2, effectively allowing a “perfect” station record to receive up to 26 times the weight of a “terrible” record. This functional form was chosen for the station weight due to several desirable qualities. The typical record is expected to have a weight near 1, with poor records being more severely downweighted than good records are enhanced. Using a relationship that limits the potential upweighting of good records was found to be necessary in order to ensure efficient convergence and numerical stability. A number of alternative weightings and functional forms with similar properties were also considered, but we found that the construction of global temperature time series was not very sensitive to the details of how the downweighting of inconsistent records was handled.
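The reliability weight can be sketched as follows. The weight uses one simple functional form, $2/(1 + 4\,\mathrm{SSD}_i/\sigma_i^2)$, chosen here because it reproduces the stated properties: a maximum weight of 2 for a perfect record and a minimum of 1/13 when every residual is capped at $2.5\,\sigma_i$:

```python
import numpy as np

def station_weight(deltas, sigma):
    """Illustrative station reliability weight.

    `deltas` are the residuals between a station's reports and the
    expected field at the same site.  The residuals entering the misfit
    are capped at 2.5 sigma (the "min"), and the weight is computed
    with a simple form consistent with the stated range of 1/13 to 2.
    """
    deltas = np.asarray(deltas, dtype=float)
    # Outlier-capped mean squared misfit (the "min" in equation [30]).
    ssd = np.mean(np.minimum(deltas**2, (2.5 * sigma) ** 2))
    return 2.0 / (1.0 + 4.0 * ssd / sigma**2)
```

A record agreeing perfectly with its neighbors receives weight 2; a record whose every residual exceeds the cap receives 1/13, a factor-of-26 spread.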

After defining the station weight, we need to incorporate this information into the spatial averaging process, e.g. equation [13], by adjusting the associated Kriging coefficients. Ideally, one might use the station weights to modify the correlation matrix (equation [12]) and recompute the Kriging coefficients. However, it is unclear what form of modification would be appropriate, and frequent recomputation of the required matrix inverses would be computationally impractical. So, we opted for a more direct approach to the reweighting of the Kriging solution.

We define updated spatial averaging coefficients:

$$[32]\qquad K'_i(\vec{x}) = \frac{w_i\,K_i(\vec{x})}{\sum_j w_j\,K_j(\vec{x}) + \left(1 - \sum_j K_j(\vec{x})\right)}$$

This form is motivated by the desire to leave the expected variance of the right-hand side unchanged after reweighting.

Because equation [13] gives $\hat W(\vec{x}) = \sum_i K_i(\vec{x})\,W_i$, it follows that $K'_i(\vec{x})$ is equal to $K_i(\vec{x})$ if all the station weights are set to 1. The $\left(1 - \sum_j K_j(\vec{x})\right)$ term in the denominator can be understood as measuring the influence of the global mean field, rather than the local data, in the construction of the local average temperature estimate. The omission of this term in equation [32] would lead to a weighting scheme that is numerically unstable.
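A sketch of this coefficient reweighting, assuming the form $K'_i = w_i K_i / (\sum_j w_j K_j + 1 - \sum_j K_j)$, which reduces to the original Kriging coefficients when all station weights equal 1:

```python
import numpy as np

def reweight_kriging(K, w):
    """Adjust Kriging coefficients K_i(x) at one location by station
    reliability weights w_i.

    The (1 - sum K_j) term in the denominator preserves the influence
    of the global mean field; with all weights equal to 1 the original
    coefficients are returned unchanged.
    """
    K = np.asarray(K, dtype=float)
    w = np.asarray(w, dtype=float)
    return w * K / (np.sum(w * K) + (1.0 - np.sum(K)))
```

This avoids modifying the correlation matrix and recomputing matrix inverses: the reweighting is a cheap elementwise operation on the existing Kriging solution.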

It is important to note that equation [32] merely says that the local weather average $\hat W(\vec{x}, t)$ should give proportionally greater weight to more reliable records. However, if all of the records in a given region have a similar value of $w_i$, then they will all receive about the same weight.

The updated values $\hat b_i$ and $\hat W(\vec{x}, t)$ are now used to replace the original values in the execution of the model. In order to ensure robustness, this process of determining site and outlier weights is repeated many times until the parameter values stabilize. We find that we typically require 10 to 30 iterations to satisfy our convergence criteria.
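The iterate-until-stable loop can be sketched generically; `update_step` below is a hypothetical stand-in for one full pass of outlier and site reweighting followed by re-averaging:

```python
def iterate_weights(update_step, params, tol=1e-6, max_iter=50):
    """Repeat a re-estimation pass until the parameter values stabilize.

    `update_step` is a hypothetical stand-in for one full pass of
    outlier and site re-weighting followed by re-averaging; it takes a
    list of parameter values and returns updated ones.
    """
    for i in range(max_iter):
        new_params = update_step(params)
        if max(abs(a - b) for a, b in zip(new_params, params)) < tol:
            return new_params, i + 1  # converged
        params = new_params
    return params, max_iter  # hit the iteration cap without converging
```

Provided the update is contractive, this fixed-point iteration converges geometrically, consistent with the modest iteration counts reported above.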

Implicit in the discussion of station reliability considerations are several assumptions. First, we assume that the local weather function constructed from many station records, $\hat W(\vec{x}, t)$, will be a better estimate of the local temperature than any individual record could be. This assumption is generally characteristic of all averaging techniques; however, we cannot rule out the possibility of large-scale systematic biases. Our reliability adjustment techniques can work well when one or a few records are noticeably inconsistent with their neighbors, but large-scale biases affecting many stations could cause such comparative estimates to fail. Second, we assume that the reliability of a station is largely invariant over time. This will in general be false; however, the scalpel procedure discussed previously will help us here. By breaking records into multiple pieces on the basis of metadata changes and/or empirical discontinuities, we then also have the opportunity to assess the reliability of each fragment individually. A detailed comparison and contrast of our results with those obtained using other approaches to deal with inhomogeneous data will be presented elsewhere.

7. Uncertainty Analysis

We consider there to be two essential forms of quantifiable uncertainty in the Berkeley Earth averaging process:

1. Statistical / Data-Driven Uncertainty: This is the error made in estimating the parameters $\hat b_i$ and $\hat\theta(t)$ due to the fact that the data, $T_i(t)$, may not be an accurate reflection of the true temperature changes at location $\vec{x}_i$.

2. Spatial Incompleteness Uncertainty: This is the expected error made in estimating the true land-surface average temperature due to the network of stations having incomplete coverage of all land areas.

In addition, there is “structural” or “model-design” uncertainty, which describes the error a statistical model makes relative to the real world as a consequence of the model’s design. Given that it is impossible to know absolute truth, model limitations are generally assessed by attempting to validate the underlying assumptions that a model makes and by comparing those assumptions to the approaches used by other models. For example, we use a site reliability weighting procedure to reduce the impact of anomalous trends (such as those associated with urban heat islands), while other models (such as those developed by GISS) attempt to remove anomalous trends by applying various corrections. Such differences are an important aspect of model design. In general, it is impossible to directly quantify structural uncertainties, and so they are not a factor in our standard uncertainty model. However, one may be able to identify model limitations by drawing comparisons between the results of the Berkeley Average and the results of other groups. Discussion of our results and comparison to those produced by other groups will be provided below.

Another technique for identifying structural uncertainty is to run the same model on multiple data sets that differ primarily in factors that one suspects may give rise to unaccounted-for model errors. For example, one can perform an analysis of rural data and compare it to an analysis of urban data to look for urbanization biases. Such comparisons tend to be non-trivial, since it is rare that one can construct data sets that isolate the experimental variables without introducing other confounding variations. We will not provide such analyses in this paper; however, additional papers submitted by our group (Wickham et al. submitted; Muller et al. submitted) find that objective measures of station quality and urbanization do not have a statistically significant impact on our results over most of the available record. In other words, the averaging techniques combined with the bias adjustment procedures we have described appear adequate for dealing with those data quality issues to within the limits of the uncertainties that nonetheless exist from other sources. The one possible exception is that Wickham et al. observed that rural stations may slightly overestimate global land-surface warming during the most recent decade. The suggested effect is small and opposite in sign to what one would expect from an urban heat island bias. At the present time we are not incorporating any explicit uncertainty to account for such factors, though the data-driven uncertainty will implicitly capture the effects of variations in data behavior across the field.