# Robert Rohde1, Judith Curry2, Donald Groom3, Robert Jacobsen3,4, Richard A. Muller1,3,4, Saul Perlmutter3,4, Arthur Rosenfeld3,4, Charlotte Wickham5, ...

The other analysis groups generally discuss a concept of “bias error” associated with systematic biases in the underlying data (e.g. Brohan et al. 2006; Smith and Reynolds 2005). To a degree these concepts overlap with the discussion of “structural error”, in that the prior authors tend to add extra uncertainty to account for factors such as urban heat islands and instrumental changes in cases where they do not model them directly. Based on graphs produced by HadCRU, such “bias error” was considered a negligible portion of total error during the critical post-1950 period of modern warming, but leads to an increase in total error of up to 100% circa 1900 (Brohan et al. 2006). In the current presentation we will generally ignore these additional uncertainties; they will be discussed once future papers have examined the various contributing factors individually.

## 8. Statistical Uncertainty – Overview

Statistical uncertainty reflects the errors introduced into the determination of model parameters by the fact that the basic data, $T_i(t)$, may not be an accurate reflection of the true temperature history. In order to place uncertainties on the global mean temperature time series $\hat{T}(t)$, we apply two approaches: a systematic “sampling” method and a “jackknife” method (Quenouille 1949; Tukey 1958; Miller 1974).

These approaches both differ from those that have been commonly used in the past. Prior groups generally assess uncertainty from the bottom up, by assigning uncertainty to the initial data and to all of the intermediate processing steps. This is a complicated process due to the possibility of correlated errors and the risk that those uncertainties may interact in unexpected ways. Further, one commonly applies the same amount of data uncertainty to all records, even though we would expect some records to be more accurate than others.

As an alternative, we approach the statistical uncertainty quantification from a top-down direction. At its core, this means measuring how much our result would change if there were variations in the amount of data available. By performing the entire analysis chain with small variations in the amount of data available we can assess the impact of data noise in a way that bypasses concerns over correlated error and varying record uncertainty. For a complex analysis system this will generally provide a more accurate measure of the statistical uncertainty, though there are some additional nuances.

## 9. Statistical Uncertainty – Sampling Method

The sampling method we apply relies on subsampling the station network, recomputing the temperature time series, and examining the variance in the results across the different samples. In the implementation used for the current paper, each station is randomly assigned to one of five groups. Each of these groups can be expected to have similar, but somewhat diminished, spatial coverage compared to the complete sample. For each group of stations we reapply the averaging process. This leads to a set of new temperature time series $\hat{T}_n(t)$, where the index $n$ denotes the subsample number. As each of these new time series is created from a completely independent station network, we are justified in treating their results as statistically independent.

For each subsampled network, we compute the mean temperature for an arbitrary period, e.g. Jan 1950 to Dec 2000, and subtract this from the data; this gives us five subsampled records expressed with the same temperature “anomaly” baseline. We do this to separate the uncertainty associated with relative changes in the global land-surface time series from the larger uncertainty associated with the estimation of the Earth’s absolute mean temperature. We then estimate the statistical uncertainty of $\hat{T}(t)$ as the standard error in the mean of the subsampled values, namely

$$\sigma_{\text{stat}}(t) = \sqrt{ \frac{ \sum_{n=1}^{N(t)} \left( \hat{T}_n(t) - \langle \hat{T}_n(t) \rangle \right)^2 }{ N(t) \left( N(t) - 1 \right) } } \qquad [35]$$

where $\langle \hat{T}_n(t) \rangle$ denotes the mean value over the subsamples and $N(t)$ is the number of subsamples reporting a value at time $t$. In general, $N(t)$ will be 5 at times where all five subsamples report a value. However, since the different subsamples may have somewhat different time coverage, the number of records reported at early times might be smaller. We require that at least three subsamples report a value in order for an uncertainty to be reported.
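The subsampling estimate can be sketched as follows. This is a minimal illustration, not the Berkeley Average code: a simple station mean stands in for the full averaging chain, each group is re-baselined against its own time mean rather than a fixed Jan 1950 – Dec 2000 period, and the function name is an assumption for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def sampling_uncertainty(station_series, n_groups=5, min_groups=3):
    """Subsampling ("sampling method") uncertainty estimate.

    station_series: array of shape (n_stations, n_times); NaN marks
    months where a station reports no value.  A simple station mean
    stands in for the full Berkeley averaging chain.
    """
    n_stations, _ = station_series.shape
    groups = rng.integers(0, n_groups, size=n_stations)  # random 1-of-5 assignment

    # Average each subsampled network, then re-baseline so all groups
    # share the same anomaly zero (here, each group's own time mean).
    subs = []
    for g in range(n_groups):
        avg = np.nanmean(station_series[groups == g], axis=0)
        subs.append(avg - np.nanmean(avg))
    subs = np.array(subs)

    n_valid = np.sum(~np.isnan(subs), axis=0)   # subsamples reporting at each time
    mean = np.nanmean(subs, axis=0)
    # Standard error in the mean across subsamples, as in equation [35].
    sse = np.nansum((subs - mean) ** 2, axis=0)
    with np.errstate(invalid="ignore", divide="ignore"):
        sigma = np.sqrt(sse / (n_valid * (n_valid - 1)))
    sigma[n_valid < min_groups] = np.nan        # require at least 3 subsamples
    return mean, sigma
```

With complete coverage the denominator is simply $5 \times 4$ at every time step; the NaN-aware reductions handle the early-period case where some groups report no value.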

Examples of subsampled temperature series and the resulting uncertainty will be provided with the discussion of GHCN results.

The sampling estimate could be further refined. One method would be to repeat the entire process of creating five subsamples through multiple iterations and average the results.

Unfortunately, though conceptually simple and computationally efficient, the sampling method suffers from a flaw that leads to a systematic underestimation of the statistical uncertainty in our context. In forming each subsampled network, 80% of stations must be eliminated. This increases the effect of spatial uncertainty associated with each of these subsamples. Further, due to the highly heterogeneous history of temperature sampling, the newly unsampled regions in each subnetwork will tend to overlap to a substantial degree leading to correlated errors in the uncertainty calculation. Based on a variety of Monte Carlo experiments, we concluded that the sampling estimates of uncertainty tend to understate the true error by between 10 and 100% depending on the distribution of the temperature-monitoring network at the time.

## 10. Statistical Uncertainty – Jackknife Method

The “jackknife”, a method developed by Quenouille and John Tukey, is our primary method for determining statistical uncertainty (Quenouille 1949; Tukey 1958; Miller 1974). It is a special modification of the sampling approach, finding its traditional use when the number of data points is too small to give a good result using ordinary sampling. Given that we have many thousands of stations in our records, each with typically hundreds of data points, it was surprising to us that this method would prove so important. But despite our large set of data, there are times and places that are sparsely sampled. As noted above, the presence of this sparse sampling tends to cause the sampling technique to underestimate the statistical uncertainty.

We use the jackknife method in the following way. Given a set of stations (7,280, when using the GHCN compilation) we construct 8 station groups, each consisting of 7/8 of the data, with a different 1/8 removed from each group. The data from each of these samples is then run through the entire Berkeley Average machinery to create 8 records $\hat{T}_n(t)$ of average global land temperature vs. time. Following Quenouille and Tukey, we then create a new set of 8 “effectively independent” temperature records $\hat{T}'_n(t)$ by the jackknife formula

$$\hat{T}'_n(t) = 8\,\hat{T}(t) - 7\,\hat{T}_n(t) \qquad [36]$$

where $\hat{T}(t)$ is the reconstructed temperature record from the full (100%) sample. Hence we **calculate the standard error among the effectively independent samples:**

$$\sigma_{\text{stat}}(t) = \sqrt{ \frac{ \sum_{n=1}^{8} \left( \hat{T}'_n(t) - \langle \hat{T}'_n(t) \rangle \right)^2 }{ 8 \times 7 } } \qquad [37]$$

We indeed found that the typical statistical uncertainties estimated from the jackknife were, in general, larger than those estimated from the sampling method. As the jackknife constructs its temperature average using a station network that is nearly complete, it is more robust against spatial distribution effects. In addition, we can more easily increase the number of samples without worrying that the network would become too sparse (as could happen if one increased the number of divisions in the sampling approach).
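The pseudovalue computation described above can be sketched compactly. The function name is an assumption for illustration; in practice the delete-one-group estimates come from rerunning the full averaging machinery on each 7/8 network.

```python
import numpy as np

def jackknife_uncertainty(full_estimate, leave_out_estimates):
    """Jackknife standard error from delete-one-group estimates.

    full_estimate:       T_hat(t) from the complete network, shape (n_times,)
    leave_out_estimates: T_hat_n(t), one row per deleted group,
                         shape (n_groups, n_times)
    """
    n = leave_out_estimates.shape[0]
    # Pseudovalues ("effectively independent" records):
    #   T'_n(t) = n * T_full(t) - (n - 1) * T_n(t)
    pseudo = n * full_estimate - (n - 1) * leave_out_estimates
    mean = pseudo.mean(axis=0)
    # Standard error among the pseudovalues.
    return np.sqrt(np.sum((pseudo - mean) ** 2, axis=0) / (n * (n - 1)))
```

A convenient sanity check: for a linear statistic such as a simple mean, the pseudovalues reduce to the original data points, so the jackknife standard error equals the classical standard error of the mean.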

We studied the relative reliability of the sampling and jackknife methods using over 10,000 Monte Carlo simulations. For each of these simulations, we created a toy temperature model of the “Earth” consisting of 100 independent climate regions. We simulated data for each region using a distribution function chosen to mimic the distribution of the real data; so, for example, some regions had many sites while others had only 1 or 2. This model verified that sparse regions caused problems for the sampling method. In these tests we found that the jackknife method gave a consistently accurate measure of the true error (known, since in the Monte Carlo we knew the “truth”), while the sampling method would consistently underestimate the true error.

When we discuss the results for our reanalysis of the GHCN data we will show the error uncertainties calculated both ways. The jackknife uncertainties are larger than those computed via sampling, but based on our Monte Carlo tests, we believe them to be more accurate.

**and then define the spatial uncertainty in $\hat{T}(t)$ as:**

$$\sigma_{\text{spatial}}(t) = \sqrt{ \frac{ \sum_{\vec{x}} W(\vec{x}, t) \left( \hat{T}(\vec{x}, t) - \hat{T}(t) \right)^2 }{ \sum_{\vec{x}} W(\vec{x}, t) } } \qquad [39]$$

The new symbol $W(\vec{x}, t)$ is introduced to focus the estimates of local variance on only those times when at least 40% of the variance has been determined by the local data. In addition, the term $W(\vec{x}, t)$ provides a correction to the magnitude of the fluctuations in $\hat{T}(\vec{x}, t)$ in the presence of incomplete sampling. Recall that $\hat{T}(\vec{x}, t)$ decays toward the climatological mean as the local data coverage vanishes, which reflects the fact that there can be no knowledge of the local fluctuations in the field when no data are available in the local neighborhood.

The estimate of spatial uncertainty from equation [39] tends to be 30–50% larger than the result of equation [40] at early times (e.g. pre-1940). We believe this is because the linearized error propagation formula in equation [40] and the approximate correlation function estimated in equation [14] don’t capture enough of the structure of the field, and that the formulation in equation [39] is likely to be superior at early times. At late times the two results are nearly identical; however, both estimates of the uncertainty due to spatial incompleteness at late times tend to be far lower than the statistical uncertainty. In other words, at times when the spatial coverage of the Earth’s land surface is nearly complete, the uncertainty is dominated by statistical factors rather than spatial ones.

As noted above, the empirical uncertainty estimate of equation [39] is partially limited by incomplete sampling during the target interval. To compensate, we add a small analytical correction, determined via equation [40], to our final spatial uncertainty estimates in regions with incomplete sampling. This correction is essentially negligible except at late times.

## 12. GHCN Results

The analysis method described in this paper has been applied to the 7,280 weather stations in the Global Historical Climatology Network (GHCN) monthly average temperature data set (Peterson and Vose 1997; Menne and Williams 2009). We used the non-homogenized data set, with none of the NOAA corrections for inhomogeneities included; rather, we applied our scalpel method to break records at any documented discontinuity, and used the empirical scalpel method described earlier to detect undocumented changes. With these cuts, the original 7,280 data records were broken into 47,282 record fragments. Of the 30,590 cuts, 5,218 were based on gaps in record continuity longer than 1 year and the rest were found by our empirical method. We also found a small number of nonsense data points in the raw data, for example values exceeding 70 °C, records filled with zeros, or other repeated strings of data; these were eliminated by a pre-filtering process. In total, 0.8% of the data points were eliminated for such reasons. The NOAA analysis process applies its own pre-filtering in its homogenization and averaging steps, but we chose to handle these points directly because of our preference for using the raw GHCN data with no prior corrections. A further 0.2% of data was eliminated because, after cutting and filtering, the resulting record was either too short to process (less than the 6-month minimum length) or it occurred at a time with fewer than 5 total stations active.
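The pre-filtering step can be illustrated with a sketch like the following. The function name and the 12-month constant-run threshold are illustrative assumptions, not the actual filter used in the analysis.

```python
import numpy as np

def prefilter(values, max_abs_temp=70.0, max_run=12):
    """Flag obviously invalid monthly values: physically impossible
    magnitudes and long runs of an identical repeated value (e.g.
    records filled with zeros).  Returns True where a value should
    be dropped.  values: 1-D float array, NaN = missing."""
    values = np.asarray(values, dtype=float)
    bad = np.abs(values) > max_abs_temp          # e.g. readings beyond 70 C

    # Flag constant runs longer than max_run months.
    run_start = 0
    for i in range(1, len(values) + 1):
        if i == len(values) or not values[i] == values[run_start]:
            if i - run_start > max_run and np.isfinite(values[run_start]):
                bad[run_start:i] = True
            run_start = i
    return bad
```

Note that NaN gaps terminate a run automatically (NaN never compares equal to itself), so missing months are never flagged as part of a repeated string.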

It is worth emphasizing that after cutting and processing, the median length of a temperature time series processed by the Berkeley Average was only 7.1 years.

Further, the inner 50% range for station record lengths was 2.7 to 12.8 years. As already stated, our climate change analysis system is designed to be very tolerant of short and discontinuous records which will allow us to work with a wider variety of data than is conventionally employed.
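To be explicit about terminology, the "inner 50% range" quoted above is the interquartile range of the distribution of fragment lengths. A minimal sketch of the computation (hypothetical helper name):

```python
import numpy as np

def record_length_summary(lengths_years):
    """Median and inner 50% (interquartile) range of record lengths."""
    lo, med, hi = np.percentile(lengths_years, [25, 50, 75])
    return med, (lo, hi)
```

Applied to the actual fragment lengths, this would yield the quoted median of 7.1 years and the 2.7 to 12.8 year inner range.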

Figure 4 shows the station locations used by GHCN, the number of active stations vs.