NOAA Environmental Data Management Framework

NOAA is, at its foundation, an environmental information generating organization. Fundamental to ...
3.2.1. Data Collection Data Collection typically refers to the initial steps of receiving raw data from an environmental sensor or an observing campaign. Collection may also include purchasing commercial datasets, negotiating arrangements for access to data from foreign systems, issuing contracts for data collection, and issuing research grants that may result in the creation of environmental data. NOAA grantees are required by the Data Sharing for NOAA Grants Procedural Directive (10) to include a data sharing plan with their proposals.
3.2.2. Data Processing Data Processing includes all the steps necessary to transform raw data into usable data records and to generate the suite of routine data products. Such processing is typically performed by specialized systems that have their own internal data management controls. Users do not normally have direct access to the processing system. However, the design of these systems can have a great impact on the cost to the agency and on the timeliness, preservation, and quality of the resulting data records and products. Processing systems should not be built from scratch for each observing system, because this does not enable the agency to leverage past investments or existing resources.
3.2.3. Quality Control NOAA data should be of known quality, which means that data documentation includes the result of quality control (QC) processes, and that descriptions of the QC processes and standards are available.
QC tests should be applied to data, including, as appropriate, automated QC in near-real-time, automated QC in delayed mode, and human-assisted checks. Quality-assurance (QA) processes should be applied to validate that observations meet their intended requirements throughout the Data Lifecycle.
QA may also include intercalibration of data from sensors on multiple systems. All QC and QA checks should be publicly described. The results of these checks should be included in metadata as error estimates or flagging of bad or suspect values. Raw data that have not undergone QC should be clearly documented as being of unknown quality.
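As an illustration of the automated QC described above, the following sketch flags values that fall outside a plausible range. The thresholds and flag values here are hypothetical; actual tests and flag conventions are defined by each observing program.

```python
# Minimal sketch of an automated range check producing QC flags.
# Thresholds and flag codes (1 = good, 4 = bad) are illustrative only.
def range_check(values, lo=-5.0, hi=45.0):
    """Return a QC flag for each value: 1 if within [lo, hi], else 4."""
    return [1 if lo <= v <= hi else 4 for v in values]

temps = [12.3, 18.9, 99.9, 21.4]   # one implausible value
flags = range_check(temps)
print(flags)  # [1, 1, 4, 1]
```

In practice such flags would be recorded in the dataset's metadata or alongside each value, so that downstream users can distinguish good, suspect, and bad observations.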
3.2.4. Documentation Data documentation provides information about the spatial and temporal extents, source, lineage, responsible parties, descriptive attributes, quality, accuracy, maturity, known limitations, and logical organization of the data. Formal, structured documentation is known as metadata. Metadata are critical for documenting and preserving NOAA's data assets. Standardized metadata support interoperability with catalogs, archives, and data analysis tools to facilitate data discovery and use. Correct and complete metadata are essential to ensuring that data are used appropriately and that any resulting analyses are credible.
The core metadata standards for NOAA environmental data are ISO 19115 (content) and ISO 19139 (Extensible Markup Language [XML] schema), as established by the Data Documentation Procedural Directive (9). Some older metadata records use the more limited Federal Geographic Data Committee (FGDC) Content Standard for Digital Geospatial Metadata (CSDGM); these should be converted to ISO and then improved using approaches and tools described on the EDM Wiki *. Conversion of well-structured metadata (e.g., in FGDC XML) to ISO is relatively straightforward, but non-standard or freeform documentation is more problematic.
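To make the ISO 19139 encoding concrete, the sketch below emits a fragment of namespaced XML using the standard gmd/gco namespaces. The element structure is greatly simplified (a conformant MD_Metadata record requires many more elements in a prescribed order); this only illustrates the namespace and CharacterString pattern.

```python
import xml.etree.ElementTree as ET

# Simplified sketch of an ISO 19139-style fragment. The gmd/gco
# namespace URIs are the standard ones; the element nesting here is
# abbreviated and NOT a schema-valid MD_Metadata record.
GMD = "http://www.isotc211.org/2005/gmd"
GCO = "http://www.isotc211.org/2005/gco"
ET.register_namespace("gmd", GMD)
ET.register_namespace("gco", GCO)

md = ET.Element(f"{{{GMD}}}MD_Metadata")
title = ET.SubElement(md, f"{{{GMD}}}title")
cs = ET.SubElement(title, f"{{{GCO}}}CharacterString")
cs.text = "Example NOAA dataset"

xml_text = ET.tostring(md, encoding="unicode")
print(xml_text)
```

Validating generated records against the ISO 19139 schema, rather than hand-inspecting them, is what makes conversion of well-structured FGDC XML tractable at scale.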
3.2.5. Cataloging "Cataloging" is used here in a general sense to refer to all mechanisms established by data providers to enable users to find data. The word "Discovery" is employed below (Section 3.3) to refer to the user's act of finding data; Cataloging enables Discovery.
NOAA environmental data should be readily discoverable because modern research and decision-making depend critically on the ability to find relevant data from multiple agencies and disciplines.
Cataloging methods include enabling commercial search engines to index data holdings, establishing formal standards-based catalog services, and building web portals that are thematic, agency-specific, or government-wide. General web searching is often the first step for potential users, so this activity should be supported. However, advanced searching based on location, time, semantics or other data attributes requires formal catalog services.
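The formal, standards-based catalog services mentioned above are typically exposed through interfaces such as the OGC Catalogue Service for the Web (CSW). The sketch below constructs a CSW GetRecords query string for a spatial search; the endpoint URL is hypothetical and the parameter set is a simplified key-value-pair encoding.

```python
from urllib.parse import urlencode

# Sketch of an OGC CSW 2.0.2 GetRecords request (KVP encoding) for a
# bounding-box search. The catalog endpoint is a placeholder, and real
# requests usually need additional parameters (resultType, outputSchema).
params = {
    "service": "CSW",
    "version": "2.0.2",
    "request": "GetRecords",
    "typeNames": "csw:Record",
    "constraintLanguage": "CQL_TEXT",
    "constraint": "BBOX(ows:BoundingBox, -130, 20, -60, 50)",
}
url = "https://catalog.example.gov/csw?" + urlencode(params)
print(url)
```

Queries like this, constrained by location, time, or other attributes, are exactly what general-purpose web search cannot provide and why formal catalog services remain necessary.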
The proliferation of portals such as data.gov, geo.data.gov, ocean.data.gov, NASA Global Change Master Directory (GCMD), Group on Earth Observations (GEO) Portal and others means that data providers are asked to register multiple times in different sites. This is not scalable and leads to redundant effort and duplicated cataloging of datasets. Data providers should be able to register their service in a single catalog and have other catalogs and portals automatically become aware of the new data. Some of the recommendations in Appendix A address this.
3.2.6. Dissemination "Dissemination" is used here to mean both actively transmitting data and, more typically, enabling users to access data on request. NOAA environmental data should be readily accessible to intended customers as well as other potential users. Many users prefer direct access to online data via internet services that allow customized requests rather than bulk download of static files or delayed access via ordering services for near-line data. For high-volume data collections requiring near-line storage, NOAA data managers should carefully consider cloud hosting strategies and caching algorithms based on usage tracking to maximize the likelihood of popular data being online. Online services should comply with open interoperability specifications for geospatial data, notably those of OGC, ISO/TC211, and Unidata.
Actively transmitting data to operational customers is necessary in some cases. However, establishing new data conduits that are proprietary or duplicative should be avoided. Existing distribution channels should be shared where possible. Commodity hardware and software should be used in preference to custom-built systems. Government-funded open-source technologies * should be considered.
Data should be offered in formats that are known to work with a broad range of scientific or decision-support tools. Common vocabularies, semantics, and data models should be employed.
* E.g., Unidata Internet Data Distribution/Local Data Manager (IDD/LDM) system.
3.2.7. Preservation and Stewardship Data preservation ensures data are stored and protected from loss. Stewardship ensures data continue to be accessible (for example, by migrating to new storage technologies) and are updated, annotated or replaced when there are changes or corrections. Stewardship also includes reprocessing when errors or biases have been discovered in the original processing.
The NOAA National Data Centers -- NCDC, NGDC, and NODC -- are operated by NESDIS but perform data preservation and stewardship on behalf of the entire agency. NOAA data producers must establish a submission agreement with one of these data centers as described in the Procedure for Scientific Records Appraisal and Archive Approval (8), and must include archiving costs in their budget. To ensure data produced by grantees are archived, new Federal Funding Opportunities (FFOs) should arrange and budget in advance with a NOAA Data Center for archiving of data to be produced by the funded investigators.
Because an observation cannot be repeated once the moment has passed, all observations should be archived. Not only the raw data but also the accompanying information needed to understand the conditions of the observation (e.g., satellite maneuvers, instrument reports, change history of in situ instruments) should be preserved. In some cases, especially high-resolution satellite imagery, strict compliance with this principle would impose substantial additional costs on telecommunications networks and data storage systems. In cases where that cost is not budgeted, the issue will be brought, following a cost/benefit analysis, to the NOSC for guidance as to whether additional funds should be requested through the budget process.
The representation of data that needs to be preserved and stewarded for the long term should be negotiated with the Data Center and identified in the relevant data management plan. Key derived products, or the relevant versions of software necessary to regenerate products that are not archived, should also be preserved. The Procedure for Scientific Records Appraisal and Archive Approval (8) defines a process and includes a questionnaire to determine what to archive.
Some numerical model outputs should be preserved. These outputs are often voluminous or ephemeral, and what subset to archive should be carefully considered. The criteria for such decisions are outside the scope of this Framework.
Data rescue refers to the preservation of data that are at risk of loss. Such data include information recorded on paper, film, or obsolete media, or lacking essential metadata, or stored only in the scientist's computer. Data rescue is expensive, much more expensive than assuring the preservation of data properly from the outset.
Data that have been sent to a NOAA Data Center should also be discoverable and accessible as described in the preceding sections. Ideally, the mechanisms for cataloging and disseminating archival data should be interoperable with those for near-real-time data.
3.2.8. Final Disposition Each NOAA National Data Center already has a records retention schedule that documents the length of time it will retain particular classes of data and products. Each data producer should also have a records retention schedule indicating when their data should be transferred to a Data Center for long-term preservation. As IT resource consolidation and reduction occurs, it will become increasingly necessary to transfer custody of data records from local servers and services to NOAA Data Centers.
Retirement and eventual removal of archived material requires resources to update metadata, to request and respond to public comments, and to provide public notification of removal. The metadata record might be preserved indefinitely.
3.2.9. Usage Tracking Usage tracking refers to NOAA's ability to measure how often datasets are being used. Crude estimates can be made by counting data requests or data transmission volumes from Internet servers. However, such statistics do not reveal whether data that were obtained were actually used, whether they were helpful if used, or whether the initial recipient redistributed the data to other users.
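The crude request-counting approach can be sketched as follows. The log lines and the path layout (dataset name as the second path segment) are hypothetical; real server logs require more robust parsing.

```python
from collections import Counter

# Sketch: crude usage estimation by tallying requests per dataset from
# a (hypothetical) web access log. Assumes paths of the form
# /data/<dataset>/... ; real logs need robust parsing and bot filtering.
log_lines = [
    "GET /data/sst/2011/file1.nc 200",
    "GET /data/sst/2011/file2.nc 200",
    "GET /data/tides/station42.csv 200",
]
counts = Counter(line.split("/")[2] for line in log_lines)
print(counts)  # Counter({'sst': 2, 'tides': 1})
```

As the text notes, counts like these say nothing about whether the downloaded data were actually used or redistributed, which is why more sophisticated, privacy-preserving measures are desirable.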
More sophisticated means of assessing usage while preserving the anonymity of users are desirable.
NOAA data producers, in collaboration with a NOAA Data Center, should assign persistent identifiers to each dataset, and include the identifier in every metadata record and data file. The Data Citation Procedural Directive (in preparation) will address this topic. Researchers and other users will be encouraged to cite the datasets they use (see Section 3.3).
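A persistent identifier assigned as described above would appear in every metadata record and data file, and in citations. The sketch below shows one way a dataset-level record might carry such an identifier; the DOI value and field names are placeholders, not an actual NOAA identifier scheme.

```python
# Sketch: embedding a persistent identifier in a dataset-level metadata
# record and forming a citation string. The DOI prefix "10.XXXX" and the
# field names are illustrative placeholders, not real NOAA identifiers.
record = {
    "title": "Example NOAA dataset",
    "publisher": "NOAA National Data Center",
    "identifier": "doi:10.XXXX/example",  # hypothetical placeholder DOI
}
citation = f"{record['publisher']}: {record['title']}. {record['identifier']}"
print(citation)
```

Because the identifier travels with both the metadata and the data files, a citation built this way remains resolvable even after the data have been migrated to new storage or services.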
3.3. Usage Activities The third phase of the data lifecycle is Usage. These activities are typically outside the scope of data manager influence -- once a user has obtained a copy of the desired data, what he or she does with it may be unknown or uncontrolled. However, the ability to obtain and use data is certainly a by-product of a good lifecycle data management process, and information from or about users may influence or improve the data management process. NOAA is the biggest user of its own data, so improvements in data management could reduce cost and complexity within the agency.
* http://ils.unc.edu/~janeg/dartg/ † One example of NOAA data at risk is analog tide gauge data recorded on paper (marigrams) stored in over 1000 boxes at the US National Archives.
Users must be able to Discover and Receive data they want. These activities are enabled by NOAA Cataloging and Dissemination activities (Sections 3.2.5 and 3.2.6).
Analysis is defined broadly to include such activities as a quick evaluation to assess the usefulness of a dataset, or the inclusion of a dataset among the factors leading to a decision, or an actual scientific analysis of data in a research context, or data mining. Such activities are only possible if the data have been well-documented (Section 3.2.4) and are of known quality (Section 3.2.3).