«NOAA Environmental Data Management Framework NOAA is, at its foundation, an environmental information generating organization. Fundamental to ...»
2.5.1. Infrastructure
NOAA infrastructure involved in environmental data management includes the observing platforms and systems themselves, data collection and processing systems, the archival data centers (NCDC, NGDC, NODC) and their associated systems for data ingest, storage and stewardship, other NOAA centers of data, dedicated data links such as the WMO Global Telecommunication System (GTS) and Satellite Broadcast Network (SBN), general-purpose network infrastructure, high-performance computing systems, and other computing resources. NOAA partners also operate infrastructure for data that NOAA may ingest.
These infrastructure components are expensive to acquire and maintain. Costs can be reduced over the long term by avoiding project-specific systems built from scratch. Instead, gradual adoption of commodity hardware and software, and the establishment of enterprise systems that provide functionality for multiple projects or the entire agency, are preferable. Adoption of interoperability standards (see Section 2.4) will support and simplify information exchange among NOAA systems and between NOAA and external data providers. Costs may also be reduced by using commercial or NOAA-operated Cloud services (shared, pay-as-you-go information technology (IT) resources such as storage, processing, or software that can be scaled up or down based on demand). *
* See Appendix B: Cloud Computing for further discussion.
2.5.2. Service-Based Approach
NOAA environmental data must be available to users both inside and outside of NOAA. It is more efficient to make a given dataset accessible from a single authoritative source than to have users download, maintain, and possibly redistribute multiple copies, because the timeliness and accuracy of duplicative collections become increasingly uncertain. NOAA data and metadata should therefore be delivered through services -- that is, through web-based interfaces that can be invoked by software applications. These services can offer functions such as searching for data, retrieving a copy or a subset of data, visualizing data (e.g., producing a colored map or a time-series graph), or otherwise transforming data (e.g., converting to other formats or other coordinate systems). Rather than establishing vertically integrated "stovepipes" that only provide services for specific users and customers, a shared-services architecture, as illustrated in Figure 4, is recommended.
Figure 4: Schematic of shared-services architecture. Rather than explicitly linking individual data producers to specific customer applications, data management services and tools are generalized and decoupled as much as possible. Shared services can be established at an agency level (e.g., for data catalogs), and compatible services (e.g., based on the same pre-approved software) can be established at the program level where needed.
Services should be as consistent and standardized as possible to simplify the programming of applications that can integrate information from multiple sources. Such applications currently exist for a variety of well-known service types. New or enhanced applications can be written by NOAA, our partners, and the private sector as needed. The Digital Government Strategy (16) states that "We must enable the public, entrepreneurs, and our own government programs to better leverage the rich wealth of federal data to pour into applications and services by ensuring that data [are] open and machine-readable by default."
NOAA data exist in many heterogeneous systems managed by multiple independent operators. National and NOAA activities in support of data center consolidation are designed to reduce the total number of computing facilities with dedicated power and cooling, and often with underutilized capacity, as a cost-saving measure. However, consolidation is unlikely to result in completely merging all diverse NOAA systems for distributing and archiving data into a single master system. Even if such a target state were achievable within NOAA, other agencies, other nations, and the private sector will retain their own systems. A federated systems approach, as illustrated in Figure 5, is therefore necessary to leverage and harmonize multiple legacy, modern, and future systems that have evolved separately and are managed independently. A federated system is a collection of project-specific or agency-wide information systems that are independently managed and loosely coupled in a way that provides the behavior of a single
Figure 5: Schematic of service-based approach to providing access to data and metadata from observing systems. Data are stored in databases or file systems. Data access is mediated by services that provide security (limiting direct interaction with the back-end system), convenience (providing a table of contents and allowing customized subsets to be requested), and standardization (making access methods and formats compatible even if the internal storage differs). Catalogs can be built from these data access services, and can provide a discovery service to enable users to search for data. Value-added services such as visualization or other transformations can be provided, either by the original data holders or by third parties. Thematic portals can be constructed to present a unified access point to related datasets from multiple sources.
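The service-based, federated pattern described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of any NOAA system: the class names (DataAccessService, InMemoryService, Catalog), the dataset identifiers, and the record layout are all hypothetical. It shows independently managed back ends exposing the same interface (a table of contents plus subsetting), and a catalog built from those services to provide discovery.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class DataAccessService(ABC):
    """Common interface that hides how each back end stores its data."""

    @abstractmethod
    def list_datasets(self) -> List[str]:
        """Return a table of contents: the dataset ids this service offers."""

    @abstractmethod
    def subset(self, dataset_id: str, start: int, end: int) -> List[dict]:
        """Return records with start <= time <= end for one dataset."""


class InMemoryService(DataAccessService):
    """Hypothetical back end; a real one might wrap a database or file system."""

    def __init__(self, name: str, datasets: Dict[str, List[dict]]):
        self.name = name
        self._datasets = datasets  # callers go through the methods below

    def list_datasets(self) -> List[str]:
        return sorted(self._datasets)

    def subset(self, dataset_id: str, start: int, end: int) -> List[dict]:
        records = self._datasets[dataset_id]
        return [r for r in records if start <= r["time"] <= end]


class Catalog:
    """Discovery service built by harvesting the access services."""

    def __init__(self, services: List[DataAccessService]):
        self._index: Dict[str, DataAccessService] = {}
        for service in services:
            for dataset_id in service.list_datasets():
                self._index[dataset_id] = service

    def search(self, keyword: str) -> List[str]:
        return sorted(d for d in self._index if keyword.lower() in d.lower())

    def fetch(self, dataset_id: str, start: int, end: int) -> List[dict]:
        # Route the request to whichever service holds the dataset.
        return self._index[dataset_id].subset(dataset_id, start, end)


# Two independently managed systems, federated behind one catalog.
buoys = InMemoryService("buoys", {
    "sea_surface_temperature": [
        {"time": 1, "value": 14.2},
        {"time": 2, "value": 14.5},
        {"time": 3, "value": 14.9},
    ],
})
stations = InMemoryService("stations", {
    "air_temperature": [{"time": 1, "value": 9.8}, {"time": 2, "value": 10.1}],
})
catalog = Catalog([buoys, stations])

matches = catalog.search("temperature")
subset = catalog.fetch("sea_surface_temperature", 2, 3)
```

Because the catalog depends only on the shared interface, a new back end can join the federation without changes to client applications, which is the decoupling the shared-services architecture aims for.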
2.5.3. Designing for Flexibility
Innovations in IT and engineering are frequent and may offer significant benefits in cost or efficiency.
NOAA should strive for modular and flexible architectures for observing systems, data management systems, and IT infrastructure in order to allow emerging technologies to be readily implemented.
Custom-built, vertically integrated systems guided by inflexible design methodologies should be avoided because they are difficult to modify and may lock NOAA into old technologies or specific vendors.
Version 1.0, 2013-03-14
2.6. Assessment
Assessment of NOAA data management activities includes estimating the current state, measuring progress, and getting feedback from users and implementers. The attributes we can assess include completeness of EDM planning, quality of metadata, level of data accessibility, and successful preservation for the long term.
• Estimating the current state of NOAA EDM: The Technology, Planning and Integration for Observation (TPIO) * program is assessing how data from NOAA Observing Systems of Record are managed. This will provide a baseline status.
• Measuring progress: Line-office representatives report on the implementation of Procedural Directives at meetings of the EDMC. The EDMC chair reports progress to NOSC and CIO Council several times per year. TPIO and the NGDC Enterprise Data Systems Group have begun prototyping a Data Management Dashboard intended to show current values and trends in metrics such as metadata quality and data accessibility.
• Feedback: NOAA personnel and contractors involved in EDM are invited to contact the EDMC and the DMIT regarding successes, failures, lessons learned and suggestions concerning this EDM Framework, EDMC Procedural Directives, and related activities. NOAA data providers can seek and respond to feedback from users. The US Paperwork Reduction Act imposes some limitations on methods for gathering feedback. †
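As an illustration of the kind of metric such a dashboard might track, the sketch below scores metadata completeness as the fraction of required fields that are populated, and data accessibility as the fraction of records carrying an access link. The field list, the scoring rules, and the sample records are assumptions made for this example; the Framework does not define them.

```python
from typing import Dict, List

# Hypothetical required fields; not taken from any NOAA directive.
REQUIRED_FIELDS = ["title", "abstract", "contact", "time_range", "access_url"]


def completeness(record: Dict[str, str]) -> float:
    """Fraction of required fields that are present and non-empty."""
    filled = sum(1 for f in REQUIRED_FIELDS if record.get(f, "").strip())
    return filled / len(REQUIRED_FIELDS)


def summarize(records: List[Dict[str, str]]) -> Dict[str, float]:
    """Aggregate per-record scores into dashboard-style summary values."""
    if not records:
        raise ValueError("at least one metadata record is required")
    scores = [completeness(r) for r in records]
    accessible = sum(1 for r in records if r.get("access_url", "").strip())
    return {
        "mean_metadata_completeness": sum(scores) / len(scores),
        "fraction_accessible": accessible / len(records),
    }


# Made-up sample records, for illustration only.
records = [
    {
        "title": "Dataset A",
        "abstract": "Example abstract",
        "contact": "data.manager@example.org",
        "time_range": "2010/2012",
        "access_url": "https://example.org/a",
    },
    {"title": "Dataset B", "abstract": "", "contact": "data.manager@example.org"},
]
summary = summarize(records)
```

Recomputing such a summary at regular intervals would give the trend lines that a dashboard is meant to display.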
3. The Data Lifecycle
The Data Lifecycle includes all the activities that affect a dataset before and during its lifetime. Different datasets may have somewhat different lifecycles, but this model is intended to be general. The use of the term "lifecycle" includes long-term preservation and is not meant to imply a finite lifetime or limited period of usefulness. We divide lifecycle activities into three groups, as shown in Figure 6:
• Planning and Production, which includes all activities up to and including the moment that an observation is captured by an observing system or data collection project;
• Data Management, which includes all activities related to processing, verifying, documenting, advertising, distributing and preserving data;
• Usage, which includes all activities performed by the consumer of the data (these activities are often outside the direct control of data managers).
* TPIO resides within NESDIS but performs NOAA-wide functions including supporting the NOAA Data Management Architect; serving as Executive Secretariat for the EDMC, NOSC, and DAARWG; and maintaining and analyzing a database of observing systems and requirements. See https://www.nosc.noaa.gov/tpio/.
† http://www.cio.noaa.gov/Policy_Programs/pracust.html
Figure 6: Overview of the Data Lifecycle, showing a decomposition into Planning and Production, Data Management, and Usage activities. The block arrows suggest the normal flow of information from planning towards usage, and the curved arrows indicate that the process may be cyclical, with conceptually "later" activities feeding back to or triggering "earlier" activities.
Figure 7 is a more detailed view of the Data Lifecycle, including all of the activities mentioned in this Section. The Data Lifecycle is a dynamic process rather than a linear sequence. That is, the steps in the lifecycle are not independent, but rather depend on and influence actions taken at other steps. For example, inadequate documentation at an early stage can prevent later use; generation of products from original data may yield new derived data that must also be collected and managed; user feedback regarding data may change or augment the documentation about data. Likewise, because data may go through multiple cycles of use and reuse by different entities for different purposes, effective management of each step, and coordination across steps in the lifecycle, are required to ensure that data are reliably preserved and can be accessed and used efficiently.
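One way to picture this non-linear flow is as a small directed graph in which feedback edges sit alongside the forward ones. The sketch below is illustrative only: the phase names come from the three groups of lifecycle activities, while the Phase enum, the transition table, and the edge labels are hypothetical and simply paraphrase examples given in this section.

```python
from enum import Enum
from typing import List


class Phase(Enum):
    PLANNING_AND_PRODUCTION = "Planning and Production"
    DATA_MANAGEMENT = "Data Management"
    USAGE = "Usage"


# Forward flow plus feedback edges; the labels are illustrative paraphrases.
TRANSITIONS = {
    (Phase.PLANNING_AND_PRODUCTION, Phase.DATA_MANAGEMENT): "observations collected",
    (Phase.DATA_MANAGEMENT, Phase.USAGE): "data distributed",
    (Phase.USAGE, Phase.DATA_MANAGEMENT): "derived products or user feedback",
    (Phase.DATA_MANAGEMENT, Phase.PLANNING_AND_PRODUCTION): "calibration error or gap analysis",
}


def next_phases(phase: Phase) -> List[Phase]:
    """Phases reachable in one step, in the order the edges were declared."""
    return [dst for (src, dst) in TRANSITIONS if src is phase]


def is_feedback(src: Phase, dst: Phase) -> bool:
    """True when an edge points from a conceptually later phase to an earlier one."""
    order = list(Phase)
    return order.index(dst) < order.index(src)


feedback_edges = [edge for edge in TRANSITIONS if is_feedback(*edge)]
```

Half of the edges in this toy graph are feedback edges, which is the sense in which the lifecycle is cyclical rather than a one-way pipeline.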
Figure 7: Activities in the Data Lifecycle. The Data Management Activities block is the focus of this Framework.
A lifecycle data management process ensures that observing systems are based on requirements, that the resulting data are properly stewarded, and that data can be used both for their original purpose and in novel ways.
Each phase of the Data Lifecycle is described in the following sub-sections.
3.1. Planning and Production Activities
The first phase of the Data Lifecycle is Planning and Production activities, which comprise:
These include such tasks as assessing the need and requirements for a new observing system, planning how to meet those requirements and how to manage the resulting data, developing any necessary sensors, deploying the observing system, and operating and maintaining the observing system.
The Planning activity includes preparing for management of the resulting data. The Data Management Planning Procedural Directive (7) requires such planning and provides a template of questions to be considered. This planning should be done before data are collected, but existing projects without adequate plans should also address the issues. The Data Management Plan should be flexible and updated as needed because matters not considered in the original plan, or changes in technology, may emerge as data are acquired, processed, distributed, and archived. Program managers, project leaders, and technical personnel should work together and with NOAA EDM groups to plan data management in ways that maximize data compatibility and reduce overall costs.
The other activities in this phase are largely outside the scope of this EDM Framework, which focuses instead on the management of actual data once observations are collected. Nevertheless, activities that occur later in the Data Lifecycle may influence this phase. For example, a calibration error discovered during quality control may lead to changes in the operating procedure, and gap analysis may reveal new requirements.
3.2. Data Management Activities
The second phase of the Data Lifecycle is Data Management Activities, which include: