Metadata plays a pervasive role in data management, whether it is explicitly managed by the data system itself, or implicitly and procedurally tracked by operators and users of the data system. As data systems have grown more complex, and system interoperability and cross-discipline usage have become more prevalent, metadata has become a critical facet of major oceanographic data systems.
The Shore Side Data System (SSDS) faces the same metadata challenges as any other data system. As described in this overview of MBARI metadata needs (pdf), MBARI must manage and provide a full range of metadata, spanning the entire metadata life cycle (also pdf).
The SIAM software infrastructure and SSDS architecture specifically address metadata life cycle issues, allowing the data system to successfully manage all facets of metadata. This architecture is briefly described below. However, using these metadata capabilities effectively remains a significant challenge.
Having provided an architectural framework for the metadata, we must still define and constrain the vocabulary that system users and operators enter into the framework. Otherwise, a user searching for "temperature" may get too many "Instrument temperature" variables, and not enough "Ocean sea surface temp" variables. The details of this terminology problem are complicated, but you may wish to refer to our Standard Naming Exercise effort to see a few of the challenges, and our attempts to work through them.
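To illustrate the terminology problem, here is a minimal sketch in Python of a controlled vocabulary that maps free-text variable labels onto standard names. The vocabulary contents and the function names (`normalize`, `to_standard_name`, `search`) are hypothetical, chosen only for illustration; they are not part of SSDS or of the Standard Naming Exercise.

```python
from typing import List, Optional

# Hypothetical controlled vocabulary: each standard name maps to the
# free-text variants that operators might actually type.
STANDARD_NAMES = {
    "sea_surface_temperature": {
        "ocean sea surface temp",
        "sea surface temperature",
        "sst",
    },
    "instrument_temperature": {
        "instrument temperature",
        "housing temp",
    },
}


def normalize(label: str) -> str:
    """Lower-case a label and collapse runs of whitespace."""
    return " ".join(label.lower().split())


def to_standard_name(label: str) -> Optional[str]:
    """Return the standard name for a free-text label, or None if unknown."""
    key = normalize(label)
    for standard, variants in STANDARD_NAMES.items():
        if key == standard or key in variants:
            return standard
    return None


def search(variables: List[str], standard_name: str) -> List[str]:
    """Return only the variables whose label maps to the requested name."""
    return [v for v in variables if to_standard_name(v) == standard_name]
```

With a mapping like this, a search for `sea_surface_temperature` returns "Ocean sea surface temp" but not "Instrument temperature", even though both labels contain the word "temperature". Labels missing from the vocabulary map to `None`, which flags them for review rather than guessing.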
For a quick overview of the role of metadata and ontologies in data systems, please see Stephanie Watson's opening presentation to the Marine Metadata Workshop in 2003 (PowerPoint: http://mmug.calfish.org/Marine Metadata.ppt; also available as pdf).
SSDS Metadata Architecture
Metadata relevant to the SSDS resides in several places:
- the data source (the system generating the data);
- the SSDS object model;
- the SSDS relational database;
- and the data files produced.
Data Source Metadata Elements
This section addresses the metadata found in the data source, that is, the system that generates the data. This may be a single instrument, a collection of instruments, or an entire platform or observatory, including software processes. Note that the SSDS was designed to manage data from ocean observatories. It is particularly suited to systems with a metadata-centric design philosophy, such as the Monterey Ocean Observing System (MOOS). This discussion therefore emphasizes that scenario, although SSDS can be applied to other situations as well.
In the data source, metadata elements include information about:
- the data producers (instruments and sensors);
- the data they produce;
- and the processes which occur in the data-generating system.
In the case of MOOS devices, the data producer metadata is stored in PUCKs, which are individually attached to instruments before they are deployed. The metadata is then transmitted through the system to SSDS at appropriate times, as determined by the instrument driver. (For non-MOOS systems, metadata tracking and submission to SSDS must be custom-designed, according to the operations of those systems.) By physically connecting this information to the instrument being described, the MOOS data producers are made self-describing, a key operational feature for large deployments.
On the other hand, life-cycle metadata in MOOS systems -- describing the processes acting upon data sources and data streams -- must be produced by system software. If a system reprocesses other data, it must identify itself as the source of the new data stream (just like any other software-based generator of data). If an instrument or platform is deployed, or its deployment ends, the deployment process must capture this result and send it to the data system. These features must be designed into the user interfaces and system processes of a data system. They are only partially available so far in MOOS, but have been implemented in the AUV-CTD data transfer process, which had manual steps where the necessary metadata could be captured.
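The two life-cycle cases above (a deployment starting or ending, and software reprocessing an existing stream) can be sketched as follows. This is an illustrative Python sketch under our own assumptions: the `Deployment` and `DataStream` classes and their fields are hypothetical and do not reflect the actual SSDS or MOOS data structures.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class Deployment:
    """Hypothetical record of one deployment of a device."""
    device_id: str
    start: datetime
    end: Optional[datetime] = None


@dataclass
class DataStream:
    """Hypothetical description of a data stream and its origin."""
    stream_id: str
    producer_id: str  # an instrument or a software process
    derived_from: List[str] = field(default_factory=list)


def end_deployment(deployment: Deployment, when: datetime) -> Deployment:
    """Close a deployment; refuse to end it before it started."""
    if when < deployment.start:
        raise ValueError("deployment cannot end before it starts")
    deployment.end = when
    return deployment


def reprocess(source: DataStream, process_id: str) -> DataStream:
    """Describe a reprocessed stream: the software process is recorded as
    the producer, and the input stream is recorded as its origin."""
    return DataStream(
        stream_id=f"{source.stream_id}/{process_id}",
        producer_id=process_id,
        derived_from=[source.stream_id],
    )
```

The point of the sketch is that neither record can be read from an instrument. Some piece of software, or a person using a user interface, has to create it at the moment the event occurs.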
As mentioned earlier, one part of the data source's metadata describes the data produced by the source. If the metadata describes the contents of the data in sufficient detail, many powerful services are possible for manipulating and presenting the data. (The OPeNDAP protocols follow a similar model, but their raw data description capabilities are quite limited. On the other hand, their structural descriptions are much more advanced than SSDS has implemented.)
As a matter of bookkeeping, each data packet delivered to SSDS from a MOOS system contains a short header, which references the data producer metadata to use in understanding the packet. Through this header, SSDS can track all data inputs to their sources, and to the descriptions of those sources. Any other data system can similarly package its data, so as to obtain all the services of SSDS (and address multiple data management problems in a complex data delivery system).
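As an illustration of this bookkeeping, the sketch below prefixes each payload with a small header that identifies the producer and the revision of its metadata, then uses that header to look up the description. The header layout, the field names, and the `METADATA` registry are assumptions made for this example; they are not the actual SSDS or MOOS packet format.

```python
import struct
from typing import Dict, NamedTuple, Tuple

# Hypothetical header layout (not the real SSDS wire format):
# device id (uint32), metadata revision (uint16), payload length (uint16),
# all big-endian with no padding.
HEADER = struct.Struct(">IHH")


class Header(NamedTuple):
    device_id: int
    metadata_rev: int
    payload_len: int


def pack_packet(device_id: int, metadata_rev: int, payload: bytes) -> bytes:
    """Prefix a payload with a header that names its metadata description."""
    return HEADER.pack(device_id, metadata_rev, len(payload)) + payload


def unpack_packet(packet: bytes) -> Tuple[Header, bytes]:
    """Split a packet into header and payload, checking the length."""
    header = Header(*HEADER.unpack_from(packet))
    payload = packet[HEADER.size:HEADER.size + header.payload_len]
    if len(payload) != header.payload_len:
        raise ValueError("truncated packet")
    return header, payload


# Hypothetical registry of producer metadata, keyed by (device id, revision).
METADATA: Dict[Tuple[int, int], dict] = {
    (1301, 2): {"name": "CTD", "fields": ["temperature", "conductivity"]},
}


def describe(packet: bytes) -> dict:
    """Look up the producer metadata that the packet header points to."""
    header, _ = unpack_packet(packet)
    return METADATA[(header.device_id, header.metadata_rev)]
```

Keying the registry on both the device and a metadata revision is one way to let old packets keep pointing at the description that was current when they were produced.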
The SSDS Object Model and Relational Database
These two SSDS components are presented together because they are designed as a matched set. Each of them captures the SSDS data model, according to the object-oriented or relational framework being used. The two formats are used to get the advantages of each: object-oriented development and memory access in the one case, and availability of mature, open-source data storage in the other.
Data is mapped between the two components using a general-purpose persistence architecture. (We use OJB, or ObjectRelationalBridge, to persist Java objects to relational databases. We are considering a switch to CMP, container-managed persistence.) At least in theory, this approach gives us flexibility in rearchitecting our Java object model, without having to write or refactor code to persist new and different objects.
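The idea behind a general-purpose persistence layer can be shown with a deliberately small sketch. It is written in Python with the standard `sqlite3` module rather than Java and OJB, and it is not how OJB works internally. The `Device` class is hypothetical. The sketch only shows that generic code can derive the table layout from the class definition, so that adding a new class requires no new persistence code.

```python
import sqlite3
from dataclasses import astuple, dataclass, fields
from typing import List, Type, TypeVar

T = TypeVar("T")


@dataclass
class Device:
    """Hypothetical metadata object; not the real SSDS class."""
    id: int
    name: str
    manufacturer: str


def table_name(cls: type) -> str:
    """Derive the table name from the class name."""
    return cls.__name__.lower()


def create_table(conn: sqlite3.Connection, cls: type) -> None:
    """Create one table per class, with one column per dataclass field."""
    cols = ", ".join(f.name for f in fields(cls))
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table_name(cls)} ({cols})")


def save(conn: sqlite3.Connection, obj: object) -> None:
    """Insert any dataclass instance without class-specific SQL."""
    marks = ", ".join("?" for _ in fields(obj))
    conn.execute(
        f"INSERT INTO {table_name(type(obj))} VALUES ({marks})",
        astuple(obj),
    )


def load_all(conn: sqlite3.Connection, cls: Type[T]) -> List[T]:
    """Rebuild objects of the given class from their rows, in insert order."""
    cols = ", ".join(f.name for f in fields(cls))
    rows = conn.execute(
        f"SELECT {cols} FROM {table_name(cls)} ORDER BY rowid"
    )
    return [cls(*row) for row in rows]
```

A production mapping layer also has to handle relationships between objects, schema changes, and transactions, none of which this sketch attempts.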
The object model used by SSDS incorporates all the metadata concepts outlined in the Data Source Metadata Elements section above. SSDS Metadata Overview describes the components of this data model, which are populated by various ingest processes in SSDS.
About FGDC and Other Metadata Models
The Federal Geographic Data Committee produced a set of metadata standards for data collected by federal agencies. The Content Standard for Digital Geospatial Metadata, in particular, is widely used, and data management systems should expect to support this standard when making their data available. (Such support is planned, but not yet implemented, for SSDS.) These standards have not been selected as a basis for development of SSDS data management software, because they are difficult to work with computationally, and somewhat more geospatially- and file-oriented than was desirable for the SSDS.
Rather, the SSDS data model has more in common with the SensorML metadata model, a metadata framework created under the OpenGIS Consortium. While SSDS is not based on SensorML, we are evaluating its suitability and hope to find a route to compatibility.
Metadata in Data Files
Generally, when data files are presented or transported, it is strongly desirable to include descriptive metadata in the file. The netCDF file format is one common format which supports this capability. Because using embedded metadata as the primary reference limits architectural flexibility, we've chosen to maintain the primary metadata within the relational database and object model, to take full advantage of those referential capabilities. However, we intend to provide metadata as part of any served data file, wherever the requested file format supports it.
Metadata Description Format
We describe our metadata (which in turn describes our data) using XML, the Extensible Markup Language. We have a metadata schema, which is in use but still evolving. Our software populates the data model (and eventually the relational database) by reading the XML files which have been stored in the instrument PUCKs and transmitted to SSDS.
XML files can also be created by hand and submitted manually, or can be created automatically in some cases. We are working on applications to make the development of the metadata descriptions -- containing instruments, data formats, and processes -- as easy as possible.
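As a rough illustration of reading an XML metadata description into an object model, the sketch below parses a small document with Python's standard `xml.etree.ElementTree` module. The element and attribute names (`Device`, `Output`, `Variable`, `units`) are invented for this example, since the actual SSDS metadata schema is not reproduced here, and the `Device` and `Variable` classes are likewise hypothetical.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List

# Hypothetical XML layout; the real SSDS metadata schema differs.
EXAMPLE_XML = """
<Device id="1301" name="CTD">
  <Output name="ctd-serial">
    <Variable name="temperature" units="degC"/>
    <Variable name="conductivity" units="S/m"/>
  </Output>
</Device>
"""


@dataclass
class Variable:
    name: str
    units: str


@dataclass
class Device:
    id: int
    name: str
    variables: List[Variable]


def parse_device(xml_text: str) -> Device:
    """Build a Device object from a metadata document."""
    root = ET.fromstring(xml_text)
    if root.tag != "Device":
        raise ValueError(f"expected <Device>, got <{root.tag}>")
    # Collect every Variable element, in document order.
    variables = [
        Variable(v.attrib["name"], v.attrib.get("units", ""))
        for v in root.iter("Variable")
    ]
    return Device(int(root.attrib["id"]), root.attrib["name"], variables)
```

Calling `parse_device(EXAMPLE_XML)` yields a `Device` with two variables. The same function works whether the XML was read from an instrument or written by hand.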