Data/Metadata


How Metadata affects QC - Julie Bosch

MMI Project - Notes about conference from Aug 2005. OWL-based ontologies.

COTS/ONR Project - …

QARTOD 1 - Dec 2003. QA/QC flags to be defined. Test cases for flags. Pros and cons of existing standards.

QARTOD 2 - Feb-Mar 2005. Discipline-specific metadata: waves, in-situ currents, remote currents.

POST QARTOD 2 - How to fit QC data into FGDC format?

Salinity Workshop - Aug 2005. QC for real-time salinity measurements. Drafted metadata record example of salinity data attributes.

Waves - Nov 2005. Significant advancements on QC requirements and recommendations. Identify best practices for continuing discipline scientific approach.

Katrina Analogy - Damaged bridge = gaps in scientific community. Build gaps to create fluid workflow…

11:00am

Why are ontologies important? - Luis Bermudez

Wikipedia definition of ontology. Wikipedia is not that reliable of a source, but it’s
good enough. Next is a longer, more philosophical definition. Too long to repeat here.

Keywords from definition: Specific Purpose of Practical Difference

Example of Google Directory - Practical use of organization

Specification of conceptualizations. Ex. lake vs river. Each has properties: body of water… similarities identified.
Concepts are created and expressed as a class: Body of water, Lake, River
Classes are related.

Properties of class relations: isPartOf, isTransitive
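
To make the class idea concrete for myself, here is a rough Python sketch of the lake/river example. It is only a toy model under my own assumptions, not OWL: the dictionaries, the function names, and the `cove`/`watershed` instances are made up for illustration. It shows classes related by a subclass link, and an `isPartOf` relation treated as transitive so that inferred relations fall out of the direct ones.

```python
# Toy model of the lake/river example. All names here are hypothetical.

# Direct subclass links: each class points at its parent class.
SUBCLASS_OF = {
    "Lake": "BodyOfWater",
    "River": "BodyOfWater",
}

# Direct isPartOf statements (illustrative, not taken from the talk).
IS_PART_OF = {
    "cove": "lake",
    "lake": "watershed",
}


def is_a(cls, ancestor):
    """Walk subclass links upward until ancestor is found or links run out."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = SUBCLASS_OF.get(cls)
    return False


def part_of_closure(thing):
    """List everything `thing` is part of, treating isPartOf as transitive."""
    found = []
    current = IS_PART_OF.get(thing)
    while current is not None and current not in found:
        found.append(current)
        current = IS_PART_OF.get(current)
    return found
```

Here `part_of_closure("cove")` picks up `watershed` even though only `cove -> lake` and `lake -> watershed` were stated directly, which is the direct-versus-inferred distinction in miniature.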

Why use ontologies? Share common understandings, for software agents. Enable reuse of domain knowledge. Make domain assumptions explicit.

Why part of QARTOD? - quality levels, flags, sensors, instrument methodology, calibration procedures, QC software, validation and verification methods, etc.
Can we map two different QC codesets?

Semantic issues - direct relations and inferred relations.

Use OWL format to build ontologies. Web Ontology Language. Based on RDF.. blah blah blah

How to convert to OWL? created tool called VOC2OWL. Java-based input form takes values and performs conversion. Other tools: Protege.

VINE - Vocabulary Integration Environment (oh, that’s what it stands for!). Mapping relations… this is sooooo MMI.
Notes about MMI Conference. Mapping results…

MMI Website

11:30am

Best Practices Workshop on Salinity - Jim Boyd

- stick with small group - too many people causes confusion, requires “educating” across topical areas
- specifically defined outcome
- breakout rooms for each topical area
- reconvene in plenary to share

(wahoo, my first attempt at blogging!)

mini “dictionary” for blog:
attribute - word used to describe a measurement for the Ecological Metadata Language. ex. temperature and chlorophyll-a are attributes (disclaimer: future naming conventions may change the way ‘temperature’ and ‘chlorophyll-a’ are written/displayed)
unit - a measurement standard used by EML to describe attribute. ex. the attribute ‘depth’ might have units of ‘meters’

________ - word used to represent the attribute-unit pair

________ - word used to describe the type of measurement (SI page uses ‘quantity’)

The SI standard (as maintained by the international BIPM) defines a limited number of units and the types of measurements they describe. In EML, these measurement quantities are called ‘unitType’s. The unitTypes are the links between the units and the attributes.

Example: Temperature is a unitType as well as an EML attribute; the three units that describe this quantity are Celsius, Kelvin and Fahrenheit. Length is a unitType with many describing units, including meter, foot, fathom, etc. Many attributes are quantities of length, including swell height, depth, and distance. All length attributes will be measured with one of the units under the unitType length.

The parentSI is the organization-prescribed main unit for each unitType. All units of each unitType must be mathematically related to the parentSI unit, and thus indirectly related to all other units of the same type.

Example: The three units under unitType temperature are Celsius, Kelvin and Fahrenheit. The parentSI unit is Kelvin, so the other two are defined by their mathematical relation to Kelvin. The parentSI unit of length is the meter; all other units of unitType length are defined by their relationship to the meter.

The reason for the parentSI units is a fundamental one: physical standards stand behind each parentSI unit, so measurements can be definitively reproduced, and that reproducibility is the key to the scientific method. The meter was originally defined by a physical object (it has since been redefined in terms of the speed of light), and that standard is what is used to create instruments of measurement. Similarly, there is a weight kept in a secure place that is THE kilogram. Since these physical standards are confined to a limited set of units, all other units must relate back to the physically defined ones, hence using the parentSI as the means of conversion rather than a more indirect method.

SO, to define a unit, we need: unitName, unitType, and a multiplier or constant to SI. Abbreviations and descriptions also come in handy on a user level, though they are not necessary to the definition/conversion process.
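
Here is a minimal Python sketch of that definition, mostly to check my own understanding. The `Unit` class, its field names, and the `convert` helper are my own assumptions for illustration, not EML's actual schema. Every unit stores a multiplier and a constant relating it to its parentSI unit, and any conversion goes through the parent rather than directly between units.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    # Field names are hypothetical; they mirror unitName, unitType,
    # and the multiplier/constant to the parentSI unit.
    name: str
    unit_type: str
    multiplier: float      # scale factor to the parentSI unit
    constant: float = 0.0  # offset added after scaling

    def to_parent(self, value):
        """Convert a value in this unit to the parentSI unit."""
        return value * self.multiplier + self.constant

    def from_parent(self, value):
        """Convert a value in the parentSI unit back to this unit."""
        return (value - self.constant) / self.multiplier


UNITS = {
    u.name: u
    for u in [
        Unit("kelvin", "temperature", 1.0),
        Unit("celsius", "temperature", 1.0, 273.15),
        Unit("fahrenheit", "temperature", 5.0 / 9.0, 459.67 * 5.0 / 9.0),
        Unit("meter", "length", 1.0),
        Unit("foot", "length", 0.3048),
        Unit("fathom", "length", 1.8288),
    ]
}


def convert(value, source, target):
    """Convert between two units of the same unitType via the parentSI unit."""
    src, dst = UNITS[source], UNITS[target]
    if src.unit_type != dst.unit_type:
        raise ValueError(f"cannot convert {src.unit_type} to {dst.unit_type}")
    return dst.from_parent(src.to_parent(value))
```

With this layout, adding a new length unit only needs its relation to the meter; `convert(1, "fathom", "foot")` works without anyone ever writing down a fathom-to-foot factor.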

Roy Mendelssohn (Env Research Division, SWFisheries Center, Pacific Grove) gave an informatics seminar titled “Data Integration and Interoperability in PACOOS, in NOAA, in IOOS and Beyond” on 18 August 2005 in the NOAA Southwest Fisheries Science Center conference room. Speaking as a domain scientist engaged in data analysis work and participating in ongoing national data committees, he provided insights drawing on recent experience putting together a working data system model in a short period of time.

Some informatics extensibility todo’s
-make rdb spatially enabled
-make metadata fgdc compliant
-make semantic & syntactic metadata conform to a given std
-take steps now to ease implementation/federation
-participate so as to influence final requirements

Some unresolved issues
-how to stop people from misusing data
-how to balance serving products versus data
-how to balance data release versus data quality
-where to do the heavy lifting so as to build in flexibility: local system or transport layer

Some notes and notions
-renaming: NMF->NOAA Fisheries->National Fisheries Service -> One NOAA
-DB systems have data and metadata separated while netcdf/hdf are self-describing file formats that have them reside together.
-data redundancy as strategy for serving data in different ways
-dimension information; HDF does not allow you to share dimensions in viewing cruise data, so you must choose a view, ie all data at one station or a look across stations

Three taking-back steps to avoid
-focusing on maps (describing) rather than data (analyzing)
-focusing on 2D (GIS) rather than 4D (netcdf space and time)
-using data structures that have no scientific meaning (ie polygon or vector in GIS where no one collects a polygon)

Interoperability involving 3 interrelated issues:
-have the DATA
-describe the data with METADATA
-produce something that works DATA TRANSPORT

Since categories matter, note that DMAC has 6 categories of expert teams
http://dmac.ocean.us/dacsc/about_steering.jsp

  • standards process
  • archive
  • sys eng/enterprise architecture
  • modeling
  • metadata & discovery
  • data transport and access

I was a member of the PaCOOS mapping team, taking on the role of “Tools Specialist”. I was responsible for working with VINE, a Java application used to perform the mappings. PaCOOS datasets are similar to CTD datasets; both seem to share many terms. Unlike the other groups, which focused on biological variables, we focused mainly on physical variables: latitude, longitude, temperature, depth, date, time, etc.
Some observations:

  1. Latitude and Longitude variables can be stored as a decimal, or a union of degrees, minutes, and seconds. We created a new relationship “unionof_#_of_n” where n is the total number of elements in the union. This relationship helped us map longitudes and latitudes that differed in the two stored formats.
  2. Salinity may be a “proxy” for Conductivity. Depth may be a “proxy” for Pressure. Though we detected the pseudo-relations in those pairs of variables, we did not capture those relations in our mapping file during the workshop.
  3. Date/Time is a beast that is broken into 6 individual components: year, month, day, hour, minute, second. After much debate on how these components are related, we came to the conclusion that each component is a class, and that the classes are organized in hierarchical fashion. Year is the super class, Month inherits from Year, Day from Month, and so on with Second the bottom child class. This hierarchy marks an order of significance: It is possible some datasets may only contain a year variable. Others may only contain year and month. We made the assumption that no dataset contains a single date/time component without those higher in the hierarchy. Though we detected how various date and time variables may map to each other, it was not intuitive (or possible) to accomplish this mapping using the VINE tool alone.
    It is possible to map each term to one of the 6 classes. For instance, a date variable with the format yyyy-mm-dd maps to the Day class. Likewise, a time variable with the format hh:mm:ss maps to the Second class. Further, should the Second class be “narrowerThan” the Day class because it specifies a higher precision?
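
The date/time hierarchy from observation 3 is easier to reason about when written out. The Python sketch below is only one possible reading, under my own assumptions: the format-string table and the function names are hypothetical, and it takes the "higher precision means narrowerThan" answer to the open question above as a working guess rather than a settled decision.

```python
# The six classes, ordered from super class (Year) to bottom child (Second).
HIERARCHY = ["Year", "Month", "Day", "Hour", "Minute", "Second"]

# Hypothetical format strings mapped to the finest component each one carries.
FORMAT_TO_CLASS = {
    "yyyy": "Year",
    "yyyy-mm": "Month",
    "yyyy-mm-dd": "Day",
    "hh:mm:ss": "Second",
}


def finest_class(fmt):
    """Return the class a variable with this format maps to."""
    return FORMAT_TO_CLASS[fmt]


def is_narrower_than(a, b):
    """True if class a sits lower in the hierarchy (finer precision) than b."""
    return HIERARCHY.index(a) > HIERARCHY.index(b)


def required_components(cls):
    """Components assumed present in a dataset: the class plus all above it."""
    return HIERARCHY[: HIERARCHY.index(cls) + 1]
```

Under this reading a `yyyy-mm-dd` variable maps to `Day`, `required_components("Day")` encodes our assumption that year and month come along with it, and `is_narrower_than("Second", "Day")` returns `True`.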

Notes from our sessions can be viewed here.

Ontology Observation

I agree there will never be a definitive global ontology that covers everything. David Remsen cited modern biology during his keynote, explaining how the classification system is flawed because the same species may have multiple scientific names. A few reasons for this:

  1. Some species names change over time
  2. Some scientists assign new names without realizing older ones exist
  3. Scientists may disagree on the class of a specific species, resulting in that species being stored in two separate places and thus having 2 separate paths through the biological taxonomic tree.

It is a tedious and on-going task to keep up to date with all possible names for a single species. Further, with constant new discoveries and disagreements between scientists, it is inherently difficult (and near impossible?) to maintain a definitive ontology.

MMI focuses on merging some terms between ontologies, and mapping other terms across ontologies, but not on merging all ontologies together into one global source. If we can provide search engines with enough information of how separate vocabularies are related to each other, we eliminate the need for a single definitive source.

CTD Data Merging

On Wednesday, Cyndy gave a talk on merging CTD data (back) together to provide scientists with a convenient way to retrieve data with a single integrated search rather than multiple brute searches. She stressed how important it is for scientists to capture metadata in addition to data to aid the integration process. The problem is that scientists tend to skip capturing the metadata because they already remember everything themselves. The task is tedious and feels like an unnecessary waste of time. Unfortunately, it becomes more difficult to refer to data in the future without its proper metadata. This includes physical variables such as date, time, and location, but should also include more granular information such as cruise event, CTD cast, and bottle number.

Roy Lowry commented on how “nightmareish” it is to splice datasets together keying off of depths, particularly when some scientists adjust depth values in their datasets to correct for offsets while others leave their depth values unchanged. It would be much easier if all scientists recorded the bottle number for each CTD cast.
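
A small Python sketch of why the bottle number helps. Everything in it is hypothetical: the record layout, the field names, and the sample values are made up for illustration. It splices two record lists on the (cast, bottle) pair, so the join still works when the two sources disagree on depth.

```python
def splice(left, right):
    """Join two lists of records on (cast, bottle), ignoring depth entirely.

    Where both records carry the same field, the value from `left` wins.
    """
    index = {(rec["cast"], rec["bottle"]): rec for rec in right}
    merged = []
    for rec in left:
        match = index.get((rec["cast"], rec["bottle"]))
        if match is not None:
            merged.append({**match, **rec})
    return merged


# Made-up sample records: same cast and bottle, but the depths differ
# because one source corrected for an offset and the other did not.
ctd = [{"cast": 1, "bottle": 3, "depth": 50.0, "temperature": 12.1}]
chem = [{"cast": 1, "bottle": 3, "depth": 51.5, "nitrate": 4.2}]
spliced = splice(ctd, chem)
```

Keying on depth would have missed this match (50.0 versus 51.5), while the bottle key lines the two records up without any tolerance guessing.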

I attended the Marine Metadata Interoperability Project workshop in Boulder, CO last week. The MMI Project has long-term goals that include merging scientific data together from various sources to ease data querying using keywords across separate ontologies. Currently there exist many datasets, each with its own set of vocabulary words (for attributes, units, etc). Searching for data in one dataset may require different vocabulary keywords than those used to search a second dataset. Users are limited when they do not know all the possible keywords for a data search. MMI aims to alleviate this problem by mapping and merging like terms spanning various ontologies, ultimately providing more power to search engines by recognizing relationships between ontologies and returning more results.

We were separated in different domain groups for the workshop:

  • CTD
  • Waves and Currents
  • Chlorophyll
  • Sensors
  • Benthic Habitat
  • PaCOOS

Each team mapped terms across ontologies that were specific to that team’s domain. The three basic relations used in the mapping process were ‘sameAs’, ‘narrowerThan’, and ‘broaderThan’. All of the ontology and mapping files are stored as OWL documents, and the mapping files were made available online at the end of the workshop. This initiative is a hopeful stepping stone for future mapping work and crosswalks between ontologies.
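
To show how those three relations could give a search engine more power, here is a rough Python sketch. It is not how VINE or the workshop's OWL mapping files work; the triples, the vocabulary prefixes, and the function names are all made up for illustration. A query term is expanded to its ‘sameAs’ equivalents plus anything mapped as narrower than them.

```python
# Each relation read in the opposite direction.
INVERSE = {
    "sameAs": "sameAs",
    "narrowerThan": "broaderThan",
    "broaderThan": "narrowerThan",
}

# (subject, relation, object) triples. Terms and prefixes are hypothetical.
MAPPINGS = [
    ("vocabA:sea_water_temperature", "sameAs", "vocabB:TEMP"),
    ("vocabB:SST", "narrowerThan", "vocabB:TEMP"),
]


def related(term, relation):
    """Return every X such that `term relation X` holds, in either direction."""
    out = set()
    for subj, rel, obj in MAPPINGS:
        if subj == term and rel == relation:
            out.add(obj)
        if obj == term and INVERSE[rel] == relation:
            out.add(subj)
    return out


def expand_query(term):
    """Expand a term to its sameAs equivalents plus their narrower terms."""
    equivalents = {term}
    frontier = [term]
    while frontier:
        current = frontier.pop()
        for other in related(current, "sameAs"):
            if other not in equivalents:
                equivalents.add(other)
                frontier.append(other)
    expanded = set(equivalents)
    for current in equivalents:
        # `current broaderThan X` means X is a narrower term worth searching.
        expanded |= related(current, "broaderThan")
    return expanded
```

Searching on `vocabA:sea_water_temperature` would then also hit data labelled `vocabB:TEMP` or `vocabB:SST`, without the user needing to know either vocabulary.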
