I was a member of the PaCOOS mapping team, taking on the role of “Tools Specialist”. I was responsible for working with VINE, a Java application used to perform the mappings. PaCOOS datasets are similar to CTD datasets; both seem to share many terms. Unlike the other groups, which focused on biological variables, we focused mainly on physical variables: latitude, longitude, temperature, depth, date, time, etc.
Some observations:

  1. Latitude and longitude variables can be stored as a decimal, or as a union of degrees, minutes, and seconds. We created a new relationship “unionof_#_of_n” where n is the total number of elements in the union. This relationship helped us map longitudes and latitudes that were stored in the two different formats.
  2. Salinity may be a “proxy” for Conductivity. Depth may be a “proxy” for Pressure. Though we detected the pseudo-relations in those pairs of variables, we did not capture those relations in our mapping file during the workshop.
  3. Date/Time is a beast that is broken into 6 individual components: year, month, day, hour, minute, second. After much debate on how these components are related, we came to the conclusion that each component is a class, and that the classes are organized in a hierarchical fashion. Year is the super class, Month inherits from Year, Day from Month, and so on, with Second as the bottom child class. This hierarchy marks an order of significance: it is possible some datasets may only contain a year variable. Others may only contain year and month. We made the assumption that no dataset contains a single date/time component without those higher in the hierarchy. Though we detected how various date and time variables may map to each other, it was not intuitive (or possible) to accomplish this mapping using the VINE tool alone.
    It is possible to map each term to one of the 6 classes. For instance, a date variable with the format yyyy-mm-dd maps to the Day class. Likewise, a time variable with the format hh:mm:ss maps to the Second class. Further, should the Second class be “narrowerThan” the Day class because it specifies a higher precision?
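
Observations 1 and 3 can be sketched in a few lines of Python. This is only an illustration of the reasoning, not how VINE represents mappings: the function names, the sign convention (south and west negative), and the list of format strings are all assumptions made for the example.

```python
# Illustrative only: these names are hypothetical and do not come from VINE.

def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert a degrees/minutes/seconds union into one signed decimal value."""
    if hemisphere not in ("N", "S", "E", "W"):
        raise ValueError(f"unknown hemisphere: {hemisphere!r}")
    value = degrees + minutes / 60 + seconds / 3600
    # Assumed sign convention: south and west are negative.
    return -value if hemisphere in ("S", "W") else value


# Date/time classes from least to most precise, following the hierarchy above.
PRECISION_ORDER = ["Year", "Month", "Day", "Hour", "Minute", "Second"]

# Assumed format strings, each mapped to the finest component it carries.
FORMAT_TO_CLASS = {
    "yyyy": "Year",
    "yyyy-mm": "Month",
    "yyyy-mm-dd": "Day",
    "hh:mm": "Minute",
    "hh:mm:ss": "Second",
}


def is_narrower_than(a, b):
    """True if class a sits lower in the hierarchy (is more precise) than class b."""
    return PRECISION_ORDER.index(a) > PRECISION_ORDER.index(b)


print(dms_to_decimal(121, 45, 0, "W"))  # -121.75
print(is_narrower_than(FORMAT_TO_CLASS["hh:mm:ss"], FORMAT_TO_CLASS["yyyy-mm-dd"]))  # True
```

Under this ordering Second would count as narrower than Day. Whether the mapping file should actually assert that relation is the open question raised in observation 3.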

Notes from our sessions can be viewed here.

Ontology Observation

I agree there will never be a definitive global ontology that covers everything. David Remsen cited modern biology during his keynote, explaining how the classification system is flawed because the same species may have multiple scientific names. A few reasons for this:

  1. Some species names change over time
  2. Some scientists assign new names without realizing older ones exist
  3. Scientists may disagree on the class of a specific species, resulting in that species being stored in two separate places and thus having two separate paths through the biological taxonomic tree.

It is a tedious and ongoing task to keep up to date with all possible names for a single species. Further, with constant new discoveries and disagreements between scientists, it is inherently difficult (and nearly impossible?) to maintain a definitive ontology.

MMI focuses on merging some terms between ontologies, and mapping other terms across ontologies, but not on merging all ontologies together into one global source. If we can provide search engines with enough information about how separate vocabularies are related to each other, we eliminate the need for a single definitive source.
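
As a rough sketch of that idea, a search tool could expand a query term through the recorded mappings instead of consulting one global vocabulary. Everything here is hypothetical: the term names, the "sameAs" relation label, and the triple layout are assumptions for illustration, not the MMI mapping format.

```python
from collections import defaultdict

# Hypothetical mappings between two vocabularies (illustrative terms only).
MAPPINGS = [
    ("pacoos:temperature", "sameAs", "ctd:temp"),
    ("pacoos:depth", "sameAs", "ctd:depth_m"),
]


def build_index(mappings):
    """Index sameAs links in both directions so either term finds the other."""
    index = defaultdict(set)
    for subject, relation, obj in mappings:
        if relation == "sameAs":
            index[subject].add(obj)
            index[obj].add(subject)
    return index


def expand_query(term, index):
    """Return the term plus every term reachable through sameAs links."""
    seen = {term}
    stack = [term]
    while stack:
        current = stack.pop()
        for neighbour in index.get(current, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen


index = build_index(MAPPINGS)
print(sorted(expand_query("pacoos:temperature", index)))
```

A search for one vocabulary's temperature term would then also match records tagged with the other vocabulary's equivalent, without either vocabulary having to absorb the other.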

CTD Data Merging

On Wednesday, Cyndy gave a talk on merging CTD data (back) together to provide scientists with a convenient way to retrieve data with a single integrated search rather than multiple brute-force searches. She stressed how important it is for scientists to capture metadata in addition to data to aid the integration process. The problem is that scientists tend to skip capturing metadata because they already remember the details themselves. The task is tedious and feels like a waste of time. Unfortunately, it becomes more difficult to refer to data in the future without its proper metadata. This metadata includes physical variables such as date, time, and location, but should also include more granular information such as cruise event, CTD cast, and bottle number.

Roy Lowry commented on how “nightmarish” it is to splice datasets together keying off depths, particularly when some scientists adjust depth values in their datasets to correct for offsets while others leave their depth values unchanged. It would be much easier if all scientists recorded the bottle number for each CTD cast.
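
A minimal sketch of why the bottle number helps, assuming each record carries hypothetical "cruise", "cast", and "bottle" fields. The sample values are made up for illustration; note that the two depths disagree, so a join on depth would find no match.

```python
def merge_by_bottle(left, right):
    """Join two lists of sample records on (cruise, cast, bottle)."""
    def key(record):
        return (record["cruise"], record["cast"], record["bottle"])

    right_by_key = {key(r): r for r in right}
    merged = []
    for rec in left:
        match = right_by_key.get(key(rec))
        if match is not None:
            # On conflicting fields (such as depth_m) the left record wins.
            merged.append({**match, **rec})
    return merged


# Made-up records: one dataset has a corrected depth, the other does not.
salinity = [{"cruise": "C1", "cast": 3, "bottle": 7, "depth_m": 100.0, "salinity": 33.9}]
nutrients = [{"cruise": "C1", "cast": 3, "bottle": 7, "depth_m": 101.5, "nitrate": 22.1}]

print(merge_by_bottle(salinity, nutrients))
```

The join succeeds even though the depths differ by 1.5 m, because the key does not depend on any value that a scientist might have adjusted.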