Data/Metadata


In the midst of the ever-evolving database design process, I am starting to collect ADCP tidbits to better inform the future addition of this dataset to any site databases. The following notes are random and unorganized, but most will hopefully help advance the general understanding of these more complex data:

NOTE: The ADCP datatypes that will most concern us will be bottom-mounted, mooring-mounted (upward or downward looking), and shipboard. (Other types include ROV-mounted instruments, instruments side-mounted on oil rigs, etc.) The following notes apply ONLY to shipboard ADCP data.

- Each ship has its own quirks, an understanding of which is needed to successfully process the data from each instrument.
New Horizon - instrument: Teledyne/RDInstruments Ocean Surveyor Broadband/Narrowband 150kHz ADCP
acquisition: PC with RDInstruments VmDas Version 1.42 software

- General: Teri Chereskin has been involved in all aspects of the CalCOFI ADCP data, from designing the instrument profiling scheme, to setting up the instrument pre-cruise, pulling data off of the collection computer post-cruise, transport, processing, any post-processing needed, and archiving. She maintains an independent ADCP database on her systems. I am trying to learn the ropes to take over some of those responsibilities for the CCE-LTER cruises. Teri’s webpage with documentation, ADCP data websites and other info is here:
http://tryfan.ucsd.edu/adcp/adcp.htm

- Data Size: The CalCOFI NH0604 cruise ADCP data is 2.9 GB, in 593 files across 5 different formats.

- Formats: .ENR - single ping raw data in beam coordinates (binary)
.ENS - serial info added at computer level (ie. with GPS brought in), beam coordinates with extra info (binary)
.ENX - in earth coordinates, calculated from internal header information (binary)
.N1R - GPS data (ascii)
.N2R - Ashtech data (ascii)

I believe that the ENS and ENX formats can be re-created by using the ENR and N1R, N2R files? I do not know the details yet of this nor of the CalCOFI final/reporting formats.
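
As a rough aid for taking inventory of a cruise directory, the extensions listed above can be kept in a small lookup table. This is only a sketch: the function name `inventory` and the sample file names are made up for illustration, and the descriptions are simply copied from the format notes above.

```python
from collections import Counter
from pathlib import PurePath

# Extension -> description, copied from the format notes above.
ADCP_FORMATS = {
    ".ENR": "single ping raw data in beam coordinates (binary)",
    ".ENS": "beam coordinates plus serial info such as GPS (binary)",
    ".ENX": "earth coordinates (binary)",
    ".N1R": "GPS data (ascii)",
    ".N2R": "Ashtech data (ascii)",
}

def inventory(filenames):
    """Count files per known ADCP format; anything else is counted as 'other'."""
    counts = Counter()
    for name in filenames:
        ext = PurePath(name).suffix.upper()
        counts[ext if ext in ADCP_FORMATS else "other"] += 1
    return counts

# Hypothetical file names, for illustration only.
sample = ["a001.ENR", "a001.ENS", "a001.ENX", "a001.N1R", "a002.enr", "notes.txt"]
counts = inventory(sample)
for ext, n in sorted(counts.items()):
    print(ext, n, ADCP_FORMATS.get(ext, "unrecognized"))
```

Running something like this over the real cruise directory would be a quick check that the file count and the number of formats match what is noted under Data Size.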

- Transport: Mark Ohman purchased a LaCie d2 Hard Drive Extreme for moving data from the ship to SIO. I originally formatted the drive for the PC in NTFS, but should possibly re-format to FAT32? The NTFS format was readable on my Mac OS X, and was compatible with Teri’s system (type?). Teri has a disk on coast which can be directly connected with the external hard drive in the CCS server room.

more later…

What is a Dictionary?

Collecting words and their definitions into dictionaries is the work of lexicography. Funk and Wagnalls Standard Dictionary of language specifies the meaning of the word ‘dictionary’ as ‘1. A reference work containing alphabetically arranged words together with their definitions, pronunciations, etymologies, etc. 2. A lexicon whose words are given in one language together with their equivalents in another. 3. A reference work containing information relating to a special branch of knowledge and arranged alphabetically.’

A research science dictionary is 1. A reference work containing a collection of terms that are used in a scientific community along with information that is required to understand each term. 2. A reference work that prescribes a standard for the community language. 3. A reference work written to help translate terms between texts and languages (e.g., from a journal to a computer processing program). While a standard language dictionary helps a reader understand an unfamiliar word by relating it to information categorizing that term specifically (internal), a scientific dictionary helps researchers understand and utilize data collected elsewhere by defining terms both internally and externally in the context of the community. The definition of a term may be dependent on any combination of the following features:

Internal
Human Usage: Abbreviations, formal names and publication preferences must all be taken into account in the definition of a term so that it can be widely recognized.
Standards: Standards to which a term relates should be described fully in a dictionary. For example, measurement terms will often refer to International System of Units (SI) standard, so a unit dictionary will include definitions that relate a measurement back to the parent SI unit of the same unit type.
History: A dictionary can also bridge technology leaps and changes in community practice. For instance, previously used data processing programs might have needed one set of information while current programs use another, or changes in data collection from human-gathered to instrument-collected can cause language barriers to data comparisons. A dictionary, in providing language used for all cases, can provide back-compatibility to datasets that might not otherwise be useable.

External
Community Culture: (see Databits article: “Designing a Dictionary Process: Site and Community Dictionaries”) “Although names and their definitions are seemingly mundane and even trivial concepts, this does not mean that the articulation, exchange, and blending of unit and attribute names are simple matters. Names go to the heart of local work practices and of data interoperability.” Local nicknames propagate through work practices and become standard within that community; recognizing and including both local and intra-community culture as part of a dictionary creates a human-accessible document for translation between groups of people.
Computer Usage: In this age of rapidly increasing technological power, computers are taking over parts of data analysis previously performed by humans. To do this, the computer and specifically any programs need to know many things about the data, for example if they are binary or ASCII, string or integer, etc. A dictionary, as Funk and Wagnalls noted, is a tool to translate from one language into another, in this case from human-accessible data into programming terms for automated computations.
Technology Infrastructure: Database software, analysis software and programs themselves all need different types of descriptors in order to run efficiently, and to allow the greatest access, search and display features. A dictionary can provide many types of technological information to facilitate cross-platform and cross-system access, for example the format for the dates and times present, etc.

Dictionary Purpose
A dictionary is created for a number of reasons listed above, including describing terms and prescribing a standard, however the purpose of a dictionary is also directly tied to the needs of the end-user and the audience for whom it is created. In fulfillment of these needs, a dictionary’s purpose also includes providing access to shared data, aiding in database searches, providing information needed for interoperability, guiding entry-level projects and informing controlled vocabulary work.

Uses for Different Sized Groups
A small team of people such as a laboratory group may use a dictionary in order to move away from ‘tribal knowledge’ and articulate their local standards for field acquisition and data processing. On this level, a dictionary can also bring together the language of people with different job descriptions; a field technician and lead scientist can use a dictionary to log and document all appropriate methods and acquisition metadata, a programmer can use a dictionary in order to optimize processing code, and an information manager can use a dictionary in order to efficiently archive files into a database or reference the proper standard, etc.
When multiple small groups are collaborating on a project, the dictionary becomes a tool of interoperability that allows the merging of datasets collected and processed by the individual groups on the human and computer levels. Intra-group differences in methodology and abbreviations for like measurements are clearly articulated and possibly resolved in a single dictionary or a combination of dictionary types (see following Dictionary Types section for a brief list).
A community-wide dictionary allows for automated data comparisons spanning many differences such as in acquisition methodologies. Carbon production for example can refer to land or water-based measurements collected with vastly different methodologies, processed using different calculations, etc. Dictionaries enable the collation of carbon production data from many sources, enabling comparisons and facilitating any potential unit conversions.
Dictionary Types

There are many types of dictionaries, a few examples are listed here:
A code dictionary is a mechanism by which coded entries in a dataset can be explained by outside documentation. Codes are a straightforward and efficient way for a group to communicate locally, and storing the code information in a dictionary format provides a centralized clearinghouse for this important knowledge so users not familiar with the colloquialisms can reference material without speaking to an individual within the group. A common use of codes is in naming field stations; a code dictionary might contain a list of field station names translated into latitude and longitude, or pointers to a paper describing the field grid layout and station positions. An acronym dictionary would also fall under this type.
A unit dictionary links local measurements to a standard or an accepted scientific convention (e.g., the SI standard of units) and bridges local abbreviations and unit names to language preferred by journals and technical publications. From a unit dictionary, a user can generate a list of all entries of (SI) unit type ‘length’, convert between them and provide proper abbreviations as used in a domain journal. Unit dimensions and types are also an important part of the unit dictionary as this information facilitates automated conversions and informs the creation of new units that may not directly relate to the standard, such as units of abundance.
An attribute dictionary details information about attributes stored in a database, including links to unit and code dictionaries. For example, a temperature measurement might be defined by an attribute dictionary with information including what type of temperature is recorded (sea surface temperature), what units the measurement is in (pointer to the unit dictionary entry for ‘Celsius’), a description of the value (a real number, stored as a float with a precision of 0.01). use micromolar example here?
A method dictionary is one way of standardizing methodologies and aiding in metadata entry to a database. Rather than writing a complete method section for each dataset, references to predetermined and accepted practices will pull the proper information out of a method dictionary for insertion into a database or file. find USGS example
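
To make the unit and attribute dictionary ideas a little more concrete, here is a minimal sketch of how the two could be linked. Everything in it is an assumption made for illustration: the entries, the field names (`abbrev`, `to_si`, `storage`, etc.) and the helper functions are hypothetical and do not describe any existing site dictionary. It uses length rather than temperature because length units convert with a single factor.

```python
# Hypothetical unit dictionary: each entry records its unit type, its parent
# SI unit, and the factor that converts a value into that parent unit.
UNIT_DICTIONARY = {
    "meter":      {"abbrev": "m",  "type": "length", "si_parent": "meter",  "to_si": 1.0},
    "centimeter": {"abbrev": "cm", "type": "length", "si_parent": "meter",  "to_si": 0.01},
    "kilometer":  {"abbrev": "km", "type": "length", "si_parent": "meter",  "to_si": 1000.0},
    "second":     {"abbrev": "s",  "type": "time",   "si_parent": "second", "to_si": 1.0},
}

# Hypothetical attribute dictionary entry that points into the unit dictionary.
ATTRIBUTE_DICTIONARY = {
    "depth": {
        "description": "sample depth below the sea surface",
        "unit": "meter",      # pointer to a UNIT_DICTIONARY key
        "storage": "float",
        "precision": 0.01,
    },
}

def units_of_type(unit_type):
    """List every unit name of the given unit type, e.g. 'length'."""
    return sorted(n for n, u in UNIT_DICTIONARY.items() if u["type"] == unit_type)

def convert(value, from_unit, to_unit):
    """Convert between two units of the same type via their parent SI unit."""
    src, dst = UNIT_DICTIONARY[from_unit], UNIT_DICTIONARY[to_unit]
    if src["type"] != dst["type"]:
        raise ValueError(f"cannot convert {src['type']} to {dst['type']}")
    return value * src["to_si"] / dst["to_si"]

def describe(attribute):
    """Build a human-readable description, resolving the unit pointer."""
    a = ATTRIBUTE_DICTIONARY[attribute]
    abbrev = UNIT_DICTIONARY[a["unit"]]["abbrev"]
    return f"{attribute}: {a['description']} [{abbrev}], stored as {a['storage']}"

print(units_of_type("length"))
print(convert(250.0, "centimeter", "meter"))
print(describe("depth"))
```

Keeping the unit type on every entry is what lets `convert` refuse a length-to-time conversion automatically, which is the kind of check the unit dimensions are meant to support.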

Dictionary Vision
A dictionary results from a collaborative process where people with different research goals from different scientific projects, and even from different branches of science, come together with the goal of comparing and/or sharing field measurements and models as well as providing a framework for interoperability to answer larger scientific questions. A dictionary bridges differences in datasets to enable direct comparisons and it fosters understanding between scientists who may use different terminology, computer processing techniques or operating systems. Further, it provides a mechanism to use collected data for a purpose beyond its original scope.
Deciding what information is needed to define a particular term involves the interpretation and discretion of the dictionary creator(s), but in this openness and lack of restriction is a flexibility that makes the notion of a dictionary so useful and important. Science is not a rigid field, it is fluid and ever-changing as hypotheses are proved and disproved, and as new perspectives, concepts and technology expand our ability to measure, analyze and perceive the world. A dictionary is dynamic in order to accommodate changes in understanding while at the same time serving as a static standard to inform data use.

Following discussion of database management system types this week, Geof sent a follow-up link http://www.service-architecture.com/database/articles/index.html. There’s a summary table at the bottom comparing DBMS standards.

This is an interesting read (pulled from my delicious links) on database optimization, and the downsides of normalization:

Normalized data is for sissies

The article links to a pdf presentation given by Cal Henderson, who helped create Flickr. A quick snippet:

In Flickr’s case, they have 13 SELECTs for every INSERT, DELETE, and UPDATE statement hitting their database. Normalization can slow SELECT speed down while denormalization makes your I/D/Us more complicated and slower. Since the application part of Flickr depends so heavily on SELECTs from the database, it makes sense for them to denormalize their data somewhat to speed things up.
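
To see the trade-off in miniature, here is a small sketch using Python's built-in `sqlite3` module. The table and column names are invented for illustration and are not Flickr's actual schema; the point is only that the denormalized read avoids a JOIN, while a change to the copied value has to be written in more than one place.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized: the owner's name lives only in the users table.
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE photos (id INTEGER PRIMARY KEY, user_id INTEGER, title TEXT);
    -- Denormalized: the owner's name is copied onto every photo row.
    CREATE TABLE photos_denorm (id INTEGER PRIMARY KEY, user_id INTEGER,
                                user_name TEXT, title TEXT);
    INSERT INTO users VALUES (1, 'ana');
    INSERT INTO photos VALUES (10, 1, 'kelp');
    INSERT INTO photos_denorm VALUES (10, 1, 'ana', 'kelp');
""")

# Normalized read: needs a JOIN to get the owner's name.
normalized = conn.execute(
    "SELECT p.title, u.name FROM photos p JOIN users u ON u.id = p.user_id"
).fetchall()

# Denormalized read: a single-table SELECT.
denormalized = conn.execute(
    "SELECT title, user_name FROM photos_denorm"
).fetchall()

# The cost shows up on writes: a rename must touch every copy of the name.
conn.execute("UPDATE users SET name = 'ana_m' WHERE id = 1")
conn.execute("UPDATE photos_denorm SET user_name = 'ana_m' WHERE user_id = 1")
renamed = conn.execute("SELECT user_name FROM photos_denorm").fetchall()
print(normalized, denormalized, renamed)
```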

How are we storing our data in the db?
1. As a single point in space/time (snapshot)
2. As a range

This question seems to have major implications for our schema setup. Storing snapshot data is fairly straightforward. We define a set of ‘required attributes’ to denote each point:
project, study, event, bottle, year, month, day, hour, minute, second, latitude, longitude, depth

If we are storing data ranges, the ‘required attributes’ are trimmed. Perhaps we may want to store yearly averages, mins, maxs, etc:
year, temp_avg_yearly, temp_min_yearly, temp_max_yearly, etc…

This introduces a number of new attributes in the data table. These attributes should be created dynamically based on certain qualifiers.

Do we generate this data by processing the ‘snapshot’ values in the db? Or do we store pre-processed/calculated data?
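
For what it's worth, the first option (deriving the range attributes from stored snapshots) is not much code. A minimal sketch, assuming the snapshots are already available as simple rows; the rows and temperature values below are made up, and the attribute names follow the `temp_*_yearly` pattern above:

```python
from collections import defaultdict

# Hypothetical snapshot rows: (year, month, day, temperature in deg C).
snapshots = [
    (2005, 1, 15, 14.0),
    (2005, 7, 15, 19.0),
    (2006, 1, 15, 13.0),
    (2006, 7, 15, 20.0),
    (2006, 10, 15, 18.0),
]

def yearly_ranges(rows):
    """Group snapshot temperatures by year and derive the range attributes."""
    by_year = defaultdict(list)
    for year, _month, _day, temp in rows:
        by_year[year].append(temp)
    return {
        year: {
            "temp_avg_yearly": sum(temps) / len(temps),
            "temp_min_yearly": min(temps),
            "temp_max_yearly": max(temps),
        }
        for year, temps in sorted(by_year.items())
    }

ranges = yearly_ranges(snapshots)
for year, stats in ranges.items():
    print(year, stats)
```

The catch is that this only works when the underlying snapshots are actually in the db; range data that arrives already averaged would still need a place to live.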

Can we assume that all acquired data is snapshot data? If not, how do we handle the ‘range’ data?

At a glance, the SCCOOS data schema seems well-optimized for handling both kinds of data input (snapshots and ranges). However, it is not suited for sampling because it is missing a place to store bottle numbers and associated physical variables for each bottled sample.

The datetime, lat, long, depth values should be stored at the ‘bottle’ level. The SCCOOS schema introduces a redundancy by storing each of these fields at the ‘measurement’ level. Additionally, this opens the door for inconsistencies in the physical variables, thus making it nearly impossible to correlate biological variables together.

Want to make a one-to-one mapping of chlorophyll data to nitrate data? Missing the bottle numbers? Have inconsistent depth measurements for chlorophyll and nitrate?

Good luck.

As I continue to work with the db schema, I am keeping myself focused in the ‘sampling snapshot’ context.

Enter ‘sampling snapshot’ mode:
Each cruise is composed of a series of events. Some of these events correspond to a CTD cast. For each CTD cast, we take a number of bottled water samples at various depths. Each sample has corresponding physical attributes (datetime, location, temp, etc.). Scientists will observe each water sample and record different biological values (chlorophyll, nitrate, etc.).

Ideally this should work. However, we have been burdening the scientists to record all the physical variables in addition to their biological observations. This becomes a duplicated effort, and in most cases the physical variables become out-of-sync.

In a more streamlined process, the scientists would only record the event number and bottle number. Everything else (datetime, lat, long, depth, etc.) should be stored at the ‘bottle’ level. As I understand it, this data is captured by the CTD cast.
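
Here is a rough sketch of what that could look like, using Python's built-in `sqlite3` module. The table names, column names and values are all hypothetical and are not the actual schema; the idea is just that physical variables are stored once per bottle, and each measurement only carries the event and bottle numbers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Physical variables are stored once, at the bottle level.
    CREATE TABLE bottle (
        event INTEGER, bottle INTEGER,
        datetime TEXT, latitude REAL, longitude REAL, depth REAL,
        PRIMARY KEY (event, bottle)
    );
    -- Scientists record only event, bottle and their observed value.
    CREATE TABLE measurement (
        event INTEGER, bottle INTEGER, attribute TEXT, value REAL,
        FOREIGN KEY (event, bottle) REFERENCES bottle (event, bottle)
    );
    INSERT INTO bottle VALUES (3, 1, '2006-04-10T08:15:00', 32.9, -117.3, 10.0);
    INSERT INTO bottle VALUES (3, 2, '2006-04-10T08:15:00', 32.9, -117.3, 50.0);
    INSERT INTO measurement VALUES (3, 1, 'chlorophyll', 1.2);
    INSERT INTO measurement VALUES (3, 1, 'nitrate', 0.4);
    INSERT INTO measurement VALUES (3, 2, 'chlorophyll', 0.6);
    INSERT INTO measurement VALUES (3, 2, 'nitrate', 5.1);
""")

# One-to-one mapping of chlorophyll to nitrate, keyed on (event, bottle).
pairs = conn.execute("""
    SELECT b.depth, c.value, n.value
    FROM bottle b
    JOIN measurement c ON c.event = b.event AND c.bottle = b.bottle
                      AND c.attribute = 'chlorophyll'
    JOIN measurement n ON n.event = b.event AND n.bottle = b.bottle
                      AND n.attribute = 'nitrate'
    ORDER BY b.depth
""").fetchall()
print(pairs)
```

Because depth comes from the bottle row, chlorophyll and nitrate from the same bottle can never disagree about where they were sampled.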

Now, how do we integrate this to also handle ‘streaming snapshots’, and additionally ‘streaming ranges’?

Ontology of Folksonomy: A Mash-up of Apples and Oranges is a great article explaining the individual benefits of both ontologies and folksonomies, and how they can potentially integrate together to aid data interoperability.

An excerpt:

The attack on “ontology” is really an attack on top down categorization as a way of finding and organizing information, and the praise for folksonomy is really the observation that we now have an entirely new source of data for finding and organizing information: user feedback. For the task of finding information, taxonomies are too rigid and purely text-based search is too weak. Tags introduce distributed human intelligence into the system. As others have pointed out, Google’s revolution in search quality began when it incorporated a measure of “popular” acclaim — the hyperlink — as evidence that a page ought to be associated with a query. When the early webmasters were manually creating directories of interesting sites relevant to their interests, they were implicitly “voting with their links.” Today, as the adopters of tagging systems enthusiastically label their bookmarks and photos, they are implicitly voting with their tags. This is, indeed, “radical” in the political sense, and clearly a source of power to exploit.

Open-form discussion notes:

- Lack of focus on Quality Assurance at QARTOD? QA is essential aspect.

- Need QC and QA… How to approach QA?

To data providers: Share algorithms, workflows, etc…

On metadata:
- Julie Bosch - Coming along nicely on info that needs to be captured. Focus on defining critical information, not how to fit it in a metadata format.

- On FGDC metadata - specifications manual is cumbersome.

More Metadata Discussion
Spear-headed by Steve Diggs…

Not here to “bash” metadata, but…

Why do we need metadata? We’ve always had it, but why do we need it?
With very few datasets, no need to worry about QC… didn’t need that info in the dataset. Smaller hard drives mean less storage capacity, etc.

At present, more data being taken. Data volume is continually increasing. Advanced technology and instruments. But still stuck in a rut in metadata paradigms…. do we really need to integrate our data?

What is the data/metadata dividing line? A lot of “metadata” is more important… shouldn’t be defined as “metadata”.

Analogy of dog chasing car… we are chasing metadata objectives.

What would we do if we had PERFECT metadata for all datasets? What could or would we do?

Ontologies (top-down) vs. Folksonomies (bottom-up)
…oy.. too much to type here right now… we are all taking stabs at this one.

iPod: bottom-up demand top-down design

Folksonomy for data discovery, ontology for data delivery

Ideal: Google search “waves data”. Returns lots of data in any format…

Last post was a duplicate effort from Melissa Carter’s notes… she’s the Recorder. I’m sure her notes will eventually be online as well (but not as quickly!)

Building a matrix now…. I scratched the table.. too clunky to handle in a blog. Using h3, strong, and newlines instead to model matrix.

Parameter: Conductivity

Range Test (Gross)
Criteria: Absolute number

Range Test (Climatology profile)
Criteria: +/- (n) standard deviations

Gradient Test
Criteria:

Spike Tests
Criteria: Determined by data provider

Definitions

Bounds - defined to US EEZ only or to entire world

Climatology - historical data as a function of … TBD

Range tests (gross) - bounds on the parameter: removes gross errors

Range tests (climatology) - bounds determined for specific zones: possible references - Ocean laboratory table

Spiking routines - look at parameter spectra and determine outliers

Gradient test - Difference in the value between adjacent measured values (at specific locations?): spatial and temporal
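
For my own reference, the range and gradient tests defined above could be sketched roughly as follows. This is not what the group agreed on: the function names, the bounds, the climatological mean and standard deviation, and the conductivity values are all placeholders I made up.

```python
def gross_range_test(values, lower, upper):
    """True where the value lies inside the absolute bounds."""
    return [lower <= v <= upper for v in values]

def climatology_range_test(values, clim_mean, clim_std, n=3):
    """True where the value is within +/- n standard deviations of the
    climatological mean for the zone."""
    return [abs(v - clim_mean) <= n * clim_std for v in values]

def gradient_test(values, max_step):
    """True where the difference from the previous value is at most max_step.
    The first value has no predecessor and passes by default."""
    if not values:
        return []
    passed = [True]
    for prev, cur in zip(values, values[1:]):
        passed.append(abs(cur - prev) <= max_step)
    return passed

# Hypothetical conductivity series (S/m) with one obviously bad value.
conductivity = [3.9, 4.0, 4.1, 9.5, 4.2]
gross = gross_range_test(conductivity, lower=0.0, upper=7.0)
clim = climatology_range_test(conductivity, clim_mean=4.0, clim_std=0.2, n=3)
grad = gradient_test(conductivity, max_step=1.0)
print(gross, clim, grad, sep="\n")
```

Note that the gradient test fails both the jump into the bad value and the jump back out of it, so the good value right after a bad one gets caught as well.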

Showed up late this morning… CTD is now meeting in Vaughn 100, Waves is in Old Library…

People are discussing QA/QC methods, tests, etc. for CTD. Copying what’s been written so far…. this may be a duplicate effort, but can also be viewed as a backup ;)

Focus on Quality Control and Real-time data collected in-situ

What is real-time data?
- Is definition up to the user?
- OCSD - within hours

What is Quality Control?
- Portion after the data is collected.
- Requires an activity which checks the parameters.

Methods of collection
- Profiling from ship
- Moored
- Profiling floats
- Fixed platforms
- Gliders/AUVs… potential for RT with telemetry
- Expendables

Parameters - define required for QC
Primary Sensors
- Conductivity
- Temperature
- Pressure
- Oxygen
- Other optical
- Position
- Date/Time/Time reference

Derived
- Depth
- Salinity
- Depth

Additional
Consider derived parameters
• Methods
• Varied Instruments

Metadata
- Time
- Position
- Time reference
- Bottom depth/Station depth… is this a parameter? Part of QA or QC?

Ways to verify data collected

No brainer tests
- Range tests: storm options.
- Climatology test: consider specific areas and seasonality
- Gradient test
- Spiking routines - 3 point test, running std of data
- Comparison with other parameters: correlation of std and compared to other sensors
- Comparison with prior or archived data: running std of data
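
A quick sketch of one way the 3 point spike test could work: compare each interior value against the mean of its two neighbors. This is just one possible variant, and the threshold (which is up to the data provider) and the temperature values are made up.

```python
def three_point_spike_test(values, threshold):
    """Flag interior points that differ from the mean of their two
    neighbors by more than threshold. End points are never flagged."""
    flags = [False] * len(values)
    for i in range(1, len(values) - 1):
        neighbor_mean = (values[i - 1] + values[i + 1]) / 2.0
        flags[i] = abs(values[i] - neighbor_mean) > threshold
    return flags

# Hypothetical temperature series (deg C) with a single spike.
temperature = [15.0, 15.1, 19.0, 15.2, 15.3]
spikes = three_point_spike_test(temperature, threshold=2.0)
print(spikes)
```

One caveat with this variant: the spike also pulls up the neighbor mean for the points on either side of it, so a threshold that is too tight will flag those good points too.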

Further tests: additional methods
- Dual sensors
- Descent rate - specific to profiles
- Ensure derived parameters are within boundaries
- Freezing point test
- Comparison between adjacent sensors - vertical and horizontal
- Discrete samples or additional data from sensors- not real-time, QA?
- TSP relationships: water mass characteristics that differ depending on location
- Comparison with models
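
As an example of one of these, a freezing point test checks that a measured temperature is not colder than seawater can be at that salinity and pressure. The sketch below uses what I believe is the UNESCO (1983) freezing point polynomial; the coefficients should be checked against the original reference before anyone relies on them, and the function names and tolerance parameter are my own.

```python
def freezing_point(salinity, pressure_dbar=0.0):
    """Freezing point of seawater in deg C.
    Assumed formula: the UNESCO (1983) polynomial, with salinity on the
    practical salinity scale and pressure in decibars."""
    s = salinity
    return (-0.0575 * s
            + 1.710523e-3 * s ** 1.5
            - 2.154996e-4 * s ** 2
            - 7.53e-4 * pressure_dbar)

def freezing_point_test(temperature, salinity, pressure_dbar=0.0, tolerance=0.0):
    """True if the temperature is at or above the freezing point
    (minus an optional tolerance chosen by the data provider)."""
    return temperature >= freezing_point(salinity, pressure_dbar) - tolerance

print(freezing_point(35.0))             # surface water, salinity 35
print(freezing_point_test(-1.0, 35.0))  # plausible
print(freezing_point_test(-2.5, 35.0))  # colder than seawater can be here
```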

Consider what is required, preferred, etc.

Words of Caution
- Flag instead of throwing out data

To remove data or to flag?
- Recognize instrument problem: recover or remove?
- Allow for user to determine whether they want flagged data
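
A tiny sketch of the flag-don't-delete idea: every value keeps its place in the series and carries a flag, and the user picks which flags they will accept. The flag codes, bounds and salinity values here are hypothetical placeholders, not an agreed scheme.

```python
# Hypothetical flag codes, for illustration only.
GOOD, SUSPECT, BAD = 1, 3, 4

def flag_values(values, passed, fail_flag=BAD):
    """Pair every value with a flag instead of dropping the failures."""
    return [(v, GOOD if ok else fail_flag) for v, ok in zip(values, passed)]

def select(flagged, accept=(GOOD,)):
    """Let the user decide which flags they are willing to accept."""
    return [v for v, flag in flagged if flag in accept]

salinity = [33.5, 33.6, 41.2, 33.7]
passed = [30.0 <= s <= 38.0 for s in salinity]  # made-up gross range bounds
flagged = flag_values(salinity, passed, fail_flag=SUSPECT)

print(flagged)
print(select(flagged))                          # only good values
print(select(flagged, accept=(GOOD, SUSPECT)))  # user opts in to suspect data
```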

Whose responsibility is it?

Problems and how to QC
- Stuck sensor: see constant
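
A stuck sensor check could look for runs of repeated values. A minimal sketch, where the minimum run length and tolerance are assumptions that would depend on the sensor and sampling rate, and the values are made up:

```python
def stuck_sensor_test(values, min_run=3, tolerance=0.0):
    """Flag every value that belongs to a run of at least min_run
    consecutive values, each within tolerance of the one before it."""
    flags = [False] * len(values)
    run_start = 0
    for i in range(1, len(values) + 1):
        run_ended = i == len(values) or abs(values[i] - values[i - 1]) > tolerance
        if run_ended:
            if i - run_start >= min_run:
                for j in range(run_start, i):
                    flags[j] = True
            run_start = i
    return flags

# Hypothetical pressure readings where the sensor repeats 10.2 four times.
pressure = [10.1, 10.2, 10.2, 10.2, 10.2, 10.4]
stuck = stuck_sensor_test(pressure, min_run=3)
print(stuck)
```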

Approaches to QC
- Automated QC versus human checking

Ordering for NOAA
- Tests of location and identification of a station/date/time.
- Stage two: spikes
- Stage three: climate
- Stage four: visual inspection

During lunch break, Steve Diggs asked me “Why are ontologies important?” to which I aptly responded: “They aren’t”.

I explained the theory of a folksonomy, an emerging vocabulary set resulting from a bottom-up process in which members of a community freely choose keywords to their liking. A folksonomy is self-evolving, and provides an accurate model of the dynamic world we are trying to describe. This makes more sense to me than an ontology, which attempts to break everything into distinct categories from a top-down perspective.

Some sites that are based on folksonomies are (surprise!) delicious and Flickr. In fact, even Google’s search engine page-rank algorithm is based on a folksonomy. Instead of Yahoo!’s old approach of categorizing the web, Google ranks pages by popularity. But how do they know which sites are popular?…. they get that data straight from us! All Google does is aggregate existing data and perform algorithms to determine a site’s popularity, and thus, its rank order for search results.

The same logic applies to tagging for Delicious and Flickr. The more times one tag is used for the same object, the more meaningful that tag becomes. Statistical analysis can then be performed to determine which tags are frequently used and can relate like tags together.
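
To illustrate the kind of counting I mean, here is a toy sketch. The users, objects and tags are invented, and this is certainly not how Delicious or Flickr actually implement it; it only counts how often each tag is applied to an object and which tags show up together on the same object.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical taggings: (user, object, tag).
taggings = [
    ("user1", "photo_1", "kelp"),
    ("user2", "photo_1", "kelp"),
    ("user3", "photo_1", "seaweed"),
    ("user1", "photo_2", "kelp"),
    ("user2", "photo_2", "ocean"),
]

# How many times has each tag been applied to each object?
per_object = defaultdict(Counter)
for _user, obj, tag in taggings:
    per_object[obj][tag] += 1

# Which tags appear together on the same object? Frequent pairs are
# candidates for being related (e.g. 'kelp' and 'seaweed').
co_occurrence = Counter()
for tag_counts in per_object.values():
    for pair in combinations(sorted(tag_counts), 2):
        co_occurrence[pair] += 1

print(per_object["photo_1"].most_common(1))
print(co_occurrence.most_common())
```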

An ontology serves a purpose only when it’s needed in a controlled environment. Building an ontology makes sense when all factors are considered and recognized. Software agents built on ontologies will run faster and more efficiently.

However, the world is not controlled. Scientific data is not controlled. Building an ontology here just doesn’t seem to make sense.
