Questions


How are we storing our data in the db?
1. As a single point in space/time (snapshot)
2. As a range

This question seems to have major implications for our schema setup. Storing snapshot data is fairly straightforward. We define a set of ‘required attributes’ to denote each point:
project, study, event, bottle, year, month, day, hour, minute, second, latitude, longitude, depth
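As a rough sketch, the required attributes above could be captured as a single record type. The field names come from the list; the field types, the class name SnapshotKey, and the sample values are my own assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class SnapshotKey:
    """Required attributes that pin one sample to a point in space/time.

    Field names follow the list above; the types are assumptions.
    """
    project: str
    study: str
    event: int
    bottle: int
    year: int
    month: int
    day: int
    hour: int
    minute: int
    second: int
    latitude: float
    longitude: float
    depth: float

    def timestamp(self) -> datetime:
        # Collapse the six date/time fields into one datetime value.
        return datetime(self.year, self.month, self.day,
                        self.hour, self.minute, self.second)


# Hypothetical values, for illustration only.
key = SnapshotKey("example-project", "example-study", 12, 3,
                  2006, 1, 15, 14, 30, 0, -64.77, -64.05, 25.0)
print(key.timestamp().isoformat())
```

Storing the six date/time fields separately or as one timestamp is a design choice; the sketch only shows that one can be derived from the other.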

If we are storing data ranges, the ‘required attributes’ are trimmed. For example, we may want to store yearly averages, minimums, maximums, etc.:
year, temp_avg_yearly, temp_min_yearly, temp_max_yearly, etc…

This introduces a number of new attributes in the data table. These attributes should be created dynamically based on certain qualifiers.

Do we generate this data by processing the ‘snapshot’ values in the db? Or do we store pre-processed/calculated data?
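To make the first option concrete, here is a minimal sketch of deriving the yearly range attributes from snapshot values instead of storing them. The function name and the sample rows are hypothetical; only the output attribute names (temp_avg_yearly and so on) come from the list above.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical snapshot rows as (year, temperature); the values are made up.
snapshots = [(2005, 1.0), (2005, 3.0), (2006, -1.0), (2006, 0.5), (2006, 2.0)]


def yearly_temperature_summary(rows):
    """Group snapshot temperatures by year and compute avg/min/max per year."""
    by_year = defaultdict(list)
    for year, temp in rows:
        by_year[year].append(temp)
    return {
        year: {
            "temp_avg_yearly": mean(temps),
            "temp_min_yearly": min(temps),
            "temp_max_yearly": max(temps),
        }
        for year, temps in sorted(by_year.items())
    }


summary = yearly_temperature_summary(snapshots)
for year, stats in summary.items():
    print(year, stats)
```

If range data can always be computed like this, the range attributes never need to live in the data table. That only holds if all acquired data really is snapshot data, which is the open question.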

Can we assume that all acquired data is snapshot data? If not, how do we handle the ‘range’ data?

At a glance, the SCCOOS data schema seems well-optimized for handling both kinds of data input (snapshots and ranges). However, it is not suited for sampling because it is missing a place to store bottle numbers and associated physical variables for each bottled sample.

The datetime, lat, long, depth values should be stored at the ‘bottle’ level. The SCCOOS schema introduces a redundancy by storing each of these fields at the ‘measurement’ level. Additionally, this opens the door for inconsistencies in the physical variables, making it nearly impossible to correlate biological variables with one another.

Want to make a one-to-one mapping of chlorophyll data to nitrate data? Missing the bottle numbers? Have inconsistent depth measurements for chlorophyll and nitrate?

Good luck.

As I continue to work with the db schema, I am keeping myself focused on the ‘sampling snapshot’ context.

Enter ‘sampling snapshot’ mode:
Each cruise is composed of a series of events. Some of these events correspond to a CTD cast. For each CTD cast, we take a number of bottled water samples at various depths. Each sample has corresponding physical attributes (datetime, location, temp, etc.). Scientists will observe each water sample and record different biological values (chlorophyll, nitrate, etc.).

Ideally this should work. However, we have been burdening the scientists with recording all the physical variables in addition to their biological observations. This duplicates effort, and in most cases the physical variables end up out of sync.

In a more streamlined process, the scientists would only record the event number and bottle number. Everything else (datetime, lat, long, depth, etc.) should be stored at the ‘bottle’ level. As I understand it, this data is captured by the CTD cast.
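A minimal sketch of that layout, using SQLite from Python. The table names, column names, and sample values are assumptions of mine, not the SCCOOS schema or any existing one. The point is that physical variables live once in the bottle table, and each measurement carries only the event number and bottle number.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Physical variables are stored once per bottle (assumed column names).
CREATE TABLE bottle (
    event      INTEGER NOT NULL,
    bottle     INTEGER NOT NULL,
    sampled_at TEXT    NOT NULL,
    latitude   REAL    NOT NULL,
    longitude  REAL    NOT NULL,
    depth      REAL    NOT NULL,
    PRIMARY KEY (event, bottle)
);
-- Scientists record only event, bottle, and their observed value.
CREATE TABLE measurement (
    event    INTEGER NOT NULL,
    bottle   INTEGER NOT NULL,
    variable TEXT    NOT NULL,
    value    REAL    NOT NULL,
    PRIMARY KEY (event, bottle, variable),
    FOREIGN KEY (event, bottle) REFERENCES bottle (event, bottle)
);
""")

# Hypothetical sample data, for illustration only.
conn.executemany("INSERT INTO bottle VALUES (?, ?, ?, ?, ?, ?)", [
    (1, 1, "2006-01-15T14:30:00", -64.77, -64.05, 10.0),
    (1, 2, "2006-01-15T14:32:00", -64.77, -64.05, 25.0),
])
conn.executemany("INSERT INTO measurement VALUES (?, ?, ?, ?)", [
    (1, 1, "chlorophyll", 0.8), (1, 1, "nitrate", 21.0),
    (1, 2, "chlorophyll", 0.5), (1, 2, "nitrate", 24.0),
])

# One-to-one mapping of chlorophyll to nitrate, joined through the bottle.
rows = conn.execute("""
    SELECT b.depth, c.value, n.value
    FROM bottle AS b
    JOIN measurement AS c
      ON c.event = b.event AND c.bottle = b.bottle AND c.variable = 'chlorophyll'
    JOIN measurement AS n
      ON n.event = b.event AND n.bottle = b.bottle AND n.variable = 'nitrate'
    ORDER BY b.depth
""").fetchall()
print(rows)
```

Because depth exists only on the bottle row, chlorophyll and nitrate from the same bottle cannot disagree about it.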

Now, how do we integrate this to also handle ‘streaming snapshots’, and additionally ‘streaming ranges’?

During lunch break, Steve Diggs asked me “Why are ontologies important?” to which I aptly responded: “They aren’t”.

I explained the theory of a folksonomy, an emerging vocabulary set resulting from a bottom-up process in which members of a community freely choose keywords to their liking. A folksonomy is self-evolving, and provides an accurate model of the dynamic world we are trying to describe. This makes more sense to me than an ontology, which attempts to break everything into distinct categories from a top-down perspective.

Some sites that are based on folksonomies are (surprise!) Delicious and Flickr. In fact, even Google’s PageRank algorithm is based on a folksonomy. Instead of Yahoo!’s old approach of categorizing the web, Google ranks pages by popularity. But how do they know which sites are popular? They get that data straight from us! All Google does is aggregate existing data and run algorithms to determine a site’s popularity, and thus its rank order in search results.

The same logic applies to tagging on Delicious and Flickr. The more times one tag is used for the same object, the more meaningful that tag becomes. Statistical analysis can then be performed to determine which tags are used frequently and to relate similar tags to one another.
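A toy sketch of that kind of analysis: counting how often each tag is applied to one object, and how often pairs of tags appear together. The tag data is made up, and this is not how Delicious or Flickr actually do it, just the simplest version of the idea.

```python
from collections import Counter
from itertools import combinations

# Hypothetical taggings: each set holds the tags one user gave the same object.
taggings = [
    {"ocean", "ctd", "chlorophyll"},
    {"ocean", "ctd"},
    {"ocean", "nitrate"},
]

# How many users applied each tag to the object.
tag_counts = Counter(tag for tags in taggings for tag in tags)

# How many users applied both tags of a pair; sorting makes pairs order-independent.
pair_counts = Counter(
    pair for tags in taggings for pair in combinations(sorted(tags), 2)
)

print(tag_counts.most_common(2))
print(pair_counts.most_common(1))
```

Tags with high counts are the community’s consensus vocabulary for the object, and pairs with high co-occurrence counts are candidates for related tags.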

An ontology serves a purpose only when it’s needed in a controlled environment. Building an ontology makes sense when all factors are considered and recognized. Software agents built on ontologies will run faster and more efficiently.

However, the world is not controlled. Scientific data is not controlled. Building an ontology here just doesn’t seem to make sense.

In following up on a researcher’s request to post supplemental material on a project web site in coordination with a published journal article, we’ve discussed establishing a simplified path such as http://pal.lternet.edu/suppl that can be created as a physical location initially and shifted to a virtual pointer as our web structure matures. The idea is to provide directories tied to the related database, in this case the bibliographic database with its attendant unique identifier (or LTER contribution #), i.e. http://pal.lternet.edu/suppl/biblio279.

At the following link (http://www.elsevier.com/wps/find/journaldescription.cws_home/601265/authorinstructions), the use of the Digital Object Identifier is summarized as follows:

“The digital object identifier (DOI) may be used to cite and link to electronic documents. The DOI consists of a unique alpha-numeric character string which is assigned to a document by the publisher upon the initial electronic publication. The assigned DOI never changes. Therefore, it is an ideal medium for citing a document, particularly ‘Articles in press’ because they have not yet received their full bibliographic information. The correct format for citing a DOI is shown as follows (example taken from a document in the journal Physics Letters B): doi:10.1016/j.physletb.2003.10.071”

“When you use the DOI to create URL hyperlinks to documents on the web, they are guaranteed never to change.”
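For reference, a small sketch of turning a DOI citation into such a hyperlink. It assumes the public doi.org resolver as the link prefix; the function name doi_to_url is hypothetical, and the example DOI is the one from the quote above.

```python
from urllib.parse import quote


def doi_to_url(citation: str) -> str:
    """Turn a citation like 'doi:10.1016/...' into a resolver URL.

    Assumes the https://doi.org/ resolver as the link prefix.
    """
    doi = citation.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[len("doi:"):]
    # Percent-encode any unusual characters but keep the '/' separator.
    return "https://doi.org/" + quote(doi, safe="/")


print(doi_to_url("doi:10.1016/j.physletb.2003.10.071"))
```

The resolver, not the publisher’s own URL, is what stays fixed: where the link lands can change while the DOI itself does not.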

The idea of ‘guaranteed never to change’ raises the question of how long ‘forever’ lasts in contemporary organizational life or in internet timeframes, and prompts two thoughts: 1) the lternet virtual pointer has the advantage of stability in addition to its original strengths of network identity and geographic independence; 2) it might be worthwhile to ask the SIO library about their insights or plans with respect to this type of request.