Tue 14 Feb 2006
How are we storing our data in the db?
1. As a single point in space/time (snapshot)
2. As a range
This question seems to have major implications in our schema setup. Storing snapshot data is fairly straight-forward. We define a set of ‘required attributes’ to denote each point:
project, study, event, bottle, year, month, day, hour, minute, second, latitude, longitude, depth
If we are storing data ranges, the ‘required attributes’ are trimmed. Perhaps we may want to store yearly averages, mins, maxs, etc:
year, temp_avg_yearly, temp_min_yearly, temp_max_yearly, etc…
This introduces a number of new attributes in the data table. These attributes should be created dynamically based on certain qualifiers.
Do we generate this data by processing the ’snapshot’ values in the db? Or do we store pre-processed/calculated data?
Can we assume that all acquired data is snapshot data? If not, how do we handle the ‘range’ data?
At a glance, the SCCOOS data schema seems well-optimized for handling both kinds of data input (snapshots and ranges). However, it is not suited for sampling because it is missing a place to store bottle numbers and associated physical variables for each bottled sample.
The datetime, lat, long, depth values should be stored at the ‘bottle’ level. The SCCOOS schema introduces a redundancy by storing each of these fields as the ‘measurement’ level. Additionaly, this opens the door for inconsistencies in the physical variables, thus making it near impossible to correlate biological variables together.
Want to make a one-to-one mapping of chlorophyll data to nitrate data? Missing the bottle numbers? Have inconsistent depth measurements for chlorophyll and nitrate?
Good luck.
As I continue to work with the db schema, I am keeping myself focused in the ’sampling snapshot’ context.
Enter ’sampling snaphot’ mode:
Each cruise is composed of a series of events. Some of these events correspond to a CTD cast. For each CTD cast, we take a number of bottled water samples at various depths. Each sample has corresponding physical attributes (datetime, location, temp, etc.). Scientists will observe each water sample and record different biological values (chlorophyll, nitrate, etc.).
Ideally this should work. However, we have been burdening the scientists to record all the physical variables in addition to their biological observations. This becomes a duplicated effort, and in most cases the physical variables become out-of-sync.
In a more streamlined process, the scientists would only record the event number and bottle number. Everything else (datetime, lat, long, depth, etc.) should be stored at the ‘bottle’ level. As I understand it, this data is captured by the CTD cast.
Now, how do we integrate this to also handle ’streaming snapshots’, and additionaly ’streaming ranges’?
One Response to “Data Rant”
Leave a Reply
You must be logged in to post a comment.


February 15th, 2006 at 1:13 pm
i was thinking about this after we talked yesterday… i am not sure if i made any progress, but here i go anyways
i think you hit a major point with your comment: ‘can we assume all aquired data is snapshot data’. my initial, gut-reaction answer is that very little aquired data will be snapshot data… DEPENDING.
can you better define ’snapshot’ for me? are you thinking of a single measurement (ie. chl) or of a group of measurements taken at the same time and place (ie. a ‘bottle’ sample and all measurements associated with it)?
i think this may be the key to some of our stumbling yesterday since i think we are basically talking about scoping (did i use the word correctly karen?)… my interpretation of the SCCOOS schema is that data are stored at a single value level (each value has a time, place). i think your ’snapshot’ approach is a step more general than that by grouping data into mini clusters of values (each value has a time, place and other specific related values)? i am more used to working with data in ‘ranges’, which to me is the next level if we are talking about the same thing (i think of this more as a cast or timeseries, etc) as it is a wider grouping of the clusters (each SET of values have related times and/or places, and other specific related values).
i think there will be redundancies within a schema based on any of the three levels listed above, and i think there are pros and cons associated with each. if i remember correctly, the SCCOOS schema was set up that way to make it as flexible as possible, so that many varieties of groupings of values could be stored/created/queried out. while this may mean it is not optimized for one specific dataset, depending on the goals for the PAL data, it may still be a worthwhile foundation (ie. adding more fields to better accommodate intermediate-level data groups)?
questions that may help clarify the logic behind a PAL schema (in addition to your questions above):
- what types of data will be dropped into the database?
- what are the most common user-needs for these data (or are there too many to narrow down?)
- how much post-processing can be expected of the person inputting data, any automated processes within the database ‘black box’, end user (related to storing averages, etc).
ps. i miss the spell-check/review options to check before i post, have they moved or are they no longer available?