Tue 14 Feb 2006
How are we storing our data in the db?
1. As a single point in space/time (snapshot)
2. As a range
This question seems to have major implications for our schema setup. Storing snapshot data is fairly straightforward. We define a set of ‘required attributes’ to denote each point:
project, study, event, bottle, year, month, day, hour, minute, second, latitude, longitude, depth
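To make the snapshot case concrete, here is a minimal sketch of one point as a Python record. The field names come from the list above; the types, the `Snapshot` class name, the `timestamp` helper, and the sample values are assumptions made only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Snapshot:
    """One point in space/time. Field types are assumed, not taken from a real schema."""

    # Which sample this is
    project: str
    study: str
    event: int
    bottle: int
    # Point in time
    year: int
    month: int
    day: int
    hour: int
    minute: int
    second: int
    # Point in space
    latitude: float
    longitude: float
    depth: float

    def timestamp(self) -> datetime:
        # Collapse the six time fields into a single datetime value.
        return datetime(self.year, self.month, self.day,
                        self.hour, self.minute, self.second)


# Made-up values, for illustration only.
point = Snapshot("demo_project", "demo_study", 1, 3,
                 2006, 2, 14, 9, 30, 0,
                 32.87, -117.26, 10.0)
print(point.timestamp().isoformat())
```

Every field is required here, which is the point: a snapshot cannot exist without its full position in space and time.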
If we are storing data ranges, the ‘required attributes’ are trimmed. Perhaps we may want to store yearly averages, minimums, maximums, etc:
year, temp_avg_yearly, temp_min_yearly, temp_max_yearly, etc…
This introduces a number of new attributes in the data table. These attributes should be created dynamically based on certain qualifiers.
Do we generate this data by processing the ‘snapshot’ values in the db? Or do we store pre-processed/calculated data?
Can we assume that all acquired data is snapshot data? If not, how do we handle the ‘range’ data?
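As a rough sketch of the first option (deriving range values from the snapshots already in the db), the yearly attributes could be computed on demand with a grouped query. The `snapshot` table, its `temp` column, and the sample rows are hypothetical; SQLite is used only because it ships with Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical snapshot table, reduced to the columns needed here.
conn.execute(
    "CREATE TABLE snapshot (year INTEGER, month INTEGER, day INTEGER, temp REAL)"
)
conn.executemany(
    "INSERT INTO snapshot VALUES (?, ?, ?, ?)",
    [
        (2005, 1, 10, 14.0),
        (2005, 7, 10, 20.0),
        (2006, 1, 10, 13.0),
        (2006, 2, 14, 15.0),
    ],
)

# Derive the yearly range attributes from the snapshot rows.
yearly = conn.execute(
    """
    SELECT year,
           AVG(temp) AS temp_avg_yearly,
           MIN(temp) AS temp_min_yearly,
           MAX(temp) AS temp_max_yearly
    FROM snapshot
    GROUP BY year
    ORDER BY year
    """
).fetchall()

for row in yearly:
    print(row)
```

If this turns out to be fast enough, the range attributes never need to be stored at all. It does not help with range data that arrives already aggregated, which would still need its own place in the schema.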
At a glance, the SCCOOS data schema seems well-optimized for handling both kinds of data input (snapshots and ranges). However, it is not suited for sampling because it is missing a place to store bottle numbers and associated physical variables for each bottled sample.
The datetime, lat, long, depth values should be stored at the ‘bottle’ level. The SCCOOS schema introduces a redundancy by storing each of these fields at the ‘measurement’ level. Additionally, this opens the door for inconsistencies in the physical variables, making it nearly impossible to correlate biological variables with one another.
Want to make a one-to-one mapping of chlorophyll data to nitrate data? Missing the bottle numbers? Have inconsistent depth measurements for chlorophyll and nitrate?
Good luck.
As I continue to work with the db schema, I am keeping myself focused on the ‘sampling snapshot’ context.
Enter ‘sampling snapshot’ mode:
Each cruise is composed of a series of events. Some of these events correspond to a CTD cast. For each CTD cast, we take a number of bottled water samples at various depths. Each sample has corresponding physical attributes (datetime, location, temp, etc.). Scientists will observe each water sample and record different biological values (chlorophyll, nitrate, etc.).
Ideally this should work. However, we have been asking the scientists to record all the physical variables in addition to their biological observations. This duplicates effort, and in most cases the physical variables end up out of sync.
In a more streamlined process, the scientists would only record the event number and bottle number. Everything else (datetime, lat, long, depth, etc.) should be stored at the ‘bottle’ level. As I understand it, this data is captured by the CTD cast.
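A minimal sketch of what this could look like, assuming two hypothetical tables: `bottle` holds the physical attributes once per (event, bottle), and `measurement` holds only the event number, bottle number, variable name, and value. The table names, column names, and sample values are made up for illustration and are not the SCCOOS schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default

conn.executescript(
    """
    -- Physical attributes live here, once per bottle (assumed to come from the CTD cast).
    CREATE TABLE bottle (
        event      INTEGER NOT NULL,
        bottle     INTEGER NOT NULL,
        sampled_at TEXT    NOT NULL,
        latitude   REAL    NOT NULL,
        longitude  REAL    NOT NULL,
        depth      REAL    NOT NULL,
        PRIMARY KEY (event, bottle)
    );
    -- Scientists record only event, bottle, and their observation.
    CREATE TABLE measurement (
        event    INTEGER NOT NULL,
        bottle   INTEGER NOT NULL,
        variable TEXT    NOT NULL,
        value    REAL    NOT NULL,
        PRIMARY KEY (event, bottle, variable),
        FOREIGN KEY (event, bottle) REFERENCES bottle (event, bottle)
    );
    """
)

# Made-up sample data.
conn.executemany(
    "INSERT INTO bottle VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 1, "2006-02-14T09:30:00", 32.87, -117.26, 10.0),
        (1, 2, "2006-02-14T09:32:00", 32.87, -117.26, 50.0),
    ],
)
conn.executemany(
    "INSERT INTO measurement VALUES (?, ?, ?, ?)",
    [
        (1, 1, "chlorophyll", 0.8),
        (1, 1, "nitrate", 2.1),
        (1, 2, "chlorophyll", 0.3),
        (1, 2, "nitrate", 9.4),
    ],
)

# One-to-one mapping of chlorophyll to nitrate, keyed on (event, bottle).
paired = conn.execute(
    """
    SELECT b.event, b.bottle, b.depth, c.value AS chlorophyll, n.value AS nitrate
    FROM bottle AS b
    JOIN measurement AS c
      ON c.event = b.event AND c.bottle = b.bottle AND c.variable = 'chlorophyll'
    JOIN measurement AS n
      ON n.event = b.event AND n.bottle = b.bottle AND n.variable = 'nitrate'
    ORDER BY b.event, b.bottle
    """
).fetchall()

for row in paired:
    print(row)
```

Because depth exists only in `bottle`, chlorophyll and nitrate for the same bottle cannot disagree about it, and the foreign key rejects a measurement that points at a bottle that was never recorded.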
Now, how do we integrate this to also handle ‘streaming snapshots’, and additionally ‘streaming ranges’?

