March 2006


by Lynn and Jerry

In the past, physical copies of data, articles and technical reports published at SIO were collected through the SIO Publications Committee (a sub-committee of the SIO Staff Council), given a SIO Reference Series publication number, and listed in an annually published hardcopy bibliography. This centralized system broadened community awareness of research and provided a common location for all publications independent of the scientist or lab involved in the research. Institutional support for this program ended in 2002, and with the retirement of Kitty Kuhns, the annual bibliographies ended.

Today, as demand for access to articles and data is growing, the need for a centralized, stable and accessible repository has re-emerged with some new obstacles. The web has provided the needed centralized space of which many researchers and groups have taken advantage. However, with constantly shifting URLs, servers, and funding cycles, maintaining these local access points is a laborious and frustrating task.

To fill the need, the SIO Library has assumed the local responsibility to maintain an online archive and repository called the e-scholarship Repository . This is a free service provided by a partnership between the University of California Office of the President (UCOP) and the non-profit California Digital Library(CDL). An author can submit a published paper in PDF format to the e-scholarship system through an email to siolib@sio.ucsd.edu, an email alias provided by Peter Brueggeman, Director of the SIO Library. The paper will be posted under the appropriate headers (headers include: Tech Reports, SIO Reference, etc) and given a permanent URL. Data, photos, graphs, and other file types can be submitted as well, either independently or as an ‘associated file’ to a paper. Web URLs can be included, though as websites grow, shift and age, links commonly break and are discouraged unless realistically stable for the long-term.

SIO has a separate space in the e-scholarship Repository and allows further subdivision into ‘centers’ and ’series’. The OceanInformatics group for instance can request designation as a ‘center’ with all submissions falling under this hierarchical step. A center can have multiple series which group submissions by designated subjects.

All files in the repository are stored on the servers at the CDL. The CDL is responsible for moving files and maintaining the permanent URLs through server upgrades, etc.

A note about copyrights:

To post a published article as part of the SIO Repository, the lead author must be granted permission through the publishing house or agency. While many journals have denied copyright releases for online access, a few have begun to open up to the idea. Anecdotally, the Elsevier journals (including Continental Shelf Research, Progress in Oceanography and others) will grant authors permission to post their articles online. The research society journals (i.e., those published through the AMS, AGU, etc.) have tended to be less cooperative in this regard though AGU will consider allowing pre-prints to be posted. Authors should investigate any copyright restrictions before posting a published work, the SHERPA/ROMEO list is an excellent place to start.

On 6 March, 2006 there was an alert from iocean regarding probable failure with the RAID disk system:


==============================================================
A problem has been detected on this server.

Status Summary

Reason(s) for notification:
Drives

Server:
Host : iocean
Model : RackMac3,1
Uptime : 67016 minutes
OS version : Mac OS X Server 10.4.4 (8G32)
Processor : 2 x 2000 MHz
Memory : 1024 MB
BootROM : $0005.17f1
Serial : QP41703XPNK

Memory:
Memory Slot "DIMM0/J11" : 512MB, ECC DDR SDRAM, PC3200U-30330
Memory Slot "DIMM1/J12" : 512MB, ECC DDR SDRAM, PC3200U-30330

Drives:
Drive 1 (disk2) : Normal
Drive 2 (disk1) : Normal
Drive 3 (disk0) : Warning
==============================================================

Over the course of the few days leading up to the failure warning, a number of users reported slow database performance in some web applications and services provided by iocean (Hlab Forum, etc.). The data disk[s] on iocean (/Volumes/iodata) is a RAID 1 (mirrored) diskset consisting of two (disk0 and disk2) of the three disks on the server (the third disk, disk1, is a standalone volume, ioceanHD, which contains the operating system and applications). The RAID volume, iodata, is a software-based RAID.

iocean is under AppleCare warranty which provides hardware support for up to three years after purchase. We ordered a new disk from Apple through the Upper-Campus Tech Services and it arrived late afternoon on 7 March. We replaced the failed disk with the new disk the next morning, 8 March, and tried to start rebuilding the RAID array using the Disk Utility (GUI) tool on iocean. This is accomplished by dragging the new disk icon into the RAID window. However, this didn’t work. The GUI tool gave no indication as to why this failed. We next went to the commandline tool, ‘diskutil’ to diagnose the problem. Using the command:


===========================================
iocean:~ jrw$ diskutil list
/dev/disk0
#: type name size identifier
0: Apple_partition_scheme *233.8 GB disk0
1: Apple_partition_map 31.5 KB disk0s1
2: Apple_Boot 128.0 MB disk0s2
3: Apple_RAID 233.6 GB disk0s3
/dev/disk1
#: type name size identifier
0: Apple_partition_scheme *76.7 GB disk1
1: Apple_partition_map 31.5 KB disk1s1
2: Apple_HFS ioceanHD 76.6 GB disk1s3
/dev/disk2
#: type name size identifier
0: Apple_partition_scheme *233.8 GB disk2
1: Apple_partition_map 31.5 KB disk2s1
2: Apple_Boot 512.0 KB disk2s2
3: Apple_RAID 233.8 GB disk2s3
/dev/disk3
#: type name size identifier
0: Apple_HFSX iodata *233.6 GB disk3
===========================================

This showed that the old disk, ‘disk2′ (formated in Panther) had a slightly different partition mapping than the new disk, ‘disk0′. The disk2 had a smaller “Apple_Boot” (disk2s2) partition than the new disk formatted with Apple Disk Utility, thus, the main data partition (disk2s3) was larger than the old disk. The software RAID application won’t allow RAIDs with disks of dissimilar partition maps.

The solution:

Create a single filesystem disk out of the new disk using the Disk Utility partitioning option (HFS+ w/Case Insensitivity and Journaling) .

Copy the current filesystem running on the remaining disk of the degraded RAID set over to the newly formatted disk using Carbon Copy Cloner (or the Restore tab of Disk Utility).

Create a new unpaired mirror RAID set on the new disk, using the “enableRAID” command under the commandline ‘diskutil’ application.

Delete the old RAID array (do this with extreme caution because all of the data on this disk will be erased, i.e., make sure all of the data on this disk has been copied to the new unpaired RAID array before taking this step) using either Disk Utility (GUI) or diskutil (commandline).

Repartition the old disk (the remaining good disk in the old RAID array) so it matches the partition map of the new disk

Using the “repairMirror” option of the commandline ‘diskutil’ application, or by dragging the newly partitioned old disk into the new RAID set in Disk Utility, this disk is incorporated into the newly established RAID array.

The RAID repair, or rebuild, is run as a background process, which means that the computer continues to function online, though with somewhat degraded performance, throughout the rebuild. All user activity should be unaffected. If the problem with the different partition maps hadn’t cropped up, the entire process outlined above could have happened without ever taking the system offline. The physical disks are “hot-swappable” (they can be removed and inserted without taking the system offline) so they can be replaced and the RAID rebuilt without a break in service.

I’ve been using phpMyAdmin to generate database schemas. Since I use phpMyAdmin pretty much every day, it’s very convenient as opposed to DBDesigner, which resides on a separate server and has the overhead of connecting to and opening another app (just for one function).

The schema diagrams generated by phpMyAdmin are pretty, with colorful straight lines to represent table/field relations. However, the major pitfall is that these schema diagrams don’t show the data type. It’s usually helpful to know whether a field is an int, float, char, or text. This is one thing that DBDesigner does quite well.

Given that we have a couple meetings next week involving schemas (one of them being the start of a series of schema meetings), I decided to hack the phpMyAdmin code a bit so that it shows the data types side-by-side with the field names. This way I can continue to use phpMyAdmin to generate diagrams without the comprising of less information.

Beware, the rest of the post gets a bit techinical (aka geeky).

The only file I hacked was pdf_schema.php. This is located in the root dir of the phpmyadmin install.

This file contains the PMA_RT_Table class, which is where all the hacking was done.


class PMA_RT_Table {
// lots of code...
}

First, I added a new private var to store an array of types. This corresponds to the existing array of fieldnames for the class.


class PMA_RT_Table {
   var $fields = array(); // existing code
   var $types = array(); // my code
}

Next, in the constructor function, I added a line of code to populate the types array:


function PMA_RT_Table(...) {
// ...snip

       // load fields
       while ($row = PMA_DBI_fetch_row($result)) {
            $this->types[] = $row[1];  // my code
            $this->fields[] = $row[0];
        }

// ...snip
}

I then added code for the table width calculation to ensure that the drawn table diagrams would be wide enough to display both the field name and data type:


function PMA_RT_Table_setWidth($ff) {
// ...snip

      foreach ($this->fields AS $key => $field) {
           // srh hack to set width for field and type
          $type_arr = split(" +",$this->types[$key]);
          $type = $type_arr[0];
          if ('enum' == substr($type,0,4))
                $type = 'enum';
          if ('set' == substr($type,0,3))
                $type = 'set';
          $field .= " [{$type}]";

// ...snip
}

Finally, I added the same code to actually splice the field name with the data type:


function PMA_RT_Table_draw(...) {
// ...snip

      foreach ($this->fields AS $key => $field) {
           // srh hack to show data types next to field names
          $type_arr = split(" +",$this->types[$key]);
          $type = $type_arr[0];
          if ('enum' == substr($type,0,4))
                $type = 'enum';
          if ('set' == substr($type,0,3))
                $type = 'set';
          $field .= " [{$type}]";

// ...snip
}

I won’t bother going into depth about how and why I hacked the code. Doing so would be hard without presenting more context from the pdf_schema.php file.

The main purpose of this post is to provide me with a memory refresher so I can come back and reference it later at a critical time (i.e. upgrading phpmyadmin). Plus it’s always good to document work like this, no matter how you choose to document it. I chose the blog, and by doing so I am making visible a minor part of my work that would otherwise remain situated in oblivion.

View the difference!

» Before the hack (pdf)

» After the hack (pdf)

O’Reilly’s Ambient Findability (Peter Morville) provided a broad contemporary overview of data and information issues. Here are three recent publications relevant to today’s data environments:

-A Nature journal issue with Microsoft on 2020 Vision
http://research.microsoft.com/towards2020science/nature.htm;
http://www.nature.com/nature/journal/v440/n7083/index.html

-The term “dataspace” creates a wider-than-one-system conceptual umbrella for collections
MFranklin,. AHalevy, and DMaier, 2005.
From Databases to Dataspaces: A New Abstraction for Information Management,
ACM SIGMOD Record;
http://portal.acm.org/citation.cfm?id=1107499.1107502

-A look at the handling large amounts of streamed data - historic and real-time
Schandrasekaran and MFranklin. 2004.
Remembrance of Streams Past: Overload-Sensitive Management of Archived Streams.
Proceedings of the 30th VLDB Conference, Toronto, Canada
http://www.cs.berkeley.edu/~franklin/Papers/ChandrasekaranVLDB2004.pdf

This is an interesting read (pulled from my delicious links) on database optimization, and the downsides of normalization:

Normalized data is for sissies

The article links to a pdf presentation given by Cal Henderson, who helped create Flickr. A quick snippet:

In Flickr’s case, they have 13 SELECTs for every INSERT, DELETE, and UPDATE statement hitting their database. Normalization can slow SELECT speed down while denormalization makes your I/D/Us more complicated and slower. Since the application part of Flickr depends so heavily on SELECTs from the database, it makes sense for them to denormalize their data somewhat to speed things up.

Mason and I have moved the dev CCE LTER into the Production area. Both the dev and production areas are versioned under Subversion. This makes it much easier to keep both sites in sync (with content, etc.).

The CCE LTER Site location is: http://cce.lternet.edu/