Last year, members from various LTER sites collaborated to create the LTER EML Unit Registry. This provided an authoritative source of units to reference when generating EML documents.

LTER Unit metersPerSquareSecond
Though the Unit Registry effort has been successful, there have been a few technical drawbacks. One issue was dealing with “junk” characters in the unit abbreviation field. These appear when different character encodings conflict with each other, that is, when text written in one encoding is displayed using another.

For example, the unit metersPerSquareSecond should have the abbreviation m/s². However, the LTER EML Unit Registry page declares a charset encoding of iso-8859-1, and this is what causes the “junk” characters to appear. The picture below shows the source code from the LTER EML Unit Registry home page.

charset=iso
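The mismatch can be reproduced outside the browser. The following is a minimal Python sketch, assuming the abbreviation is stored as UTF-8 bytes and the page is then decoded as iso-8859-1 because of the declared charset. The helper name `render_as` is purely illustrative and not part of the registry code.

```python
def render_as(text: str, stored: str, declared: str) -> str:
    """Encode text with the stored encoding, then decode it with the declared one."""
    return text.encode(stored).decode(declared)


# Abbreviation for metersPerSquareSecond; the superscript two is U+00B2.
abbreviation = "m/s\u00b2"

# Assumption: the bytes are UTF-8 but the page declares iso-8859-1.
garbled = render_as(abbreviation, "utf-8", "iso-8859-1")

# Declaring UTF-8 matches the stored bytes, so the text survives intact.
correct = render_as(abbreviation, "utf-8", "utf-8")

print(garbled)  # m/sÂ²
print(correct)  # m/s²
```

The stray "Â" is the first of the two bytes that UTF-8 uses for the superscript two, shown as if it were a character of its own.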

To solve this issue locally, I set the charset encoding to UTF-8. This Unicode encoding ensures that the correct characters appear, among them the superscript 2 and 3 (for squared and cubed, respectively) and the Greek letters mu (for micro) and omega (for ohm). The picture below shows the source code from the Ocean Informatics Datazoo home page. The Palmer LTER and CCE LTER Unit Registries are kept in sync with each other.

charset=utf
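To see why UTF-8 is the safer choice for these symbols, the sketch below lists the UTF-8 bytes for each one and checks whether iso-8859-1 can represent it at all. The specific code points are my assumption about which characters are meant (the Greek letter mu, U+03BC, instead of the separate micro sign, U+00B5), and the helper names are illustrative.

```python
# Assumed code points for the symbols mentioned above.
symbols = {
    "superscript two": "\u00b2",
    "superscript three": "\u00b3",
    "Greek small letter mu": "\u03bc",
    "Greek capital letter omega": "\u03a9",
}


def utf8_hex(ch: str) -> str:
    """Return the UTF-8 bytes of ch as space-separated hex pairs."""
    return " ".join(f"{b:02x}" for b in ch.encode("utf-8"))


def fits_latin1(ch: str) -> bool:
    """Report whether ch can be encoded in iso-8859-1."""
    try:
        ch.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


for name, ch in symbols.items():
    print(f"{name}: utf-8 = {utf8_hex(ch)}, fits iso-8859-1 = {fits_latin1(ch)}")
```

Under these assumptions, the two superscripts exist in both encodings but with different byte sequences, while the Greek letters have no iso-8859-1 representation at all.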

Notes:
- To remove the junk characters, I copied and pasted the correct characters from the “Special Characters…” palette in the Safari browser's Edit menu.
- No changes were required in the MySQL Collation, contrary to what I initially expected. MySQL is able to store Unicode-encoded strings as text datatypes, using our default Collation of latin1_swedish_ci.
- Unicode-encoded strings should not be wrapped by the htmlentities() function in PHP. Doing so causes the “junk” characters to appear.
- This page was a good reference for working with Unicode in MySQL and PHP. Additionally, the O’Reilly book Building Scalable Web Sites has an entire chapter devoted to character encoding. This book was authored by Cal Henderson of Flickr fame. I was able to read parts of the book at Safari Tech Books Online.
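The htmlentities() note can be illustrated with a rough emulation. This is Python, not PHP, and it rests on one assumption: that htmlentities() reads the string as ISO-8859-1, which was that function's default charset in PHP versions before 5.4. The function name `entities_assuming_latin1` is hypothetical.

```python
from html import unescape
from html.entities import codepoint2name


def entities_assuming_latin1(utf8_bytes: bytes) -> str:
    """Treat each byte as one ISO-8859-1 character and swap in a named entity where one exists."""
    out = []
    for byte in utf8_bytes:
        name = codepoint2name.get(byte)
        out.append(f"&{name};" if name else chr(byte))
    return "".join(out)


# The UTF-8 bytes of "m/s²" are split into two separate entities.
encoded = entities_assuming_latin1("m/s\u00b2".encode("utf-8"))
print(encoded)            # m/s&Acirc;&sup2;

# What a browser would display after resolving the entities.
print(unescape(encoded))  # m/sÂ²
```

Once the bytes are turned into entities this way, the junk is baked into the markup, so changing the page charset afterwards would not repair it.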