Friday, July 5, 2013

Resolving Identifiers for Natural History Collections

The discussion of unique identifiers for natural history collection records and their corresponding specimens has gone on for decades, heating up with every passing year. You can read more from BiSciCol and from SoYouThinkYouCanDigitize. This much we know about these identifiers:
  • Must be globally unique
  • Best to mint/create at the Source System of Record (SSoR)
  • Should be passed to all 'downstream' data consumers and data aggregators
  • Data consumers and data aggregators need to preserve them
  • Should be easily resolvable back to the Source System of Record (SSoR)
  • Desirable to honor legacy identifiers, which may not be globally unique but are already in use and cited in the existing body of published literature.
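
To make the first few points concrete, here is a minimal sketch of minting a GUID at the source and passing it downstream untouched. It assumes a random (version 4) UUID as the identifier scheme, which is only one of several options; the record layout, the function names, and the catalog number are hypothetical and not taken from any particular collections database.

```python
import uuid


def mint_guid() -> str:
    """Mint a random (version 4) UUID and return it in canonical text form."""
    return str(uuid.uuid4())


def create_record(catalog_number: str) -> dict:
    """Create a record in the source system; the GUID is minted exactly once, here."""
    return {"guid": mint_guid(), "catalog_number": catalog_number}


def export_record(record: dict) -> dict:
    """Build the copy sent downstream, carrying the source GUID unchanged."""
    return dict(record)


record = create_record("EX-0001")  # hypothetical catalog number
shared = export_record(record)
print(shared["guid"] == record["guid"])  # True: the consumer sees the same GUID
```

The point of the sketch is the division of labor: only the source system calls `mint_guid`, and everything downstream copies the value rather than generating its own.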


When unique identifiers are used correctly, it is less important whether they follow competing standard A or B, so long as they are unique. One way to envision this is to use a key and keychain metaphor. The GUIDs become virtual keys, and the keychain is the apparatus used to associate one with another. Instead of each key belonging to a car, deadbolt lock, or locker, they might belong to one or more specimens, events, citations, or localities. Taking it one step further, ScioQualis lets users define relationships between different keys.
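
The keychain metaphor can be sketched as a small data structure. This is an illustration only, not a description of how ScioQualis stores its data: the `Keychain` class, its method names, the relation names, and the placeholder GUID strings are all assumptions made for the example.

```python
from collections import defaultdict


class Keychain:
    """Associates GUIDs (the keys) with what they identify and with each other."""

    def __init__(self):
        self.kinds = {}                # guid -> kind of thing it identifies
        self.links = defaultdict(set)  # guid -> {(relation, other_guid), ...}

    def add_key(self, guid, kind):
        self.kinds[guid] = kind

    def link(self, guid, relation, other_guid):
        # Both keys must already be on the keychain.
        for g in (guid, other_guid):
            if g not in self.kinds:
                raise KeyError(f"unknown GUID: {g}")
        self.links[guid].add((relation, other_guid))

    def related(self, guid, relation):
        """Return the GUIDs linked from `guid` by `relation`, sorted."""
        return sorted(o for r, o in self.links.get(guid, ()) if r == relation)


chain = Keychain()
# Placeholder strings stand in for real GUIDs.
chain.add_key("guid-specimen-1", "specimen")
chain.add_key("guid-locality-1", "locality")
chain.add_key("guid-citation-1", "citation")
chain.link("guid-specimen-1", "collected_at", "guid-locality-1")
chain.link("guid-specimen-1", "cited_in", "guid-citation-1")
print(chain.related("guid-specimen-1", "collected_at"))
```

Note that the keychain never inspects the format of a key. Any string works, which is the sense in which the competing standard matters less than uniqueness.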

So, what does an implementation of GUIDs, minted by a collections database, passed to data consumers, and resolvable on the web, look like?

GUID created and maintained in collections database (SSoR).

Because the GUID is indexed with Google, it is resolvable by Google - or another search engine.

If public, the user sees up-to-date information from the source system of record - the collections database.

With this digital framework in place, when the collections manager of this collection shares this data with a data aggregator like GBIF, VertNet, iDigBio, or any other data consumer, the GUID is passed along. When data aggregators PRESERVE that GUID, you begin to see how easily data can be tracked, updated, etc. If aggregators then indexed their data with search engines, you can imagine the Google results: in seconds, you could find every digital instance of a single record held by data aggregators, as well as the record within its home institution. You could more easily find duplicates of that record. The list of advantages goes on and on.
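
Here is a rough sketch of why preservation matters, assuming each copy of a record arrives as a simple dictionary that still carries the source GUID. The field names, the placeholder GUIDs, and the holders are invented for illustration; real aggregator exports will look different.

```python
from collections import defaultdict


def instances_by_guid(records):
    """Group the holders of each record by the GUID they preserved."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec["guid"]].append(rec["held_by"])
    return dict(grouped)


# Hypothetical harvest: placeholder GUIDs, illustrative holders.
harvest = [
    {"guid": "guid-A", "held_by": "home collection"},
    {"guid": "guid-A", "held_by": "aggregator 1"},
    {"guid": "guid-B", "held_by": "aggregator 1"},
    {"guid": "guid-A", "held_by": "aggregator 2"},
]

for guid, holders in instances_by_guid(harvest).items():
    print(guid, holders)
```

If an aggregator replaced the GUID with its own identifier, its copy would fall into a separate group, and the link back to the home institution would be lost.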

Let's see what happens when we index alternate, legacy identifiers with Google. Can we resolve them?

This record shows three alternate identifiers linked to a still image occurrence record. Two of the alternate identifiers are links, one of which leads to a web page for this living specimen at the Morton Arboretum, the other to a photo vouchered herbarium specimen in the Morton Arboretum Herbarium. The third alternate identifier is the code found on the Morton Arboretum tree tag. All three are indexed in Google.

If you search Google for '306-58*3', you get hundreds of thousands of results. This value is far from globally unique. However, if you search on either of the URLs, your results are narrowed to the relevant page at the Morton and the relevant page in the collection database.
This is an example of linking derivatives: in this case, the living specimen record, the herbarium specimen, and a still image occurrence record in another, unconnected database. Linking on URLs is not very stable, since web addresses can change. But you get the idea. With GUIDs, this practice is cleaner. Indexing GUIDs and alternate legacy identifiers with search engines allows the data aggregators to become the springboards, provided they persist the GUIDs and the collections databases can resolve them. Part of the resolution process here was indexing those identifiers with Google.
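
The difference between a globally unique identifier and a legacy one shows up clearly in a toy lookup table. The sketch below stands in for the search engine index; the `IdentifierIndex` class, the placeholder GUIDs, the example.org URL, and the second record that happens to reuse the tag code are all hypothetical.

```python
from collections import defaultdict


class IdentifierIndex:
    """Maps any identifier (GUID or legacy) to the GUIDs of matching records."""

    def __init__(self):
        self._by_identifier = defaultdict(set)

    def register(self, guid, *alternate_identifiers):
        # A GUID always resolves to itself; alternates point back to it.
        self._by_identifier[guid].add(guid)
        for ident in alternate_identifiers:
            self._by_identifier[ident].add(guid)

    def resolve(self, identifier):
        """Return every GUID the identifier could refer to, sorted."""
        return sorted(self._by_identifier.get(identifier, ()))


index = IdentifierIndex()
index.register("guid-image-1", "306-58*3", "https://example.org/living/1")
index.register("guid-other-2", "306-58*3")  # an unrelated record reusing the same code

print(index.resolve("306-58*3"))                      # ambiguous: two candidates
print(index.resolve("https://example.org/living/1"))  # narrowed to one record
```

A legacy identifier that returns more than one GUID is still useful as a search term, but it needs a human or some additional context to pick the right record.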

In future blog posts, we will discuss the following questions and some suggestions for best practices:
  • How do we assign GUIDs to duplicates - as in botanical specimens collected from the same individual (or colony) at the same time by the same person, at the same place and sent to another collection?
  • Do we assign and distribute GUIDs for all of the DwC classes?
  • How do we link GUIDs to legacy identifiers for specimens?
  • Should we go back and somehow affix this GUID to every specimen?
  • Does the GUID represent the actual object or the digital record?
  • Can they be human readable?
