Thursday, May 15, 2014

ScioTR Available Now in the Win8 Store!


ScioTR in the Win8 Store



What would it take to "Go Paperless?"

Do you have labels or small forms from which you need to extract structured data into an Excel spreadsheet or CSV file?

ScioTR is a new touch-enabled Windows 8 app which integrates Optical Character Recognition (OCR), Consensus Strategy, and Machine Learning (ML) to provide an efficient workflow for digitizing images into custom data fields.

Download and Try it FREE from the Windows 8 Store!

Here is our latest walkthrough video, showing a number of the app's features, including how images are organized and how data is extracted from them:


Thursday, October 24, 2013

Calling all label images in need of digitization!

Last summer, we presented a talk and demonstration of ScioTR, a new Windows 8 app for digitizing images of documents (specifically images of natural history collections labels), at the SPNHC conference in Rapid City, South Dakota. You can watch a video very similar to the content presented in South Dakota on our YouTube channel. For a description of the data entry workflow in ScioTR, visit ScioTR.com.

I would like to take this opportunity to solicit images of collections labels for testing purposes during our final development cycle. We are looking for various types of labels, from various disciplines, including those that are handwritten, typewritten in different fonts, not straight, dirty, etc. Of course, in return, you will receive an Excel spreadsheet of the records in the format of your choice, complete with GUIDs. Here are the only requirements:
  1. Please limit the number of labels to the equivalent of no more than 200 specimens (some specimens may have more than one label/image). 
  2. Include a field list (or we can send the data back to you in Darwin Core)
  3. Images should have a resolution of 72 dpi or greater
Please email robin@scioqualis.com if you would like to participate.

Sunday, October 20, 2013

Storing Decimal Longitude and Latitude in your Collections Database

I spent a fair amount of time determining exactly which database data type to use for storing longitude and latitude in decimal degrees in ScioQualis. During a recent data migration, I had to describe the choice to someone, and it occurred to me that I should blog about how I determined the best way to store decimal lat and long in a database while maintaining precision and accuracy, particularly when the data coming into the database may have been stored in other data types.
According to the Biogeomancer Guide to Best Practices in Georeferencing (p. 27), longitude and latitude decimal degree values given to 0.00001 of a degree have a minimum uncertainty of 2 meters, no matter where you are on the globe. My Magellan GPS reports lat/long decimal values to one more decimal place (0.000001 of a degree), and my cell phone GPS goes to seven digits past the decimal point. I would guess that the cell phone is using cell phone towers to determine this value and the traditional GPS uses satellites. The level of accuracy of these is another (long) discussion.
Most databases have float and decimal data types. With decimal lat and long, I would suggest using a decimal data type, because it stores the digits exactly as entered while a float stores a binary approximation, but one could use a float with ample precision. Here is a good discussion of the difference in the context of long and lat. By combining decimal degrees stored in a decimal data type with a field that stores the maximum uncertainty, as calculated with a tool like the MaNIS Georeferencing Calculator, you can accurately capture both the precision and the accuracy of the measurement. This is the way we do it in ScioQualis.
For typical database decimal data types, one designates the precision and the scale of the decimal, like this: decimal(9,7).
In this case, 9 is the total number of digits stored (both left and right of the decimal point) and 7 is the number of digits stored to the right of the decimal point, which leaves 2 digits to the left. So, because decimal latitude can range from -90 to 90 (two digits to the left) and decimal longitude can range from -180 to 180 (three digits to the left), you would have the following data types:

Latitude: decimal(9,7)
Longitude: decimal(10,7)
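To see how incoming values would fit those two column types, here is a minimal Python sketch using the standard decimal module. The function name to_db_coordinate and the sample coordinates are made up for illustration, and this is not the ScioQualis import code. It rounds a value to seven decimal places and rejects anything outside the legal range, which is roughly what a decimal(9,7) or decimal(10,7) column would enforce.

```python
from decimal import Decimal, ROUND_HALF_EVEN

# Seven digits to the right of the decimal point, matching a scale of 7.
SCALE = Decimal("0.0000001")


def to_db_coordinate(value, limit):
    """Return `value` as a Decimal with exactly seven decimal places.

    `limit` is 90 for latitude or 180 for longitude (hypothetical helper).
    Going through str() avoids carrying binary floating-point noise
    into the Decimal.
    """
    coordinate = Decimal(str(value)).quantize(SCALE, rounding=ROUND_HALF_EVEN)
    if abs(coordinate) > limit:
        raise ValueError(f"{value} is outside the range -{limit} to {limit}")
    return coordinate


# Made-up sample values, not taken from a real record.
latitude = to_db_coordinate("41.8161", 90)         # fits decimal(9,7)
longitude = to_db_coordinate("-88.0692375", 180)   # fits decimal(10,7)
print(latitude, longitude)
```

Values that arrive as floats from another system can be passed in the same way. Rounding to seven places trims the binary noise, but it cannot restore precision that the source never recorded, which is why the separate uncertainty field still matters.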


Saturday, July 6, 2013

Maintaining Sticky Data


As far as we can tell, Rod Page coined the term 'Sticky Data' as it applies to biocollections. He brilliantly writes: "Shared identifiers are like the hooks on the burrs, if two pieces of data have the same identifier they will stick together. Given enough identifiers and enough data, then we could rapidly assemble a "ball" of interconnected data."

Liquidambar styraciflua-seeds

The BiSciCol (Biological Science Collections) group is interested in tracking scientific collections and their derivatives. In a recent blog post entitled "BiSciCol, Triples and Darwin Core", they identify the eight relationship triples that relate the six Darwin Core classes.

In theory, we know this linking is important to the usefulness of natural history collections and we know where the linking should occur within collections records. Darwin Core also gives us clues for how linking should occur between 2 or more occurrence records, publications, images, measurements/facts, taxon concepts, and a host of other types of data.

In practice, WHERE do you implement and maintain sticky, highly linked collections data? The most effective place is in the System of Record (SOR), with the collections themselves, so that those links can flow downstream to data aggregators like GBIF. Read more about that in the previous blog post, Resolving Identifiers for Natural History Collections.

In ScioQualis, such links are made via database fields of data type GUID, linked by primary/foreign keys. Here is an example of what that might look like to a user linking three occurrence records to one another. In this case, the example is a fairy pin fungus growing on a bracket fungus growing on a fallen oak branch.
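For readers who think in tables rather than screens, the same idea can be sketched with Python's built-in sqlite3 and uuid modules. The table and column names below (occurrence, occurrence_link, and so on) are invented for this sketch and are not the ScioQualis schema. SQLite has no native GUID type, so the GUIDs are stored as text here.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default
conn.executescript("""
    CREATE TABLE occurrence (
        occurrence_guid TEXT PRIMARY KEY,
        label           TEXT NOT NULL
    );
    CREATE TABLE occurrence_link (
        subject_guid TEXT NOT NULL REFERENCES occurrence(occurrence_guid),
        relation     TEXT NOT NULL,
        object_guid  TEXT NOT NULL REFERENCES occurrence(occurrence_guid),
        PRIMARY KEY (subject_guid, relation, object_guid)
    );
""")


def add_occurrence(label):
    """Mint a GUID for a new occurrence record and store it."""
    guid = str(uuid.uuid4())
    conn.execute("INSERT INTO occurrence VALUES (?, ?)", (guid, label))
    return guid


oak = add_occurrence("fallen oak branch")
bracket = add_occurrence("bracket fungus")
pin = add_occurrence("fairy pin fungus")

# Each link row points at two occurrence rows through their GUIDs.
conn.execute("INSERT INTO occurrence_link VALUES (?, ?, ?)",
             (bracket, "growing on", oak))
conn.execute("INSERT INTO occurrence_link VALUES (?, ?, ?)",
             (pin, "growing on", bracket))

rows = conn.execute("""
    SELECT s.label, l.relation, o.label
    FROM occurrence_link AS l
    JOIN occurrence AS s ON s.occurrence_guid = l.subject_guid
    JOIN occurrence AS o ON o.occurrence_guid = l.object_guid
    ORDER BY s.label
""").fetchall()
for subject, relation, obj in rows:
    print(f"{subject} {relation} {obj}")
```

Because the link table only holds GUIDs, a link cannot point at a record that does not exist, and the labels can be edited freely without breaking any relationship.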

The following screen capture shows two online citations linked to an occurrence record.


The last image shows a list of associated taxa. Each is linked to a taxon record and selected from a drop down. 



The images above represent how a record looks in ScioQualis when the user is logged into the system and has at least read access to the record(s). The collection administrator may choose to make those same records available to the public. To see sample public views, click on one or more of the following links:


http://www.scioqualis.com/Resolve.aspx?guid=e26bfc03-f331-e211-9944-00155d472a06
The above link shows the public version of an occurrence with one associated occurrence and the substrate as an associated taxon. All have links to various pieces (some of which aren't completely built out at this moment).

http://www.scioqualis.com/Resolve.aspx?guid=67f8da2f-a112-e211-aeff-8ca98299dd30
This link shows the public version of an occurrence with five measurements and two associated citations (some of which aren't completely built out at this moment).
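Both links above follow the same pattern: a resolver page plus a guid query parameter. As a rough sketch of how a data consumer might build or take apart such a link in Python, the two helper functions below (resolver_url and guid_from_url, both hypothetical names) rely only on the URL pattern visible in those two links. Nothing here assumes anything about how the server itself works.

```python
import uuid
from urllib.parse import parse_qs, urlencode, urlparse

# Base address copied from the two public links above.
RESOLVER = "http://www.scioqualis.com/Resolve.aspx"


def resolver_url(guid):
    """Build a resolver link for a GUID (hypothetical helper)."""
    normalized = str(uuid.UUID(str(guid)))  # raises ValueError if malformed
    return f"{RESOLVER}?{urlencode({'guid': normalized})}"


def guid_from_url(url):
    """Pull the GUID back out of a resolver link (hypothetical helper)."""
    values = parse_qs(urlparse(url).query).get("guid", [])
    if len(values) != 1:
        raise ValueError("expected exactly one guid parameter")
    return uuid.UUID(values[0])


link = "http://www.scioqualis.com/Resolve.aspx?guid=e26bfc03-f331-e211-9944-00155d472a06"
guid = guid_from_url(link)
print(guid)
print(resolver_url(guid) == link)
```

The point of the round trip is that the GUID, not the web address, is the durable part. If the resolver ever moved, only the base address would need to change.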

My final thought here is that providing an infrastructure in which all of these data classes are linked and maintained is fairly complicated, but worth it. The whole thing is a web from which you can pluck any part and pull, making a new web.



Friday, July 5, 2013

Resolving Identifiers for Natural History Collections

The discussion of unique identifiers for natural history collections records and their corresponding specimens has gone on and on for decades, with the conversation heating up with every passing year. You can read more from BiSciCol and from SoYouThinkYouCanDigitize. This we know:
  • Must be globally unique
  • Best to mint/create at the Source System of Record (SSoR)
  • Should be passed to all 'downstream' data consumers and data aggregators
  • Data consumers and data aggregators need to preserve them
  • Should be easily resolvable back to the Source System of Record (SSoR)
  • Desirable to honor legacy identifiers, which may not be globally unique, but are already in use and cited in the existing body of published literature.


When unique identifiers are used correctly, it is less important whether they follow competing standard A or B, so long as they are unique. One way to envision this is to use a key and keychain metaphor. The GUIDs become virtual keys and the keychain is the apparatus used to associate one with another. Instead of each key belonging to a car, deadbolt lock, or locker, they might belong to one or more specimens, events, citations, or localities. Taking it one step further, ScioQualis lets users define relationships between different keys.

So, what does an implementation of GUIDs that are minted by a collections database, passed to data consumers, and resolvable on the web look like?

GUID created and maintained in collections database (SSoR).

Because the GUID is indexed with Google, it is resolvable by Google - or another search engine.

If public, the user sees up-to-date information from the source system of record - the collections database.

With this digital framework in place, when the collections manager of this collection shares this data with a data aggregator like GBIF, VertNet, iDigBio, or any other data consumer, the GUID is passed along. When data aggregators PRESERVE that GUID, you begin to see how easily data can be tracked, updated, etc. If aggregators then indexed their data with search engines, you can imagine the Google results: you could find all digital instances of a single record held by data aggregators, as well as the record within its home institution, in seconds. You could more easily find duplicates of that record. The list of advantages goes on and on.
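To make the benefit of preserved GUIDs concrete, here is a small Python sketch that gathers every copy of a record by its GUID. The record list is invented for illustration ("aggregator A" and "aggregator B" are placeholders, not real exports), and group_by_guid is a hypothetical helper. The two GUID values are borrowed from the sample links in the Maintaining Sticky Data post.

```python
import uuid
from collections import defaultdict

# Invented sample data: the same GUID as it might appear in three places.
records = [
    {"source": "home collection", "guid": "e26bfc03-f331-e211-9944-00155d472a06"},
    {"source": "aggregator A", "guid": "E26BFC03-F331-E211-9944-00155D472A06"},
    {"source": "aggregator B", "guid": "{e26bfc03-f331-e211-9944-00155d472a06}"},
    {"source": "aggregator A", "guid": "67f8da2f-a112-e211-aeff-8ca98299dd30"},
]


def group_by_guid(rows):
    """Map each GUID to the list of sources holding a copy of that record."""
    groups = defaultdict(list)
    for row in rows:
        # uuid.UUID accepts upper or lower case, with or without braces,
        # so differently formatted copies still land under one key.
        groups[uuid.UUID(row["guid"])].append(row["source"])
    return dict(groups)


copies = group_by_guid(records)
for guid, sources in copies.items():
    print(guid, sources)
```

If an aggregator replaced the GUID with its own identifier, that copy would fall out of the group entirely, which is exactly the tracking problem that preserving GUIDs avoids.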

Let's see what happens when we index alternate, legacy identifiers with Google. Can we resolve them?

This record shows three alternate identifiers linked to a still image occurrence record. Two of the alternate identifiers are links, one of which leads to a web page for this living specimen at the Morton Arboretum, the other to a photo vouchered herbarium specimen in the Morton Arboretum Herbarium. The third alternate identifier is the code found on the Morton Arboretum tree tag. All three are indexed in Google.

If you search Google for '306-58*3', you get hundreds of thousands of results. This value is far from globally unique. However, if you search for either of the URLs, your results are narrowed to the relevant page at the Morton and the relevant page in the collection database.
This is an example of linking derivatives, in this case, the living specimen record, herbarium specimen and a still image occurrence record in another, unconnected database. Linking on URLs is not very stable as web addresses can change, obviously. But you get the idea. With GUIDs, this practice is cleaner. Indexing GUIDs and alternate legacy identifiers with search engines allows the data aggregators to become the springboards, if they persist the GUIDs and the collections databases can resolve them. Part of the resolution process here was indexing those identifiers with Google.

In future blog posts, we will discuss the following questions and some suggestions for best practices:
  • How do we assign GUIDs to duplicates - as in botanical specimens collected from the same individual (or colony) at the same time by the same person, at the same place and sent to another collection?
  • Do we assign and distribute GUIDs for all of the DwC classes?
  • How do we link GUIDs to legacy identifiers for specimens?
  • Should we go back and somehow affix this GUID to every specimen?
  • Does the GUID represent the actual object or the digital record?
  • Can they be human readable?

Wednesday, July 3, 2013

Introduction

This blog will detail many of the features in ScioQualis.com, an online Natural History Collections (aka Biocollections) management database, and ScioTR, an integrative tool for the rapid digitization of Natural History Collections labels. It will become something of a living user's manual and forum, in one place. The intention here is not to sound like a commercial for these software products, but to explain certain features in depth, particularly in cases where wide adoption of a technical strategy could result in a positive outcome for the community as a whole.

ScioQualis.com was modeled after the Darwin Core standard, but adds a number of other useful fields. It uses GUIDs (Globally Unique IDentifiers) throughout to link pieces of information together. It was specifically designed to address many of the high-level needs (like those described in the NIBA Implementation Plan) that are specific to the accessibility and usefulness of Natural History Collections data, while also offering an easy management solution for small and large collections alike, as well as for collections in areas of the world in which access to technical resources for database development is scarce. Because ScioQualis is 100% online, it is accessible from anywhere there is an internet connection. For more information, please see the ScioQualis Training Page.

www.scioqualis.com

ScioTR is a Windows 8 'Modern' app, currently under development. In June of 2013, we presented the strategies and theories behind ScioTR’s comprehensive workflow process and later gave a live demo of the software in action at the SPNHC DemoCamp in South Dakota. A video covering content similar to that presented in the SPNHC demo is available on the ScioQualis YouTube Channel:


We welcome your comments and feedback - please contact us via the comments after blog posts and/or privately at robin (a) scioqualis.com.