Tuesday, July 24, 2012

Dear GBIF, please stop changing occurrenceIDs!

If we are ever going to link biodiversity data together we need to have some way of ensuring persistent links between digital records. This isn't going to happen unless people take persistent identifiers seriously.

I've been trying to link specimen codes in publications to GBIF, with some success, so imagine my horror when it started to fall apart. For example, I recently added this paper to BioStor:

A remarkable new asterophryine microhylid frog from the mountains of New Guinea. Memoirs of The Queensland Museum 37: 281-286 (1994) http://biostor.org/reference/105389

This paper describes a new frog (Asterophrys leucopus) from New Guinea, and BioStor has extracted the specimen code QM J58650 (where "QM" is the abbreviation for Queensland Museum), which, according to the local copy of GBIF data that I have, corresponds to http://data.gbif.org/occurrences/363089399/. Unfortunately, if you click on that link GBIF denies all knowledge (you get bounced to the search page). After a bit of digging I discovered that the specimen is now in GBIF as http://data.gbif.org/occurrences/478001337/. At some point GBIF updated its data and the old occurrenceID for QM J58650 (363089399) was deleted. Noooo!

Looking at the old record I have there is an additional identifier:
urn:catalog:QM: Herpetology:J58650

This is a URN, and it's (a) unresolvable and (b) invalid, as it contains a space. This is why URNs are useless. There's no expectation that they will be resolvable, hence there's no incentive to make sure they are correct. It's about as much use as writing software but not bothering to run it (because surely it will work, no?).
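To make point (b) concrete, here is a minimal sketch of a syntax check that would have caught the stray space. It assumes a simplified reading of the "urn:<NID>:<NSS>" shape from RFC 2141; the pattern is my approximation rather than a complete implementation of the grammar, and the function name is made up for illustration.

```python
import re

# Simplified approximation of the RFC 2141 shape "urn:<NID>:<NSS>".
# Assumption: this is a rough character-class check, not the full grammar
# (for example, it does not verify that "%" is followed by two hex digits).
URN_PATTERN = re.compile(
    r"urn:"                                # literal scheme prefix
    r"[A-Za-z0-9][A-Za-z0-9-]{0,31}:"      # namespace identifier (NID)
    r"[A-Za-z0-9()+,\-.:=@;$_!*'%/?#]+",   # namespace-specific string (NSS)
    re.IGNORECASE,
)


def looks_like_valid_urn(identifier: str) -> bool:
    """Return True if the whole string matches the simplified URN shape."""
    return URN_PATTERN.fullmatch(identifier) is not None


# The identifier from the old GBIF record fails because of the space.
print(looks_like_valid_urn("urn:catalog:QM: Herpetology:J58650"))
# The same identifier with the space removed passes the syntax check.
print(looks_like_valid_urn("urn:catalog:QM:Herpetology:J58650"))
```

Passing a check like this only says the string is well formed; it still says nothing about whether the identifier can be resolved.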

The GBIF record http://data.gbif.org/occurrences/478001337/ contains a UUID as an alternative identifier:
bc58ce6b-3cc3-459a-9f5b-4a70a026afbe

If you Google this you discover a record in the Atlas of Living Australia http://biocache.ala.org.au/occurrences/bc58ce6b-3cc3-459a-9f5b-4a70a026afbe, which also lists the URN from the now deleted GBIF record http://data.gbif.org/occurrences/363089399/.

I'm guessing that at some point the OZCAM data provided to GBIF was updated and instead of updating data for existing occurrenceIDs the old ones were deleted and new ones created (possibly because OZCAM switched from URNs to UUIDs as alternative identifiers). Whatever the reason, I will now need to get a new copy of GBIF occurrence data and repeat the linking process. Sigh.
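One way to make repeating the linking less painful is to rebuild the mapping from old to new occurrenceIDs by joining on something that hasn't changed, such as the institution code plus catalogue number. The following is only a sketch under stated assumptions: the record layout, the field names (occurrence_id, institution_code, catalog_number) and the function names are hypothetical, not GBIF's schema or API, and it assumes the specimen code itself is stable across data dumps.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical record layout: a plain dict with the three keys used below.
Record = Dict[str, str]


def specimen_key(record: Record) -> Tuple[str, str]:
    """Normalise (institution code, catalogue number) into a join key."""
    return (
        record["institution_code"].strip().upper(),
        record["catalog_number"].strip().upper(),
    )


def build_redirects(
    old_records: Iterable[Record], new_records: Iterable[Record]
) -> Dict[str, str]:
    """Map deleted occurrence IDs to their replacements via the specimen key."""
    new_by_key: Dict[Tuple[str, str], str] = {}
    ambiguous = set()
    for rec in new_records:
        key = specimen_key(rec)
        if key in new_by_key:
            # Two new records share a specimen code, so no safe redirect exists.
            ambiguous.add(key)
        new_by_key[key] = rec["occurrence_id"]

    redirects: Dict[str, str] = {}
    for rec in old_records:
        key = specimen_key(rec)
        if key in ambiguous:
            continue
        new_id = new_by_key.get(key)
        # Only record a redirect when the ID has actually changed.
        if new_id is not None and new_id != rec["occurrence_id"]:
            redirects[rec["occurrence_id"]] = new_id
    return redirects


# The frog from this post, using the two occurrence IDs mentioned above.
old = [{"occurrence_id": "363089399", "institution_code": "QM", "catalog_number": "J58650"}]
new = [{"occurrence_id": "478001337", "institution_code": "QM", "catalog_number": "J58650"}]
print(build_redirects(old, new))
```

Specimen codes that match more than one new record are skipped rather than guessed at, since a wrong redirect is arguably worse than a dead link.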

If we are ever going to deliver on the promise of linking biodiversity data together we need to take identifiers seriously. In the meantime, I need to think about mechanisms to handle links that disappear on a whim.