Friday, February 20, 2015

More examples of data duplication and loss in GBIF: Australian bats in bits

Quick notes on another example of data duplication in GBIF. I'm in the process of building a tool to map specimen codes to GBIF records, and came across the following example. Consider the specimen code "AM M.22320", which is the voucher for the sequence KJ532444 (GenBank gives the voucher as M22320, but the associated paper doi:10.1016/j.ympev.2014.03.009 clarifies that this specimen comes from the Australian Museum). Locating this specimen in GBIF I found a series of records that were identical except for the catalogNumbers, which looked like this: M.22320.001, M.22320.002, etc. What gives?
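As a rough illustration of the matching problem, here is a minimal Python sketch that reconciles the two spellings of the voucher ("AM M.22320" and "M22320") and tests whether a GBIF catalogNumber belongs to it. The regular expression and the function names `normalise_voucher` and `matches_voucher` are assumptions made for this example, based only on the codes quoted above; other collections will format their codes differently.

```python
import re

# Assumed shape of a voucher code: an optional institution acronym ("AM"),
# a single-letter collection prefix ("M"), an optional dot, then digits.
VOUCHER_RE = re.compile(
    r"^(?:(?P<inst>[A-Z]+)\s+)?(?P<prefix>[A-Z])\.?\s*(?P<number>\d+)$"
)


def normalise_voucher(code):
    """Return the code in 'M.22320' form, or None if it does not fit the pattern."""
    m = VOUCHER_RE.match(code.strip())
    if m is None:
        return None
    return f"{m.group('prefix')}.{m.group('number')}"


def matches_voucher(catalog_number, voucher):
    """True if catalog_number is the voucher itself or one of its dotted parts."""
    base = normalise_voucher(voucher)
    if base is None:
        return False
    return catalog_number == base or catalog_number.startswith(base + ".")


if __name__ == "__main__":
    for voucher in ("AM M.22320", "M22320"):
        print(voucher, "->", normalise_voucher(voucher))
    print(matches_voucher("M.22320.001", "M22320"))
```

Requiring the dot after the base code stops "M.22320" from matching an unrelated number such as "M.223201".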

Initially I thought this might be a simple case of data duplication (maybe the suffixes represent different versions of the same record?). Then I managed to locate the records on the Australian Museum web site:

  • M.22320.009 - Wet Preparation - Pteralopex taki Parnaby, 2002 - Solomon Islands, 5 km north of Patutiva Village, Marovo Lagoon, New Georgia Island , (8° 31' S , 157° 52' E), 25 Jun 1990 , Holotype
  • M.22320.008 - Skull Preparation - Pteralopex taki Parnaby, 2002 - Solomon Islands, 5 km north of Patutiva Village, Marovo Lagoon, New Georgia Island , (8° 31' S , 157° 52' E), 25 Jun 1990 , Holotype
  • M.22320.001 - Tissue sample - Pteralopex taki Parnaby, 2002 - Solomon Islands, 5 km north of Patutiva Village, Marovo Lagoon, New Georgia Island , (8° 31' S , 157° 52' E), 25 Jun 1990 , Holotype
  • M.22320.005 - Tissue sample - Pteralopex taki Parnaby, 2002 - Solomon Islands, 5 km north of Patutiva Village, Marovo Lagoon, New Georgia Island , (8° 31' S , 157° 52' E), 25 Jun 1990 , Holotype
  • M.22320.006 - Tissue sample - Pteralopex taki Parnaby, 2002 - Solomon Islands, 5 km north of Patutiva Village, Marovo Lagoon, New Georgia Island , (8° 31' S , 157° 52' E), 25 Jun 1990 , Holotype
  • M.22320.007 - Tissue sample - Pteralopex taki Parnaby, 2002 - Solomon Islands, 5 km north of Patutiva Village, Marovo Lagoon, New Georgia Island , (8° 31' S , 157° 52' E), 25 Jun 1990 , Holotype
  • M.22320.003 - Tissue sample - Pteralopex taki Parnaby, 2002 - Solomon Islands, 5 km north of Patutiva Village, Marovo Lagoon, New Georgia Island , (8° 31' S , 157° 52' E), 25 Jun 1990 , Holotype
  • M.22320.004 - Tissue sample - Pteralopex taki Parnaby, 2002 - Solomon Islands, 5 km north of Patutiva Village, Marovo Lagoon, New Georgia Island , (8° 31' S , 157° 52' E), 25 Jun 1990 , Holotype
  • M.22320.002 - Tissue sample - Pteralopex taki Parnaby, 2002 - Solomon Islands, 5 km north of Patutiva Village, Marovo Lagoon, New Georgia Island , (8° 31' S , 157° 52' E), 25 Jun 1990 , Holotype
  • M.22320.010 - Tissue sample - Pteralopex taki Parnaby, 2002 - Solomon Islands, 5 km north of Patutiva Village, Marovo Lagoon, New Georgia Island , (8° 31' S , 157° 52' E), 25 Jun 1990

Turns out we have 10 records for "M.22320", which include various preparations and tissue samples. The holotype specimen for Pteralopex taki (originally described in doi:10.1071/AM01145, see BioNames) has generated 10 different records, all of which have ended up in GBIF. Anyone using GBIF occurrence data and interpreting the number of occurrence records as a measure of how abundant an organism is at a given locality is clearly going to be misled by data like this.

One way to tackle this problem would be for GBIF (or the data provider) to cluster the records that represent the "same" specimen, so that GBIF doesn't end up duplicating the same information (in this case, 10-fold). The Australian Museum records don't seem to specify a direct link between the 10 records. I then located the same records in OZCAM, the data provider that feeds GBIF. Here is the OZCAM record for "M.22320.001": http://ozcam.ala.org.au/occurrence/223d1549-1322-419e-8af4-649a4b145064. OZCAM doesn't record whether the record is a skull, a wet preparation, or a tissue sample. That information has been lost, and hence doesn't make it as far as GBIF.
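The clustering idea can be sketched in a few lines of Python. This is only an illustration, not something GBIF or OZCAM does: it assumes that a trailing three-digit suffix (".001", ".002", ...) always denotes a part of one specimen, which is a guess from this single example, and the function names are made up for the sketch.

```python
import re
from collections import defaultdict

# Assumption: a catalogNumber ending in ".NNN" (three digits) is a part of
# the specimen named by everything before that suffix.
PART_SUFFIX = re.compile(r"^(?P<base>.+)\.(?P<part>\d{3})$")


def base_code(catalog_number):
    """Strip the part suffix, if any, to get the specimen-level code."""
    m = PART_SUFFIX.match(catalog_number)
    return m.group("base") if m else catalog_number


def cluster_by_specimen(catalog_numbers):
    """Group catalogNumbers that appear to be parts of the same specimen."""
    clusters = defaultdict(list)
    for cn in catalog_numbers:
        clusters[base_code(cn)].append(cn)
    return dict(clusters)


if __name__ == "__main__":
    numbers = [f"M.22320.{i:03d}" for i in range(1, 11)]
    for base, parts in cluster_by_specimen(numbers).items():
        print(base, len(parts))
```

Counting clusters rather than records would give one occurrence for the holotype instead of ten, provided the suffix convention really holds across the dataset.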

Note that OZCAM has resolvable identifiers for each specimen in the form of UUIDs that are appended to "http://ozcam.ala.org.au/occurrence/". The corresponding UUIDs are included in the Darwin Core dump that OZCAM makes available to GBIF. Here they are for the parts of M.22320:

"223d1549-1322-419e-8af4-649a4b145064","M.22320.001",...
"c40a7eea-6e04-4be6-8dcb-4473402e48c4","M.22320.002",...
"21fcaea1-c645-49d9-9753-dbd9dd2bc64a","M.22320.003",...
"34ffd935-9fb4-44a5-acb8-2cd4df5ade62","M.22320.004",...
"03635fb8-f9ac-4c4c-898b-859cd42f1e26","M.22320.005",...
"a1c4dd5a-dc03-45cc-8971-931c739df8b2","M.22320.006",...
"71c91030-405c-4390-8ec3-42a5478a2fd8","M.22320.007",...
"0f1a9326-34d0-4fb2-b89a-9856bd9082f0","M.22320.008",...
"86270ef7-07f6-4395-84c7-66d5d497cc01","M.22320.009",...

But when GBIF parses the dump it ignores these UUIDs, which means the GBIF user can't easily go to the OZCAM site (which has a bunch of other useful information, compare http://ozcam.ala.org.au/occurrence/223d1549-1322-419e-8af4-649a4b145064 with http://www.gbif.org/occurrence/774916561/verbatim ). It also means that GBIF has stripped out an identifier that we might make use of to unambiguously refer to each record (and, presumably, this UUID doesn't change between harvests of OZCAM data).
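To show how little work it would take to keep the link, here is a minimal Python sketch that turns rows like those above into resolvable OZCAM URLs. It assumes the UUID is the first column and the catalogNumber the second, as in the excerpt; the remaining columns are elided there, so the sketch ignores them. The function name `ozcam_urls` is hypothetical.

```python
import csv
import io

# Prefix given in the post; the UUID from the dump is appended to it.
OZCAM_BASE = "http://ozcam.ala.org.au/occurrence/"


def ozcam_urls(dump_text):
    """Map catalogNumber -> OZCAM URL for each row of a CSV dump.

    Assumes column 0 is the UUID and column 1 is the catalogNumber.
    """
    urls = {}
    for row in csv.reader(io.StringIO(dump_text)):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        uuid, catalog_number = row[0], row[1]
        urls[catalog_number] = OZCAM_BASE + uuid
    return urls


if __name__ == "__main__":
    sample = (
        '"223d1549-1322-419e-8af4-649a4b145064","M.22320.001"\n'
        '"c40a7eea-6e04-4be6-8dcb-4473402e48c4","M.22320.002"\n'
    )
    for catalog_number, url in ozcam_urls(sample).items():
        print(catalog_number, url)
```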

In summary, this is a bit of a mess: we have multiple records that are really just bits of the same specimen but which are not linked together by any data provider, and as the data is transmitted up the chain to GBIF, clues as to what is going on are stripped out. For a user like me who is trying to link the GenBank sequence to its voucher this is frustrating, and ultimately all rather avoidable if we took just a little more care in how we represent data about specimens, and how we treat that data as it gets transmitted between databases.

Thursday, February 19, 2015

Post publication review on PubPeer

PubPeer is a web site where people can discuss published articles, anonymously if they prefer. I finally got a chance to play with it a few days ago, and it was a fascinating experience. You simply type in the DOI or PMID for an article and see if anyone has said anything about that article. It also automatically pulls comments from PubMed Commons, for example the article Putting GenBank data on the map has a comment that was originally published as a guest post on this blog. PubPeer knows about this blog post via Altmetric, which is another nice feature. PubPeer also has browser extensions which, if you install one, automatically flag DOIs on web pages that have comments on PubPeer. Also nice.

So, I took PubPeer for a spin. While browsing GenBank and GBIF, as you do, I came across the following paper: "Conservation genetics of Australasian sailfin lizards: Flagship species threatened by coastal development and insufficient protected area coverage" doi:10.1016/j.biocon.2013.10.014. Some of the sequences from this paper, such as KF874877, are flagged as "UNVERIFIED". Puzzled by this, I raised the issue on PubPeer (see https://pubpeer.com/publications/D1090D7AF8178B1A10C4C45AC1006E ). A little further digging led to the suggestion that they were numts. After I raised the issue on Twitter, one of the authors (Cameron Siler) got in touch and reported that there had been an accidental deletion of a single nucleotide in an alignment. Cameron is updating the Dryad data (http://dx.doi.org/10.5061/dryad.1fs7c ) and GenBank sequences.

I like the idea that there is a place we can go to discuss the contents of a paper. It's not controlled by the journal, and you can either identify yourself or remain anonymous if you prefer. Not everyone is a fan of this mode of commentary, especially as it is possible for people to make all sorts of accusations while remaining anonymous. But it's a fascinating project, and well worth spending some time browsing around (what IS it with physicists?). For anyone interested in annotating data, it's also a nice example of one possible approach.