Thursday, December 12, 2013

Guest post: response to "Putting GenBank Data on the Map"

The following is a guest blog post by David Schindel and colleagues and is a response to the paper by Antonio Marques et al. in Science (doi:10.1126/science.341.6152.1341-a).

Marques, Maronna and Collins (1) rightly call on the biodiversity research community to include latitude/longitude data in database and published records of natural history specimens. However, they have overlooked an important signal that the community is moving in the right direction. The Consortium for the Barcode of Life (CBOL) developed a data standard for DNA barcoding (2) that was approved and implemented in 2005 by the International Nucleotide Sequence Database Collaboration (INSDC; GenBank, ENA and DDBJ) and revised in 2009. All data records that meet the requirements of the data standard include the reserved keyword 'BARCODE'. The required elements include: (a) information about the voucher specimen from which the DNA barcode sequence was derived (e.g., species name, unique identifier in a specimen repository, country/ocean of origin); (b) a sequence from an approved gene region with minimum length and quality; and (c) primer sequences and the forward and reverse trace files. Participants in the workshops that developed the data standard decided to include latitude and longitude as strongly recommended elements but not as strict requirements for two reasons. First, many voucher specimens from which BARCODE records are generated may have been collected before GPS devices were available. Second, barcoding projects such as the Barcode of Wildlife Project (4) are concentrating on rare and endangered species. Publishing the GPS coordinates of collecting localities would facilitate illegal collecting and trafficking that could contribute to biodiversity loss.

The BARCODE data standard is promoting precisely the trend toward georeferencing called for by Marques, Maronna and Collins. Table 1 shows that there are currently 346,994 BARCODE records in INSDC (3). Of these BARCODE records, 83% include latitude/longitude data. Although latitude and longitude are not required elements of the data standard, this level of georeferencing is much higher than for all records of the cytochrome c oxidase I gene (COI, the BARCODE region), 16S rRNA, and cytochrome b (cytb), another mitochondrial region that was used for species identification prior to the growth of barcoding. Data are also presented on the numbers and percentages of data records that include information on the voucher specimen from which the nucleotide sequence was obtained. In an increasing number of cases, these voucher specimen identifiers in INSDC are hyperlinked to the online specimen data records in museums, herbaria and other biorepositories. Table 2 provides these same data for the time interval used in the Marques et al. letter (1). These tables indicate the clear effect that the BARCODE data standard is having on the community’s willingness to provide more complete data documentation.

Table 1. Summary of metadata for GPS coordinates and voucher specimens associated with all data records.

Categories of data records | Total number of GenBank records | With Latitude/Longitude | With Voucher or Culture Collection Specimen IDs
BARCODE  | 347,349   | 286,975 (83%) | 347,077 (~100%)
All COI  | 751,955   | 365,949 (49%) | 531,428 (71%)
All 16S  | 4,876,284 | 461,030 (9%)  | 138,921 (3%)
All cytb | 239,796   | 7,776 (3%)    | 84,784 (35%)

Table 2. Summary of metadata for GPS coordinates and voucher specimens associated with data records submitted between 1 July 2011 and 15 June 2013.

Categories of data records | Total number of GenBank records | With Latitude/Longitude | With Voucher or Culture Collection Specimen IDs
BARCODE  | 160,615   | 132,192 (82%) | 160,615 (100%)
All COI  | 302,507   | 166,967 (55%) | 231,462 (77%)
All 16S  | 1,535,364 | 232,567 (15%) | 49,150 (3%)
All cytb | 74,631    | 2,920 (4%)    | 24,386 (33%)


The DNA barcoding community's data standard is demonstrating two positive trends: better documentation of specimens in natural history collections, and new connectivity between databases of species occurrences and DNA sequences. We believe that these trends will become standard practices in the coming years as more researchers, funders, publishers and reviewers acknowledge the value of, and begin to enforce compliance with, the BARCODE data standard and related minimum information standards for marker genes (5).

DAVID E. SCHINDEL1, MICHAEL TRIZNA1, SCOTT E. MILLER1, ROBERT HANNER2, PAUL D. N. HEBERT2, SCOTT FEDERHEN3, ILENE MIZRACHI3
  1. National Museum of Natural History, Smithsonian Institution, Washington, DC 20013–7012, USA.
  2. University of Guelph, Ontario, Canada
  3. National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA

References

  1. Marques, A. C., Maronna, M. M., & Collins, A. G. (2013). Putting GenBank Data on the Map. Science, 341(6152), 1341–1341. doi:10.1126/science.341.6152.1341-a
  2. Consortium for the Barcode of Life, http://www.barcodeoflife.org/sites/default/files/DWG_data_standards-Final.pdf (2009)
  3. Data in Tables 1 and 2 were drawn from GenBank (http://www.ncbi.nlm.nih.gov/genbank/) [data as of 1 October 2013]
  4. Barcode of Wildlife Project, http://www.barcodeofwildlife.org (2013)
  5. Yilmaz, P., Kottmann, R., Field, D., Knight, R., Cole, J. R., Amaral-Zettler, L., Gilbert, J. A., et al. (2011). Minimum information about a marker gene sequence (MIMARKS) and minimum information about any (x) sequence (MIxS) specifications. Nature Biotechnology, 29(5), 415–420. doi:10.1038/nbt.1823

Wednesday, December 04, 2013

Towards BioStor articles marked up using Journal Archiving Tag Set

A while ago I posted "BHL to PDF workflow", a sketch of a workflow to generate clean, searchable PDFs from Biodiversity Heritage Library (BHL) content:

[Figure: BHL to PDF workflow]
I've made some progress on putting this together, as well as expanded the goal somewhat. In fact, there are several goals:
  1. BioStor articles need to be archived somewhere. At the moment they live on my server, and metadata is also served by BHL (as the "parts" you see in a scanned volume). Long term maybe PubMed Central is a possibility (BHL essentially becomes a publisher). Imagine PubMed Central becoming the primary archival repository for biodiversity literature.
  2. BioStor articles could be more useful if the OCR text was cleaned up and marked up (e.g., highlighting taxon names, localities, extracting citations, etc.).
  3. If BioStor articles were marked up to the same extent as ZooKeys then we could use tools developed for ZooKeys (see Towards an interactive taxonomic article: displaying an article from ZooKeys) for a richer reading experience.
  4. Cleaned OCR text could also be used to generate searchable PDFs, which are still the most popular way for people to read articles (see Why do scientists tend to prefer PDF documents over HTML when reading scientific journals?). BioStor already generates PDFs, but these are simply made by wrapping page images in a PDF. Searchable PDFs would be much friendlier.

For BioStor articles to be archived in PubMed Central they would need to be marked up using the Journal Archiving and Interchange Tag Suite (JATS, formerly the NLM DTDs). This is the markup used by many publishers, and also the tag suite that TaxPub builds upon.

The idea of having BioStor marked up in JATS is appealing, but on the face of it impossible because all we have is page scans and some pretty ropey OCR. But because the NLM has also been heavily involved in scanning the historical literature they are used to dealing with scanned literature, and JATS can accommodate articles ranging from scans to fully marked up text. For example, take a look at the article "Microsporidian encephalitis of farmed Atlantic salmon (Salmo salar) in British Columbia" which is in PubMed Central (PMC1687123). PMC has basic metadata for the article, scans of the pages, and two images extracted from those pages. This is pretty much what BioStor already has (minus the extracted images).

With this in mind, I dusted off some old code, put it together and created an example of the first baby steps towards BioStor and JATS. The code is on GitHub, and there is a live example here.

[Figure: JATS example]
The example takes BioStor article 65706, converts the metadata to JATS, links in the page scans, and also extracts images from the page scans based on data in the ABBYY OCR files. I've also generated HTML from the DjVu files, and this HTML includes hOCR tags that embed information about the OCR text. This format can be edited by tools such as Jim Garrison's moz-hocr-edit (discussed in Correcting OCR using hOCR in Firefox). This HTML can be processed to output a PDF that includes the page scans but also has the OCR text as "hidden text" so the reader can search for phrases, or copy and paste the text (try the PDF for article 65706).
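To make the "hidden text" idea concrete, here is a minimal sketch of pulling words and their bounding boxes out of hOCR, which is the information needed to position invisible text over a page scan. It assumes word-level `ocrx_word` spans carrying a `bbox` in their `title` attribute, as in the hOCR convention; the HTML generated from the DjVu files may be marked up at a different level, and the class name `HocrWords` and the sample markup are purely illustrative.

```python
import re
from html.parser import HTMLParser


class HocrWords(HTMLParser):
    """Collect (word, bbox) pairs from hOCR markup (assumes ocrx_word spans)."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word span we are currently inside

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if "ocrx_word" in classes:
            match = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(n) for n in match.groups())

    def handle_data(self, data):
        # Attach the first non-blank text after a word span to that span's bbox.
        if self._bbox is not None and data.strip():
            self.words.append((data.strip(), self._bbox))
            self._bbox = None


# Hypothetical hOCR fragment for illustration only.
sample = (
    '<span class="ocr_line" title="bbox 10 20 300 45">'
    '<span class="ocrx_word" title="bbox 10 20 120 45">Salmo</span> '
    '<span class="ocrx_word" title="bbox 130 20 200 45">salar</span>'
    "</span>"
)

parser = HocrWords()
parser.feed(sample)
for word, bbox in parser.words:
    print(word, bbox)
```

Each bounding box is in page-image pixels, so a PDF generator would still need to scale it to the page size before placing the text.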

I've put the HTML (and all the XML and images) on GitHub, so one obvious model for working on an article is to put it into a GitHub repository, push any edits made to the repository, then push that to a web server that displays the articles.

There are still a lot of rough edges, and I think we can build nicer interfaces than moz-hocr-edit (e.g., using the "contenteditable" attribute in the HTML), although moz-hocr-edit has the nice feature of being able to save the edits straight back to the HTML file (saving edited HTML to disk is a non-trivial task in most web browsers). I also need to add the code for building the initial JATS file (currently this is hidden on the BioStor server). There are also issues about PDF quality. At the moment I output black and white PNGs, which look nice and clean but can mangle plates and photos. I need to tweak that aspect of the process.

One application of these tools would be to take a single journal and convert all the BioStor articles into JATS, then make it available for people to further clean and mark up as needed. There is an extraordinary amount of information locked away in this literature; it would be nice if we made better use of that treasure trove.

Thursday, November 21, 2013

GBIF, GitHub, and taxonomy (again)

Quick notes on yet another attempt to marry the task of editing a taxonomic classification with versioning it in GitHub.

The idea of dumping the whole GBIF classification into GitHub as a series of nested folders looks untenable. So, maybe there's another way to tackle the problem.

Let's imagine that we dump, say, the GBIF classification down to family-level as a series of nested folders (i.e., we recreate the classification on disk). For each family we then create a bunch of files and store them in that folder. For example, we could have the classification in Darwin Core Archive format (basically, delimited text). Let's also create a graph that corresponds to that classification, using a format for which we have tools available for visualising and editing.

For example, I've created a Graph Modelling Language (GML) file for the Pinnotheridae here. Using software such as yEd I can load this file, display it, and edit it. For example, below is a compact tree layout of the graph:

[Figure: compact tree layout of the Pinnotheridae classification]

This image is a bitmap; if you opened the GML file in yEd it would be interactive, and you could zoom in, alter the layout, edit the graph, etc.

Looking at the graph there are a few oddities, such as "orphan" genera that lack any species, and some names that appear very similar. For example, there is an orphan genus Glassella, and a similar genus Glassellia (note the "i") with a single species Glassellia costaricana. A little digging in BioNames shows that Glassellia is a misspelling of Glassella. The original description appears in:

E Campos, M K Wicksten (1997) A New Genus For The Central American Crab Pinnixa costaricana Wicksten, 1982 (Crustacea: Brachyura: Pinnotheridae). Proceedings of the Biological Society of Washington 110(1): 69–73. http://biostor.org/reference/81137
So, we have one genus that appears twice due to a typo. Furthermore, there are nodes in the graph for the taxa Glassellia costaricana and Pinnixa costaricana, but these are the same thing (the names are synonyms, albeit Glassellia costaricana has the genus misspelt). So, we could delete Pinnixa costaricana, delete the misspelling Glassellia, fix the misspelling in Glassellia costaricana, and move it to the correctly spelt Glassella. There are other problems with this classification, but let's leave them for the moment.

Now, imagine that after editing I use the graph to regenerate the DWCA file, which now has the edited classification. I then commit the changes to GitHub, and anyone else (including GBIF) could grab the DWCA and, for example, replace their Pinnotheridae classification with the edited version.

We could also go further, and add what I think is a missing component of the GBIF classification, namely a link to the nomenclators. For example, in an ideal world we would have each name in the classification linked to a stable identifier for the name provided by a nomenclator, and that nomenclator would know, for example, that Pinnixa costaricana and Glassella costaricana were objective synonyms. If we had those links then we could automatically detect cases such as this where logically you can have either Pinnixa costaricana or Glassella costaricana in the same classification, but not both.
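As a minimal sketch of that check, suppose a nomenclator could hand us sets of objective synonyms (here just a hard-coded list, which is an assumption, as are the function name `conflicting_synonyms` and the placeholder third name). Any set with more than one member present in the classification is flagged:

```python
from itertools import combinations


def conflicting_synonyms(classification_names, objective_synonym_sets):
    """Return pairs of objective synonyms that both occur in one classification."""
    present = set(classification_names)
    conflicts = []
    for synonym_set in objective_synonym_sets:
        found = sorted(present & set(synonym_set))
        # Every pair drawn from the same synonym set is a logical conflict.
        conflicts.extend(combinations(found, 2))
    return conflicts


# Names from the Pinnotheridae example, plus one hypothetical placeholder name.
names = ["Glassella costaricana", "Pinnixa costaricana", "Examplegenus examplespecies"]
# Hypothetical nomenclator output: one set of objective synonyms.
synonyms = [{"Pinnixa costaricana", "Glassella costaricana"}]

print(conflicting_synonyms(names, synonyms))
# prints [('Glassella costaricana', 'Pinnixa costaricana')]
```

In practice the names would be replaced by the nomenclator's identifiers, so that spelling variants do not hide a conflict.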

There are some wrinkles to figure out. For example, it would be nice to compute the difference between the original and edited graphs in terms of graph operations (not simply the difference as text files) so we could do things like list nodes that have been moved or deleted. I did some work on this a while back (Page, R. D., & Valiente, G. (2005). BMC Bioinformatics, 6(1), 208. doi:10.1186/1471-2105-6-208); something like that tool might do the trick.
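To illustrate what a diff "in terms of graph operations" could look like, here is a deliberately naive sketch. It is not the method of the paper cited above: it assumes every node keeps a stable identifier across edits (the integer ids below are hypothetical) and simply compares two `{node_id: (name, parent_id)}` dictionaries.

```python
def diff_classifications(original, edited):
    """Compare two classifications given as {node_id: (name, parent_id)}."""
    shared = sorted(set(original) & set(edited))
    return {
        "deleted": sorted(set(original) - set(edited)),
        "added": sorted(set(edited) - set(original)),
        # (node, old parent, new parent) for nodes whose parent changed
        "moved": [(n, original[n][1], edited[n][1])
                  for n in shared if original[n][1] != edited[n][1]],
        # (node, old name, new name) for nodes whose name changed
        "renamed": [(n, original[n][0], edited[n][0])
                    for n in shared if original[n][0] != edited[n][0]],
    }


# Hypothetical node ids; names follow the Pinnotheridae example above.
original = {
    1: ("Pinnotheridae", None),
    2: ("Glassella", 1),
    3: ("Glassellia", 1),
    4: ("Glassellia costaricana", 3),
    5: ("Pinnixa", 1),
    6: ("Pinnixa costaricana", 5),
}
edited = {
    1: ("Pinnotheridae", None),
    2: ("Glassella", 1),
    4: ("Glassella costaricana", 2),
    5: ("Pinnixa", 1),
}

for operation, nodes in diff_classifications(original, edited).items():
    print(operation, nodes)
```

Without stable identifiers the rename of Glassellia costaricana would show up as a deletion plus an addition, which is one reason a proper edit-script approach is more attractive.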

There is an element here of trying to coerce a problem into a form that existing tools can solve, but in a way that's what makes it attractive. If we can use things that already exist then we can move from talking about it to actually doing it.

Wednesday, November 20, 2013

Reaction to taxonomic reactionaries

There is a fairly scathing editorial in Nature [The new zoo. (2013). Nature, 503(7476), 311–312. doi:10.1038/503311b] that reacts to a recent paper by Dubois et al.:

Dubois, A., Crochet, P.-A., Dickinson, E. C., Nemésio, A., Aescht, E., Bauer, A. M., Blagoderov, V., et al. (2013). Nomenclatural and taxonomic problems related to the electronic publication of new nomina and nomenclatural acts in zoology, with brief comments on optical discs and on the situation in botany. Zootaxa, 3735(1), 1. doi:10.11646/zootaxa.3735.1.1

To quote the editorial:

...there might be more than a disinterested concern for scientific integrity at work here. A typical reader of the Zootaxa paper (not that there are typical readers of a 94-page work on the minutiae of nomenclature protocol) might reasonably conclude that the authors have axes to grind. Exhibits A–E: the high degree of autocitation in the Zootaxa paper; the admission that some of the authors were against the ICZN amendments; that they clearly feel that their opinions regarding the amendments have been disregarded; the ad hominem attacks on ‘wealthy’ publishers as opposed to straitened natural-history societies; and the use of emotive and occasionally intemperate language that one does not associate with the usually dry and legalistic tone of debate on this subject. (The online publisher BioMed Central, based in London, gets a particular pasting, to which it has responded; see http://blogs.biomedcentral.com/bmcblog/2013/11/15/the-devil-may-be-in-the-detail-but-the-longview-is-also-worth-a-look/.)

One of many recommendations made in the diatribe is that journals should routinely have on their review boards those expert in the business of nomenclature — in other words, a cadre of people who are, unlike ordinary mortals, qualified to interpret the mystic strictures of the code. A typical reader is again entitled to ask whom, apart from themselves, the authors think might be suitable candidates.

Ouch! But Dubois et al.'s paper pretty much deserves this reaction - it's a reactionary rant that is breathtaking in its lack of perspective. From the abstract:

As shown by several examples discussed here, an electronic document can be modified while keeping the same DOI and publication date, which is not compatible with the requirements of zoological nomenclature. Therefore, another system of registration of electronic documents as permanent and inalterable will have to be devised.

So, we have an identifier system for publications which currently has 63,793,212 registered DOIs (see CrossRef), includes key journals such as Zootaxa and ZooKeys, and which has tools to support versioning of papers (see CrossMark) but hey, let's have our own unique system. After all, zoological nomenclature is special, and our community has such a good track record of maintaining our own identifier system (LSIDs anyone?).

Now that the financial crisis faced by the ICZN has been averted by a bail-out from the National University of Singapore (for three years at least), maybe the guardians of scientific names can focus on providing tools and services of value to the broader scientific community (or, indeed, taxonomists). As it stands, the ICZN can say little about the majority of animal names. Much better to focus on that than trying to rail against the practices of modern publishing.

Monday, November 11, 2013

Names and nomenclators: just do it already!

Quick notes on taxonomic names (again). It's a continuing source of bafflement that the biodiversity community is making a dog's breakfast of names. It seems we are forever making it more complicated than it needs to be, forever minting new acronyms that pollute the landscape without actually contributing anything useful, and forever promising shiny new tools and services without ever actually delivering them. Meanwhile people and projects that build upon names are left to deal with a mess.

It seems to me that it would be nice if we had a single place to go to get definitive information on a name, and that place would give us a unique identifier that we could use in our own databases as a way to clean up and reconcile our data. For example, if we have a bibliographic database we can map citations to DOIs and then use those to identify the articles. If we have a list of journal names, we can map those to ISSNs and clean up our data. Likewise, if we have a classification such as GBIF or NCBI, we should be able to map the names in those classifications onto standard identifiers for taxonomic names.

The frustrating thing is we already have standard identifiers for taxonomic names. Since around 2005 we have been serving LSIDs for plant and animal names. We have Index Fungorum, IPNI, ION, and ZooBank, all serving LSIDs, all serving RDF, all using the same TDWG vocabulary.

The nomenclators vary in size and scope, but we have the three major groups of multicellular eukaryotes covered (circles proportional to number of names in each database):
[Figure: relative sizes of the nomenclators]

There is some duplication, both within nomenclators (IPNI and ION I'm looking at you) and between nomenclators (ION and ZooBank have the same scope, although ZooBank is dwarfed by ION, anyone care to explain why we have both...?). All four databases are actively growing, partly through direct registration of new taxonomic names.

So, we're basically done, right? Surely all we need to do is harvest the LSIDs for all these names, put them into a single triple store, and wrap some basic services around them? If the nomenclators provide a list of recent changes (e.g., as an RSS feed) then we could continuously update the store with new names. Then any database or classification could reconcile its names with those in the nomenclators. They could also then augment their own records by making use of additional data the nomenclators have, such as objective synonymies and links to original descriptions. In other words, we could have a model like this:
[Figure: model linking classifications to nomenclators]
Classifications represent a view of how taxa are related, the names associated with those taxa are stored in nomenclators. This means that classification databases like GBIF and NCBI are not in the business of managing names, they simply link to the nomenclators (in the same way that a bibliographic database can link to DOIs, ISSNs, and author ids such as ORCID and VIAF).

We have almost all of this infrastructure in place already. In one of the unsung triumphs of TDWG we have all the nomenclators serving data in the same format using the same technology. And yet we have singularly failed to do anything useful with this extraordinary resource! Instead we seem more interested in contributing more projects to the acronym soup of biodiversity informatics. All around us projects to assign and link identifiers for publications (CrossRef), data (DataCite), and people (ORCID) are taking off. The infrastructure for taxonomic names has been in place since 2005; we could be doing the same sort of things CrossRef, DataCite and ORCID are doing in their domains. Why aren't we?

Wednesday, November 06, 2013

ZooKeys, GBIF, and GitHub: fixing Darwin Core Archives part 2

Here's another example of a Darwin Core Archive that is "broken" such that GBIF is missing some information. GBIF data set A checklist to the wasps of Peru (Hymenoptera, Aculeata) comes from Pensoft, and corresponds to the paper:
Rasmussen, C., & Asenjo, A. (2009). A checklist to the wasps of Peru (Hymenoptera, Aculeata). ZooKeys, 15(0). doi:10.3897/zookeys.15.196

As with the previous example GBIF says there are 0 georeferenced records in this dataset. This is odd, because the ZooKeys page for this article lists three supplementary files, including KML files for Google Earth. I've used one to create the image below:

[Figure: Google Earth image of the KML data]

So, clearly there is georeferenced data here. Looking at the Darwin Core Archive (which I've put on GitHub), there are a bunch of issues with this data. The occurrence.txt file has decimal latitude and longitude values with a comma rather than a decimal point, the file has some character encoding issues, and the columns with latitude and longitude data are labelled as "verbatim" fields not "decimal" fields. All of this means GBIF lacks all the point data for this dataset (over 2000 records). If we fix these problems, we get a map like this:



This illustrates one problem with publishing data, namely the data is rarely checked in the same way a manuscript is. Peer-review of data is a phrase that always struck me as odd, because you only get to be able to evaluate a data set by using it. In other words, data almost demands post- rather than pre-publication review. It's only when people start trying to use the data that problems emerge.

At the same time, we could improve checking of data prior to publication. In the case of the Darwin Core Archives I've looked at so far, it would be easier to find the problems if we had a simple tool that could take a Darwin Core Archive, extract the information and display it in various ways. If, for example, we have georeferenced records but we don't get a map, we would immediately wonder why that was, and figure out what the problem was. At the moment it seems easy to send data to GBIF, thinking you are contributing important information, whereas in fact that information never makes it onto a GBIF map.
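Two of the problems described above (decimal commas, and coordinates sitting in "verbatim" rather than "decimal" columns) are easy to repair mechanically. The sketch below is one way to do it, under several assumptions: the file is tab-delimited, the columns are named with the Darwin Core terms `verbatimLatitude` and `verbatimLongitude`, and the sample row is invented for illustration. It does not attempt to fix the character encoding issues.

```python
import csv
import io


def parse_decimal_comma(value):
    """Convert a string such as '-12,0464' to a float; return None if not numeric."""
    try:
        return float(value.strip().replace(",", "."))
    except ValueError:
        return None


def fix_occurrences(text):
    """Copy usable verbatim coordinates into decimalLatitude/decimalLongitude."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    fixed = []
    for row in reader:
        lat = parse_decimal_comma(row.get("verbatimLatitude") or "")
        lon = parse_decimal_comma(row.get("verbatimLongitude") or "")
        # Only accept values that are plausible coordinates.
        if lat is not None and lon is not None and -90 <= lat <= 90 and -180 <= lon <= 180:
            row["decimalLatitude"] = lat
            row["decimalLongitude"] = lon
        fixed.append(row)
    return fixed


# Invented sample data: one record with decimal commas, one without coordinates.
sample = "id\tverbatimLatitude\tverbatimLongitude\n1\t-12,0464\t-77,0428\n2\t\t\n"

rows = fix_occurrences(sample)
georeferenced = [r for r in rows if "decimalLatitude" in r]
print(len(georeferenced), "of", len(rows), "records have usable coordinates")
```

Counting georeferenced records this way is also the kind of quick check a data publisher could run before sending an archive to GBIF.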

GBIF and GitHub: fixing broken Darwin Core Archives

Following on from Annotating and cleaning GBIF data: Darwin Core Archive, GitHub, ORCID, and DataCite here's a quick and dirty example of using GitHub to help clean up a Darwin Core Archive.

The dataset 3i - Cicadellinae Database has 2,152 species and 4,749 taxa, but GBIF says it has no georeferenced data. As a result, the map for this dataset looks like this:

[Figure: GBIF map for the 3i - Cicadellinae Database]


I downloaded the Darwin Core Archive and was puzzled because the occurrence.txt file contained in the archive has latitude and longitude pairs for some of the records. How come there is no map? After a bit of fussing I discovered that the meta.xml file that describes the data is broken. It lists a column which doesn't appear in the data file, so everything after that column gets shifted along and hence the column headings for latitude and longitude are out of alignment with the data.
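A mismatch of this kind can be caught with a very simple check: compare the number of columns that meta.xml declares with the number of columns actually present in a data row. The sketch below assumes the usual Darwin Core Archive layout (a `core` element in the `http://rs.tdwg.org/dwc/text/` namespace whose `id` and `field` children carry an `index` attribute); the sample metadata and data line are invented, and a count check only says that something is off, not which column is the culprit.

```python
import xml.etree.ElementTree as ET

DWC_TEXT_NS = "{http://rs.tdwg.org/dwc/text/}"


def declared_column_count(meta_xml):
    """Number of columns implied by the highest index declared for the core file."""
    root = ET.fromstring(meta_xml)
    core = root.find(DWC_TEXT_NS + "core")
    indexes = [int(el.get("index")) for el in core if el.get("index") is not None]
    return max(indexes) + 1


def check_alignment(meta_xml, first_data_line, delimiter="\t"):
    """Return (declared, actual, matches) for one line of the data file."""
    declared = declared_column_count(meta_xml)
    actual = len(first_data_line.rstrip("\n").split(delimiter))
    return declared, actual, declared == actual


# Invented example: meta.xml declares five columns, the data row has four.
sample_meta = """<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
    <files><location>occurrence.txt</location></files>
    <id index="0"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
    <field index="2" term="http://rs.tdwg.org/dwc/terms/locality"/>
    <field index="3" term="http://rs.tdwg.org/dwc/terms/decimalLatitude"/>
    <field index="4" term="http://rs.tdwg.org/dwc/terms/decimalLongitude"/>
  </core>
</archive>"""
sample_line = "1\tExample species\t10.5\t-84.2\n"

print(check_alignment(sample_meta, sample_line))
# prints (5, 4, False)
```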

So, I loaded the Darwin Core Archive into GitHub (you can see it here), then fixed the error, and then for fun extracted the latitude and longitude pairs as a GeoJSON file. GitHub can display this on a map:


Note that we now have a fairly extensive set of georeferenced data points for these insects, and this data hasn't made it onto a GBIF map because of a simple error in the metadata. I keep finding cases like this, which suggests that GBIF has more georeferenced data than it realises.
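For the GeoJSON step mentioned above, something along these lines would do. It is a sketch rather than the script actually used: the record keys (`lat`, `lon`, `name`) and the sample point are hypothetical. The one detail worth remembering is that GeoJSON puts longitude before latitude.

```python
import json


def to_geojson(records):
    """Build a GeoJSON FeatureCollection of points from latitude/longitude records."""
    features = []
    for rec in records:
        features.append({
            "type": "Feature",
            # GeoJSON coordinate order is [longitude, latitude].
            "geometry": {"type": "Point", "coordinates": [rec["lon"], rec["lat"]]},
            "properties": {"name": rec.get("name", "")},
        })
    return {"type": "FeatureCollection", "features": features}


# Hypothetical record for illustration.
records = [{"name": "Example species", "lat": 10.5, "lon": -84.2}]
print(json.dumps(to_geojson(records), indent=2))
```

Saving the output with a `.geojson` extension in a repository is enough for GitHub to render it as a map.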

Saturday, November 02, 2013

Catalogue of Life and LSIDs: a catalogue of fail

I have a love/hate relationship with the Catalogue of Life (CoL). On the one hand, it's an impressive achievement to have persuaded taxonomists to share names, and to bring those names together in one place. I suspect that Frank Bisby would feel that the social infrastructure he created is his lasting legacy. On the other hand, the social infrastructure is arguably more impressive than the informatics infrastructure; in particular, the Catalogue has consistently failed to support globally unique identifiers for its taxa.

If you visit the CoL web pages you will see Life Science Identifiers (LSIDs) for taxa, such as urn:lsid:catalogueoflife.org:taxon:d242422d-2dc5-11e0-98c6-2ce70255a436:col20130401 for the African elephant Loxodonta africana. The rationale for using LSIDs in CoL is explained in the following paper:
Jones, A. C., White, R. J., & Orme, E. R. (2011). Identifying and relating biological concepts in the Catalogue of Life. Journal of Biomedical Semantics, 2(1), 7. doi:10.1186/2041-1480-2-7
This paper describes the implementation in great detail, but this is all for nought as CoL LSIDs don't resolve. In fact, as far as I'm aware, CoL LSIDs have not resolved since 2009. Here is a major biodiversity informatics project that seems incapable of running a LSID service. These LSIDs are appearing in other projects (e.g., Darwin Core Archives harvested by GBIF), but they are non-functioning. Anyone using these LSIDs on the assumption that they are resolvable (or, indeed, that CoL cared enough about them to ensure they were resolvable) is sadly mistaken.

Jones et al. list some projects that use CoL LSIDs, including the Atlas of Living Australia (ALA). While I have seen CoL LSIDs used by ALA in the past, it now seems that they've abandoned them. Resolving a LSID such as urn:lsid:biodiversity.org.au:afd.name:433239 (Dromaius novaehollandiae) (using, say, the TDWG resolver) we see the following LSID: urn:lsid:biodiversity.org.au:col.name:6847559. This corresponds to the record for Dromaius novaehollandiae for the 2011 edition of the Catalogue of Life. ALA have constructed their own LSID using an internal identifier from CoL. This is the very situation working CoL LSIDs should have made unnecessary. As Jones et al. note:
Prior to the introduction of LSIDs, the CoL was criticized for using identifiers which changed from year to year [29]. The internal identifiers have never been intended to be used in other systems linking to the CoL, of course, but this criticism draws attention to the demand for persistent identifiers that are designed for use by other systems. The CoL still does not guarantee to maintain the same internal identifiers, because there appears to be no need to insist on this as a requirement, but it does now provide persistent globally unique, publicly available identifiers.
That would be fine if, in fact, the identifiers were persistent. But they aren't. Because CoL have been either unable or unwilling to support their own LSIDs, ALA has had to program around that by minting their own LSIDs for CoL content! Note that these ALA LSIDs are tied to a specific version of CoL. Record 6847559 exists in the 2011 edition (http://www.catalogueoflife.org/annual-checklist/2011/details/species/id/6847559) but not the latest (2013), where Dromaius novaehollandiae is now http://www.catalogueoflife.org/annual-checklist/2013/details/species/id/11908940.

Versioning LSIDs


One of the features of LSIDs that has caused the most heartache is versioning. Just because this feature is there doesn't mean it is necessary to use it, and yet some LSID providers insist on versioning every LSID. CoL is such an example, so with every release the LSID for every taxon changes. In my opinion, versioning is one of the most discussed and most over-rated features of any identifier. Most people, I suspect, don't want a version, they want the latest version. They want to be able to have links that will always get them to the current version. This is how Wikipedia works, this is how DOIs work (see CrossMark). In both cases you can see the existence of other versions, and go to them if needed. But by putting versions front and centre, and by not enabling the user to simply link to the latest version, CoL have made things more complicated than they need to be.

Changing LSIDs

It needs to be understood that in relation to concepts the Catalogue is intentionally not stable, so if a client is wishing to link to a name, not a concept, the client should use any LSID available for the name (or just the name itself), not a CoL-supplied taxon LSID. It should also be noted that it is intended that deprecated concepts will be accessible via their LSIDs in perpetuity, and the metadata retrieved will include information about the concepts’ relationships to relevant current concepts (such as inclusion, etc.). - Jones et al. p. 14
Leaving aside the fact that CoL clearly has a different notion of "perpetuity" to the rest of us, the notion that identifiers change when content changes is potentially problematic. If a taxonomic concept changes CoL will mint a new LSID. While I understand the logic, imagine if other databases did this. Imagine if the NCBI decided that because the African elephant was two species instead of one (see doi:10.1126/science.1059936), they should change the NCBI tax_id of Loxodonta africana (tax_id 9785, first used in 1993) because our notion of what "Loxodonta africana" meant has now changed. Imagine the chaos this could cause downstream to all the databases that build upon the NCBI taxonomy, which would now link to an identifier the NCBI had dropped. Instead, NCBI simply added a new identifier for Loxodonta cyclotis. Yes, this means the notion of "Loxodonta africana" may now be ambiguous (if it was sequenced before 2001, did the authors sequence Loxodonta africana or Loxodonta cyclotis?), but given the choice I suspect most could live with that ambiguity (as opposed to rebuilding databases).

But, even if we accept CoL's approach of changing LSIDs if the concept changes, surely concepts that don't change should always have the same LSID (except for changes in the version at the end)? Turns out, this is not always the case. For example, here are the CoL LSIDs for Loxodonta africana from 2008 to 2013:


urn:lsid:catalogueoflife.org:taxon:de5724e4-29c1-102b-9a4a-00304854f820:ac2008
urn:lsid:catalogueoflife.org:taxon:de5724e4-29c1-102b-9a4a-00304854f820:ac2009
urn:lsid:catalogueoflife.org:taxon:24f8e252-60a7-102d-be47-00304854f810:ac2010
urn:lsid:catalogueoflife.org:taxon:d242422d-2dc5-11e0-98c6-2ce70255a436:col20110201
urn:lsid:catalogueoflife.org:taxon:d242422d-2dc5-11e0-98c6-2ce70255a436:col20120124
urn:lsid:catalogueoflife.org:taxon:d242422d-2dc5-11e0-98c6-2ce70255a436:col20130401


The core part of the LSID (the UUID that follows "taxon:") has changed twice. But in each of these releases of CoL there have only been two species of Loxodonta, L. africana and L. cyclotis. How is the 2008 concept of Loxodonta africana different from the 2010 or the 2011 concept?
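The comparison is easy to automate. As a small sketch, the code below splits each LSID into its colon-separated parts (`urn:lsid:authority:namespace:object[:revision]`) and counts how often the object part changes across the six releases listed above; the function names are my own.

```python
def parse_lsid(lsid):
    """Split an LSID into authority, namespace, object id and optional revision."""
    parts = lsid.split(":")
    if len(parts) not in (5, 6) or [p.lower() for p in parts[:2]] != ["urn", "lsid"]:
        raise ValueError(f"not an LSID: {lsid!r}")
    return {
        "authority": parts[2],
        "namespace": parts[3],
        "object": parts[4],
        "revision": parts[5] if len(parts) == 6 else None,
    }


def distinct_objects(lsids):
    """Return the object identifiers in order of first appearance."""
    seen = []
    for lsid in lsids:
        obj = parse_lsid(lsid)["object"]
        if obj not in seen:
            seen.append(obj)
    return seen


# The six CoL LSIDs for Loxodonta africana listed above.
COL_LSIDS = [
    "urn:lsid:catalogueoflife.org:taxon:de5724e4-29c1-102b-9a4a-00304854f820:ac2008",
    "urn:lsid:catalogueoflife.org:taxon:de5724e4-29c1-102b-9a4a-00304854f820:ac2009",
    "urn:lsid:catalogueoflife.org:taxon:24f8e252-60a7-102d-be47-00304854f810:ac2010",
    "urn:lsid:catalogueoflife.org:taxon:d242422d-2dc5-11e0-98c6-2ce70255a436:col20110201",
    "urn:lsid:catalogueoflife.org:taxon:d242422d-2dc5-11e0-98c6-2ce70255a436:col20120124",
    "urn:lsid:catalogueoflife.org:taxon:d242422d-2dc5-11e0-98c6-2ce70255a436:col20130401",
]

changes = len(distinct_objects(COL_LSIDS)) - 1
print(changes)  # prints 2
```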

Summary


As we start to tackle issues such as data quality and annotation, having persistent, resolvable, globally unique identifiers will matter more than ever. Shared identifiers are the glue that helps us bind diverse data together. The tragedy of LSIDs is that they could have been this glue if our community had chosen to invest even a fraction of the effort CrossRef invested in DOIs. Unfortunately we are now left with web sites and databases littered with LSIDs that simply don't work (CoL is not the only offender in this regard).

Resolvable identifiers mean we can actually get information about the things identified, as well as serving as a litmus test of the credibility of a resource (if I give you a URL and the URL doesn't work, you may doubt the value of the information on the end of that link). In a networked world, the trustworthiness of a resource is closely bound to its ability to maintain identifiers. The Catalogue of Life fails this test.


Friday, November 01, 2013

Annotating and cleaning GBIF data: Darwin Core Archive, GitHub, ORCID, and DataCite

This is a quick sketch of a way to combine existing tools to help clean and annotate data in GBIF, particularly (but not exclusively) occurrence data.

GitHub


The data provider puts a Darwin Core Archive (expanded, not zipped) into a GitHub repository. GBIF forks the repository, cleans the data, and uploads that to GBIF to populate the database behind the portal.

DOI


When GBIF first loads the repository it assigns it a DOI (using, say, DataCite). Actually we assign two DOIs, one for this version of the data (e.g., 10.1234/data.v1) and one for all versions of the data, say 10.1234/data. The data is considered to be published; authorship is determined by the provider, which may be an individual, a project, an institution, etc.

Big scale annotation and cleaning


Anyone familiar with GitHub can fork the repository of data and do their own cleaning (e.g., fixing dates, latitudes and longitudes, links to taxon names, etc.).

Small scale, casual annotation


Anyone visiting the GBIF portal and noticing an error (or something that they want to comment on) does so on the portal. Behind the scenes these comments are stored as issues on the GBIF repository in GitHub. To do this GBIF can either (a) enable users with an existing GitHub account to link that to their GBIF user account, or (b) create a GitHub account for the user. The user need not actually interact directly with GitHub (a similar approach is described by Mark Holder for the social curation of phylogenetic studies).

This means all annotation, big or small, is in the open and on GitHub. There is very little programming to do: GBIF simply talks to GitHub using GitHub's API. GBIF could display known "issues" for a dataset, so portal users immediately know if any data has been flagged as problematic.
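To give a feel for how little code is involved, here is a hedged sketch of how a portal comment might be turned into a GitHub issue. It uses the "create an issue" endpoint of GitHub's REST API (POST /repos/{owner}/{repo}/issues). The repository name, token, label and comment text are placeholders invented for illustration, and the request is only built, not sent.

```python
import json
import urllib.request

GITHUB_API = "https://api.github.com"


def build_issue_request(owner, repo, title, body, token, labels=()):
    """Build (but do not send) a request that would open an issue on a repository."""
    url = f"{GITHUB_API}/repos/{owner}/{repo}/issues"
    payload = {"title": title, "body": body, "labels": list(labels)}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical repository and comment, standing in for a real portal submission.
request = build_issue_request(
    owner="example-org",
    repo="example-dataset",
    title="Latitude and longitude look swapped",
    body="Reported from the portal: the coordinates place this record in the sea.",
    token="PLACEHOLDER_TOKEN",
    labels=["annotation"],
)

print(request.get_method(), request.full_url)
# Sending it would be: urllib.request.urlopen(request)
```

The token would belong either to the user's linked GitHub account or to an account created on their behalf, as described above.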

All the annotations belong to the "community", in the sense that each annotation is linked to a GitHub user (even if the user might not ever actually go to GitHub). This also means that the provider can, at any point, pull in those annotations so they can update their own data (and hence gain direct benefit from exposing it in the first place).

Updating


When GBIF decides that enough annotations have been made and resolved, the latest version of the repository is loaded into GBIF and gets a new DOI (e.g., 10.1234/data.v2). This means an analysis based on that version is citable. We add a link to the overall DOI so someone who doesn't care about versions can still cite the data.
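The two-DOI scheme is easy to express in code. This sketch assumes the ".vN" suffix convention used in the examples in this post (10.1234/data.v1, 10.1234/data.v2); that convention is an illustration, not a DataCite rule, and actually registering the DOIs is out of scope here.

```python
import re


def next_version_doi(current_doi):
    """Return (overall_doi, next_versioned_doi) for a DOI ending in '.vN'."""
    match = re.fullmatch(r"(?P<base>.+)\.v(?P<number>\d+)", current_doi)
    if match is None:
        raise ValueError(f"DOI has no '.vN' version suffix: {current_doi}")
    base = match.group("base")
    next_number = int(match.group("number")) + 1
    return base, f"{base}.v{next_number}"


overall, versioned = next_version_doi("10.1234/data.v1")
print(overall)    # the DOI for all versions of the data
print(versioned)  # the DOI for the newly loaded version
```

An analysis cites the versioned DOI; anyone who doesn't care about versions cites the overall one.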

Authorship and credit


Now we come to the fun part. The revision will include the input from a bunch of people. This will be recorded on GitHub, but that will only mean something to the handful of geeks who think GitHub is awesome. But, let's imagine that we do the following:

  1. Anyone with a GBIF account can link that to their ORCID (if you are a researcher you really should have one of these).
  2. Anyone contributing to this version of the repository gets authorship (appended to the end of the list, so the original provider is first author).
  3. GBIF uses the ORCID API to automatically load the DOI of the new version of the dataset onto the list of works for each contributor. They instantly get credit as a co-author of a citable dataset, and this appears on their ORCID profile.
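The authorship rule in step 2 can be sketched in a few lines. This assumes contributors arrive in the order of their first contribution (for example, from the repository history) and that each person is represented by a single string; the names below are placeholders. Pushing the new DOI to each ORCID record is left out because it needs API credentials.

```python
def build_author_list(provider, contributors):
    """Original provider first, then each contributor once, in order of first contribution."""
    authors = [provider]
    for person in contributors:
        if person not in authors:
            authors.append(person)
    return authors


# Placeholder names; in practice these might be ORCID iDs linked to GBIF accounts.
authors = build_author_list(
    provider="Original Provider",
    contributors=["Annotator A", "Annotator B", "Annotator A", "Original Provider"],
)
print(authors)
```

Someone who contributes several fixes is listed once, and the provider stays first author even if they also annotate their own data.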

Benefits



This approach has a number of benefits:
  1. It creates citable data
  2. It gives credit in a way many people will recognise (authorship of a citable work that has a DOI)
  3. The annotations are freely available, there is a complete version history, anyone can contribute at whatever scale suits them.
  4. Anyone can grab the repo at any time and load it into their own system, including the original provider, who can see what people have added to their original data.
  5. There is virtually no programming to do, no new domain-specific protocols, everything is pretty much in place. GitHub does versioning, DataCite does citable identifiers, ORCID handles identity and credit.

Caveats



There are a couple of potential issues. Darwin Core Archive data files can be large, and GitHub can be less effective with large files (although it is ideally suited to the delimited-text files that Darwin Core Archive uses, see Git (and Github) for Data). One approach is to impose a limit on the size of an individual "occurrence.txt" file in the archive, so we may have multiple files, none of which is too big. Another task will be linking issues to specific occurrences (if they concern just one occurrence), because GitHub issues will be at the level of the complete file. This could be handled in a form-based interface on GBIF that sent the occurrenceID as part of the issue report.
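As a rough sketch of the size limit, the following splits a delimited occurrence file into chunks that each repeat the header row. It caps the number of data rows rather than bytes, and assumes one record per line (no line breaks inside fields); both are simplifying assumptions, and the sample rows are invented.

```python
def split_rows(lines, max_rows):
    """Yield chunks of lines, each starting with the header and holding at most max_rows records."""
    if max_rows < 1:
        raise ValueError("max_rows must be at least 1")
    header, *rows = lines
    for start in range(0, len(rows), max_rows):
        yield [header] + rows[start:start + max_rows]


# Invented sample standing in for the contents of an occurrence.txt file.
sample = [
    "occurrenceID\tscientificName",
    "occ-1\tLoxodonta africana",
    "occ-2\tLoxodonta cyclotis",
    "occ-3\tLoxodonta africana",
]

chunks = list(split_rows(sample, max_rows=2))
for number, chunk in enumerate(chunks, start=1):
    print(f"occurrence_{number}.txt", len(chunk) - 1, "records")
```

Each chunk would be written to its own file, and the archive's descriptor would need to reference the resulting files.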

Summary


The key point of this proposal is that everything is in place already to do this. The ducks are lining up, and serious, credible projects are handling the things we need (versioning, identifiers, credit). Sometimes the smart thing is to do nothing and wait until someone else solves the problems you face. I think the waiting may be over.

Tuesday, October 15, 2013

What can Global Biodiversity Information Facility (GBIF) do for you?

I've recently been appointed Chair of the Science Committee of the Global Biodiversity Information Facility (GBIF) http://www.gbif.org [1]. The committee is a small group of people with a range of backgrounds, and one of our roles is to advise GBIF on matters scientific (e.g., what kinds of data GBIF should collect?, what kinds of scientific questions should GBIF help answer?, etc.).

There have been formal surveys (see the papers in the journal "Biodiversity Informatics" https://journals.ku.edu/index.php/jbi/issue/view/370/showToc ), meetings, and a "vision" statement (the "Global Biodiversity Informatics Outlook", http://www.biodiversityinformatics.org/ ). But there's always the chance that these fora may miss some points of view, so I'm keen to get feedback on what sort of things GBIF could do to improve the way it can help people tackle the scientific questions they are interested in.

For example, is there some fundamental limitation that GBIF has that prevents it being useful to you? Is there some feature/data type/geographic coverage/etc. that could be addressed that would make it more useful? Is there a role that GBIF should take on that it hasn't so far? A useful analogy might be to think of the central role GenBank plays in genomics, as a place to archive your data (sequences), a repository of other people's data that you can access, and a research tool (e.g., BLAST searches to locate similar sequences). Is that the sort of thing you'd want from GBIF, or is it something entirely different?

I'd welcome any comments, suggestions, views, etc. Feel free to add them as comments to this blog, or email me (rdmpage at gmail.com).

I should stress that this is simply me trying to calibrate my perception of GBIF's role with what others think. Also, note if you have specific comments on things such as the GBIF web site please use the feedback tab on the site (that way it will reach the people who can do something about it).

[1] For those unfamiliar with GBIF, its mission "is to make the world's biodiversity data freely and openly available via the Internet". At present the bulk of the data are observations of organisms (mostly multicellular eukaryotes, i.e., animals, plants and fungi) based on either museum collections or observations of living organisms. You can get an idea of the kind of science that uses GBIF-hosted data from this list of papers on Mendeley http://www.mendeley.com/groups/1068301/gbif-public-library/

Updates


Based on responses so far I'll compile a list below of suggestions/themes.

Annotation

  • Have the ability to annotate records (e.g., flag errors) and some mechanism where those annotations get incorporated into GBIF and/or primary data providers.

Dashboard/gap analysis

  • For any search provide information on how complete and/or representative the data is likely to be (for example, are vertebrates over-represented, what is the extent of sampling in this area, etc.).

Geographic coverage

  • Fill big gaps in coverage (e.g., Russia, China, much of the tropics).

Genomics

  • Link GBIF occurrence records to sequences in GenBank

Provenance

  • Who identified specimen?
  • Details on georeferencing (esp. if not GPS)

Data types

  • DNA sequences
  • abundance

Data sources

  • GenBank
  • Literature records (e.g., data mining published papers)
    MEIER, R., & DIKOW, T. (2004). Significance of Specimen Databases from Taxonomic Revisions for Estimating and Mapping the Global Species Diversity of Invertebrates and Repatriating Reliable Specimen Data. Conservation Biology, 18(2), 478–488. doi:10.1111/j.1523-1739.2004.00233.x
  • "Gray" literature, e.g. field books, reports

Identifiers

  • Lack of stable identifiers for occurrences
  • Contributors of specimen data not (yet) in an institution have to mint their own identifiers, with no way of linking those to any future identifier minted by the institution that will eventually house their collection

Interface

  • Being able to refine taxon search by geographic region
  • Search on any Darwin Core field
  • Wild card search
  • Support for GIS data formats
  • Search using arbitrary bounding polygons (e.g., draw a shape on a map)

Timeliness



Tuesday, October 08, 2013

Which taxonomic journals should be digitised next?

One reason I was able to build BioNames is because a significant fraction of the taxonomic literature for animals is now online, either due to the efforts of the Biodiversity Heritage Library, digital archives, commercial publishers, or individual institutions and scientific societies. However there are still big gaps in literature availability. To get a sense of these gaps I've constructed a table listing all the journals in BioNames that have an ISSN, ordered by the number of articles in BioNames (i.e., mostly articles that publish new names). The full table is here, I've reproduced part of it below (limited to those journals with at least 500 articles in BioNames). If you click on the ISSN in the table you can go to the corresponding page in BioNames to get full details of what BioNames currently knows about that journal.

The journals in red are the ones with the worst online presence (see complete key below). Note that BioNames is still a work in progress so there will be some journals that are online but I've simply not had a chance to add them to BioNames. With that in mind, there are some striking gaps in the digital availability of taxonomic publications. Several Russian journals (collectively publishing thousands of articles) are not online (the story here is somewhat complicated because some Russian journals also have English-language translations available but these are mostly recent articles). A number of large entomological journals are not available (perhaps not surprising given that most described animal taxa are insects).

We can think of this as a "league table" of literature availability. My hope is that digitising projects such as the Biodiversity Heritage Library will look at this and use it to help prioritise which journals to scan. In particular, if the journal is not pre-1923 (and hence not already out of US copyright) I hope BHL will then contact the journal's publisher and see if they would be willing to add their journal to those (such as Proceedings of the Biological Society of Washington) that have opened up their complete back catalogue to being scanned by BHL.

I also hope that scientific societies or organisations that publish journals in the "red" or "orange" zones will consider digitising their journals and making their contents accessible to the wider community. We are reaching the point where if knowledge is not online then it effectively doesn't exist.


Key:
> 90%: Almost all are available
< 90%: Most are available
< 50%: Limited availability
< 10%: Mostly inaccessible

| ISSN | Journal | Articles | Digitised | % digitised |
|---|---|---|---|---|
| 1175-5326 | Zootaxa | 8581 | 8189 | 95 |
| 0374-5481 | The Annals and magazine of natural history | 4463 | 3502 | 78 |
| 1000-0739 | 880-01 Dong wu fen lei xue bao. Acta zootaxonomica Sinica | 3403 | 2450 | 72 |
| 0006-324X | Proceedings of the Biological Society of Washington | 3384 | 3263 | 96 |
| 0022-3360 | Journal of paleontology | 3373 | 3121 | 93 |
| 0037-928X | Bulletin de la Société entomologique de France | 3012 | 244 | 8 |
| 0013-8797 | Proceedings of the Entomological Society of Washington | 2972 | 2805 | 94 |
| 0044-5134 | Zoologicheskiĭ zhurnal | 2812 | 16 | 1 |
| 0044-5231 | Zoologischer Anzeiger | 2761 | 594 | 22 |
| 0022-3395 | The Journal of parasitology | 2353 | 2222 | 94 |
| 0008-347X | The Canadian entomologist | 2260 | 2059 | 91 |
| 0003-0082 | American Museum novitates | 1942 | 1814 | 93 |
| 0035-418X | Revue suisse de zoologie | 1851 | 1581 | 85 |
| 0022-2933 | Journal of natural history | 1848 | 1823 | 99 |
| 0367-1445 | Entomologicheskoe obozrenie | 1803 | 3 | 0 |
| 0096-3801 | Proceedings of the United States National Museum | 1722 | 1365 | 79 |
| 0013-872X | Entomological news | 1691 | 1619 | 96 |
| 0370-2774 | Proceedings of the Zoological Society of London | 1580 | 1008 | 64 |
| 1000-7482 | 880-01 Kun chong fen lei xue bao = Entomotaxonomia | 1518 | 1127 | 74 |
| 0037-9271 | Annales de la Société entomologique de France | 1497 | 757 | 51 |
| 0031-031X | 880-01 Paleontologicheskiĭ zhurnal | 1472 | 31 | 2 |
| 0013-8746 | Annals of the Entomological Society of America | 1441 | 1383 | 96 |
| 0035-1814 | Revue de zoologie et de botanique africaines | 1400 | 47 | 3 |
| 0031-0603 | The Pan-Pacific entomologist | 1389 | 56 | 4 |
| 0323-6145 | Berliner entomologische Zeitschrift / herausgegeben von dem Entomologischen Vereine in Berlin | 1342 | 710 | 53 |
| 1148-8425 | Bulletin du Muséum National d'Histoire Naturelle réunion mensuelle des naturalistes du Muséum | 1303 | 506 | 39 |
| 0013-8908 | The Entomologist's monthly magazine | 1268 | 6 | 0 |
| 0044-586X | Acarologia | 1226 | 87 | 7 |
| 0045-8511 | Copeia | 1191 | 1095 | 92 |
| 0031-0239 | Palaeontology | 1185 | 1154 | 97 |
| 0001-6616 | 880-03 Gu sheng wu xue bao = Acta palaeontologica Sinica | 1127 | 0 | 0 |
| 0165-5752 | Systematic parasitology | 1082 | 1028 | 95 |
| 0454-6296 | 880-01 Kun chong xue bao = Acta entomologica Sinica / Zhongguo kun chong xue hui bian ji | 1054 | 902 | 86 |
| 0024-0672 | Zoologische mededeelingen / uitgegeven vanwege 's Rijksmuseum van Natuurlijke Historie te Leiden | 1039 | 997 | 96 |
| 0370-047X | Proceedings of the Linnean Society of New South Wales | 1038 | 742 | 71 |
| 0030-5316 | Oriental insects | 1035 | 916 | 89 |
| 0028-7199 | Journal of the New York Entomological Society | 1013 | 860 | 85 |
| 0521-4726 | Annales historico-naturales Musei Nationalis Hungarici = Természettudományi Múzeum évkönyve | 1007 | 886 | 88 |
| 0070-7279 | Reichenbachia / Staatliches Museum für Tierkunde in Dresden | 951 | 2 | 0 |
| 0022-8567 | Journal of the Kansas Entomological Society | 945 | 906 | 96 |
| 0373-3491 | Bollettino della Società entomologica italiana | 940 | 14 | 1 |
| 0037-2102 | Senckenbergiana biologica | 939 | 11 | 1 |
| 0002-8320 | Transactions of the American Entomological Society | 923 | 796 | 86 |
| 0374-9797 | Nouvelle revue d'entomologie | 923 | 1 | 0 |
| 0774-2819 | Lambillionea | 918 | 0 | 0 |
| 0034-7108 | Revista Brasileira de biologia | 916 | 6 | 1 |
| 0007-1595 | Bulletin of the British Ornithologists' Club | 911 | 459 | 50 |
| 0013-8843 | Entomologische Zeitschrift | 881 | 4 | 0 |
| 0253-116X | Linzer biologische Beiträge / Oberösterreiches Landesmuseum | 876 | 503 | 57 |
| 0272-4634 | Journal of vertebrate paleontology | 869 | 864 | 99 |
| 1217-8837 | Acta zoologica Academiae Scientiarum Hungaricae | 868 | 134 | 15 |
| 0011-216X | Crustaceana | 865 | 865 | 100 |
| 0085-5626 | Revista brasileira de entomologia | 863 | 260 | 30 |
| 0365-4389 | Annali del Museo civico di storia naturale "Giacomo Doria." | 855 | 503 | 59 |
| 0097-3157 | Proceedings of the Academy of Natural Sciences of Philadelphia | 848 | 500 | 59 |
| 0010-065X | The Coleopterists' bulletin | 831 | 804 | 97 |
| 1313-2989 | ZooKeys | 827 | 827 | 100 |
| 0024-4082 | Zoological journal of the Linnean Society | 823 | 821 | 100 |
| 0008-4301 | Canadian journal of zoology | 817 | 803 | 98 |
| 0028-1344 | The Nautilus | 814 | 501 | 62 |
| 0040-7496 | Tijdschrift voor entomologie | 804 | 580 | 72 |
| 0375-0434 | Proceedings of the Royal Entomological Society of London. Series B, Taxonomy | 796 | 783 | 98 |
| 0033-2615 | Psyche | 796 | 709 | 89 |
| 0164-7954 | International journal of acarology | 787 | 786 | 100 |
| 0003-0090 | Bulletin of the American Museum of Natural History | 776 | 488 | 63 |
| 0037-962X | Bulletin de la Société zoologique de France | 765 | 228 | 30 |
| 0181-0863 | Revue française d'entomologie | 765 | 6 | 1 |
| 1562-0891 | Wiener Entomologische Zeitung | 752 | 573 | 76 |
| 1000-3118 | 880-01 Gu ji zhui dong wu xue bao | 743 | 4 | 1 |
| 0003-0023 | Transactions of the American Microscopical Society | 731 | 728 | 100 |
| 0075-6547 | Koleopterologische Rundschau / herausgegeben von der Zoologisch-Botanischen Gesellschaft gemeinsam mit der Forstlichen Bundesversuchsanstalt | 706 | 339 | 48 |
| 0286-9810 | 880-01 The entomological review of Japan = Konchūgaku hyōron | 704 | 98 | 14 |
| 0867-1710 | Genus | 690 | 2 | 0 |
| 0042-3580 | Venus : Japanese journal of malacology = Kairuigaku zasshi | 687 | 531 | 77 |
| 0067-1975 | Records of the Australian Museum | 679 | 629 | 93 |
| 0006-6982 | The Journal of the Bombay Natural History Society | 677 | 81 | 12 |
| 0320-9180 | Zoosystematica rossica | 676 | 6 | 1 |
| 0084-5604 | Vestnik zoologii / Akademii︠a︡ nauk Ukrainskoĭ SSR, Institut zoologii | 672 | 37 | 6 |
| 0387-5733 | Elytra | 666 | 108 | 16 |
| 0043-0439 | Journal of the Washington Academy of Sciences | 664 | 603 | 91 |
| 0003-4541 | Annales zoologici / Polska Akademia Nauk, Instytut Zoologiczny | 661 | 336 | 51 |
| 0016-6995 | Geobios | 659 | 475 | 72 |
| 0004-2110 | Arkiv för zoologi / utgivet af K. Svenska vetenskaps-akademien | 658 | 59 | 9 |
| 0035-8894 | Transactions of the Royal Entomological Society of London | 655 | 495 | 76 |
| 0915-5805 | Japanese journal of entomology | 645 | 620 | 96 |
| 0013-8878 | The Entomologist | 645 | 14 | 2 |
| 0031-1820 | Parasitology | 641 | 614 | 96 |
| 0007-4853 | Bulletin of entomological research | 633 | 611 | 97 |
| 0375-099X | Records of the Indian Museum a journal of Indian zoology ed. by the Director, Zoological Survey of India | 630 | 213 | 34 |
| 1326-6756 | Australian journal of entomology | 629 | 629 | 100 |
| 0018-8158 | Hydrobiologia | 627 | 627 | 100 |
| 0013-8770 | 880-02 Konchū = Kontyū | 625 | 616 | 99 |
| 0217-2445 | The Raffles bulletin of zoology | 622 | 571 | 92 |
| 0372-1426 | Transactions of the Royal Society of South Australia, Incorporated | 622 | 450 | 72 |
| 0079-8835 | Memoirs of the Queensland Museum | 620 | 373 | 60 |
| 0003-4150 | Annales de parasitologie humaine et comparée | 612 | 355 | 58 |
| 0018-0130 | Proceedings of the Helminthological Society of Washington | 604 | 588 | 97 |
| 0015-4040 | The Florida entomologist | 602 | 601 | 100 |
| 0077-7749 | Neues Jahrbuch für Geologie und Paläontologie. Abhandlungen | 602 | 146 | 24 |
| 1066-5234 | The journal of eukaryotic microbiology | 601 | 572 | 95 |
| 0031-0220 | Paläontologische Zeitschrift | 601 | 58 | 10 |
| 0567-7920 | Acta palaeontologica Polonica | 599 | 578 | 96 |
| 0032-3780 | Polskie pismo entomologiczne. Bulletin entomologique de Pologne | 590 | 28 | 5 |
| 0027-4100 | Bulletin of the Museum of Comparative Zoology at Harvard College | 581 | 444 | 76 |
| 0042-3211 | The Veliger | 578 | 274 | 47 |
| 0181-0626 | Bulletin du Muséum national d'histoire naturelle. Section A, Zoologie, biologie et écologie animales | 574 | 564 | 98 |
| 0068-547X | Proceedings of the California Academy of Sciences | 573 | 261 | 46 |
| 0035-6387 | Rivista di parassitologia | 566 | 2 | 0 |
| 0003-5092 | Annotationes zoologicae Japonenses / auspiciis Societatis Zoologicae Tokyonensis seriatim editae = Nihon dōbutsugaku ihō | 562 | 545 | 97 |
| 0036-7575 | Mitteilungen der Schweizerischen entomologischen Gesellschaft = Bulletin de la Société entomologique suisse | 562 | 3 | 1 |
| 0251-074X | Revue de zoologie africaine | 560 | 18 | 3 |
| 0373-9465 | Folia entomologica Hungarica = Rovartani közlemények | 555 | 6 | 1 |
| 0206-0477 | 880-01 Trudy Zoologicheskogo instituta = Travaux de l'Institut zoologique de l'Académie des sciences de l'URSS / Akademii︠a︡ nauk Soi︠u︡za Sovetskikh Sot︠s︡ialisticheskikh Respublik | 554 | 2 | 0 |
| 1445-5226 | Invertebrate systematics | 550 | 550 | 100 |
| 0026-2803 | Micropaleontology | 548 | 409 | 75 |
| 0307-6970 | Systematic entomology | 537 | 526 | 98 |
| 0020-1804 | Insecta matsumurana | 536 | 514 | 96 |
| 0278-0372 | Journal of crustacean biology : a quarterly of the Crustacean Society for the publication of research on any aspect of the biology of crustacea | 531 | 531 | 100 |
| 0165-0424 | Aquatic insects | 525 | 525 | 100 |
| 1051-8932 | Bulletin of the Brooklyn Entomological Society | 523 | 3 | 1 |
| 0013-8711 | Entomologica scandinavica | 522 | 513 | 98 |
| 0341-8391 | Spixiana | 515 | 465 | 90 |
| 0013-8789 | Journal of the Entomological Society of Southern Africa | 515 | 392 | 76 |
| 0018-0831 | Herpetologica | 514 | 472 | 92 |
| 0323-7087 | Zoologische Jahrbücher. Abteilung für Systematik, Geographie und Biologie der Tiere | 513 | 176 | 34 |
| 0007-4977 | Bulletin of marine science | 510 | 397 | 78 |
| 0250-4413 | Entomofauna | 500 | 387 | 77 |
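
For anyone wanting to reproduce the last column and the availability bands from the raw counts, here is a small sketch. The rounding rule and the treatment of values that fall exactly on a boundary (10%, 50%, 90%) are my assumptions; the key only gives the thresholds.

```python
def percent_digitised(articles, digitised):
    """Percentage of a journal's articles that are online, rounded to a whole number."""
    if articles <= 0:
        raise ValueError("articles must be positive")
    return round(100 * digitised / articles)


def availability_band(percent):
    """Map a percentage to the key used in the table (boundary handling is assumed)."""
    if percent > 90:
        return "Almost all are available"
    if percent >= 50:
        return "Most are available"
    if percent >= 10:
        return "Limited availability"
    return "Mostly inaccessible"


# Counts taken from three rows of the table: (journal, articles, digitised).
examples = [
    ("Zootaxa", 8581, 8189),
    ("Zoologischer Anzeiger", 2761, 594),
    ("Zoologicheskiĭ zhurnal", 2812, 16),
]
for journal, articles, digitised in examples:
    percent = percent_digitised(articles, digitised)
    print(journal, percent, availability_band(percent))
```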