Tuesday, October 21, 2014

On identifiers (again)

I'm going to the TDWG Identifier Workshop this weekend, so I thought I'd jot down a few notes. The biodiversity informatics community has been at this for a while, and we still haven't got identifiers sorted out.

From my perspective as both a data aggregator (e.g., BioNames) and a data provider (e.g., BioStor) there are four things I think we need to tackle in order to make significant progress.

Discoverability (strings to things)

A basic challenge is to go from strings, such as bibliographic citations, specimen codes, taxonomic names, etc., to digital identifiers for those things. Most of our data is not born digital, and so we spend a lot of time mapping strings to identifiers. For example, publishers do this a lot when they take the list of literature cited at the end of a manuscript and add DOIs. Hence, one of the first things CrossRef did was provide a discovery service for publishers. This has now morphed into a very slick search tool, http://search.crossref.org. Without discoverability, nobody is going to find the identifiers in the first place.
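
To make "strings to things" concrete, here is a minimal sketch of looking up a DOI for a free-text citation. It assumes the CrossRef REST API at https://api.crossref.org/works with its query.bibliographic and rows parameters, and a JSON response whose candidates sit under message.items with DOI and score fields. The function names and the min_score threshold are my own inventions for illustration, not part of any CrossRef client.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CROSSREF_WORKS = "https://api.crossref.org/works"


def build_query_url(citation, rows=3):
    """Build a CrossRef REST API query URL for a free-text citation string."""
    params = {"query.bibliographic": citation, "rows": rows}
    return CROSSREF_WORKS + "?" + urlencode(params)


def best_doi(response, min_score=0.0):
    """Return the DOI of the top-scoring candidate, or None if there is none.

    `response` is the decoded JSON body; candidates live in message.items.
    `min_score` is a hypothetical cut-off for rejecting weak matches.
    """
    items = response.get("message", {}).get("items", [])
    if not items:
        return None
    top = max(items, key=lambda item: item.get("score", 0.0))
    if top.get("score", 0.0) < min_score:
        return None
    return top.get("DOI")


def lookup(citation):
    """Query CrossRef over the network and return the best DOI, if any."""
    with urlopen(build_query_url(citation)) as handle:
        return best_doi(json.load(handle))
```

In practice the hard part is deciding when the top hit is actually the right one, which is why the sketch keeps the scoring decision separate from the network call.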


Resolvability

Given an identifier it has to be resolvable (for both people and machines), and I'd argue that at least in the early days of getting that identifier accepted, there needs to be a single point of resolution. Some people are arguing that we should separate identifiers from their resolution, partly based on arguments that "hey, we can always Google the identifier". This argument strikes me as wrong-headed for several reasons.

Firstly, Google is not a resolution service. There's no API, so it's not scalable. Secondly, if you Google an identifier (e.g., 10.7717/peerj.190) you get a bunch of hits. Which one is the definitive source of information on the thing with that identifier? It's not at all obvious, and indeed this is one of the reasons publishers adopted DOIs in the first place. If you Google a paper you can get all sorts of hits and all sorts of versions (preprint, manuscripts, PDFs on multiple servers, etc.). In contrast, the DOI gives you a way to access the definitive version.
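
By way of contrast, here is a small sketch of what resolving a DOI "for both people and machines" can look like. It assumes the https://doi.org/ form of the resolver and DOI content negotiation using the application/vnd.citationstyles.csl+json media type; the helper name doi_request is hypothetical, and the requests are only built here, not sent.

```python
from urllib.parse import quote
from urllib.request import Request

RESOLVER = "https://doi.org/"  # assumption: current form of the central DOI resolver


def doi_request(doi, accept=None):
    """Build (but do not send) an HTTP request that resolves a DOI.

    Without an Accept header the resolver redirects to a landing page for
    people; with a metadata media type it may return machine-readable
    metadata instead, if the registration agency supports it.
    """
    headers = {"Accept": accept} if accept else {}
    return Request(RESOLVER + quote(doi, safe="/"), headers=headers)


# The same identifier, asked for in two different ways.
for_people = doi_request("10.7717/peerj.190")
for_machines = doi_request(
    "10.7717/peerj.190", accept="application/vnd.citationstyles.csl+json"
)
```

Passing either request to urllib.request.urlopen would perform the actual resolution. The point is that there is one well-known place to ask, whichever kind of client is asking.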

Another way of thinking about this is in terms of trust. At some point down the road we might have tools that can assess the trustworthiness of a source, and we will need these if we develop decent tools to annotate data (see More on annotating biodiversity data: beyond sticky notes and wikis). But until then the simplest way to engender trust is to have a single point of resolution (like http://dx.doi.org for DOIs). Think about how people now trust DOIs. They've become a mark of respectability for journals (no DOIs, you're not a serious journal), and new ideas such as citing diagrams and data gained further credence once sites like figshare started using DOIs.

Another reason resolvability matters is that I think it's a litmus test of how serious we are. One reason LSIDs failed is that we made them too hard to resolve, and as a consequence people simply minted "fake" LSIDs, dumb strings that didn't resolve. Nobody complained (because, let's face it, nobody was using them), so LSIDs became devalued to the point of uselessness. Anybody can mint a string and call it an identifier; if it costs nothing, that's a good estimate of its actual value.


Persistence

Resolvability leads to persistence. Sometimes we hear the cliche that "persistence is a social matter, not a technological one". This is a vacuous platitude. The kind of technology adopted can have a big impact on the sociology.

The easiest form of identifier is a simple HTTP URL. But let's think about what happens when we use them. If I spend a lot of time mapping my data to somebody else's URLs (e.g., links to papers or specimens) I am taking a big risk in assuming that the provider of those URLs will keep those "live". At the same time, in linking to those URLs, I constrain the provider - if they decide that their URL scheme isn't particularly good and want to change it (or their institution decides to move to new servers or a new domain), they will break resources like mine that link to them. So a decision they made about their URL structure - perhaps late one Friday afternoon in one of those meetings where everybody just wants to go to the pub - will come back to haunt them.

One way to tackle this is indirection, which is the idea behind DOIs and PURLs, for example. Instead of directly linking to a provider URL, we link to an intermediate identifier. This means that I have some confidence that all my hard work won't be undone (I have seen whole journals disappear because somebody redesigned an institutional web site), and the provider can mess with different technologies for serving their content, secure in the knowledge that external parties won't be affected (because they link to the intermediate identifier). Programmers will recognise this as encapsulation.
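
The indirection idea can be sketched in a few lines. This is a toy, in-memory resolver with made-up identifiers and URLs, not how the DOI or PURL systems are actually implemented, but it shows why links survive a provider's redesign.

```python
class Resolver:
    """A toy central resolver: stable identifier -> current provider URL."""

    def __init__(self):
        self._targets = {}

    def register(self, identifier, url):
        """Mint an identifier, or repoint an existing one at a new URL."""
        self._targets[identifier] = url

    def resolve(self, identifier):
        """Return the current URL; raises KeyError for unknown identifiers."""
        return self._targets[identifier]


resolver = Resolver()
# Hypothetical identifier and URLs, for illustration only.
resolver.register("10.9999/example.1", "http://old.example.org/articles/1.html")

# An aggregator stores only the identifier, never the provider URL.
my_link = "10.9999/example.1"

# The provider moves to a new domain and updates the single central record.
resolver.register("10.9999/example.1", "https://new.example.org/a/1")

# The aggregator's stored link is unchanged, yet now leads to the new location.
current = resolver.resolve(my_link)
```

The aggregator's data never had to change, and the provider only had to update one record in one place.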

Some have argued that we can achieve persistence by simply insisting on it. For example, we fire off a memo to the IT folks saying "don't break these links!". Really? We have that degree of power over our institutional IT policies? This also misses the great opportunity that centralised indirection provides us with. In the case of DOIs for publications, CrossRef sits in the middle, managing the DOIs (in the sense that if a DOI breaks you have a single place to go and complain). Because they also aggregate all the bibliographic metadata, they are automatically able to support discoverability (they can easily map bibliographic metadata to DOIs). So by solving persistence we also solve discoverability.

Network effects

Lastly, if we are serious about this we need to think about how to engineer the widespread adoption of the identifier. In other words, I think we need network effects. When you join a social networking site, one of the first things they do is ask permission to see your "contacts" (who you already know). If any of those people are already on the network, you can instantly see that ("hey, Jane is here, and so is Bob"). Likewise, the network can target those you know who aren't on the network and prompt them to join.

If we are going to promote the use of identifiers, then it's no use thinking about simply adding identifiers to things. We need to think about ways to grow the network, ideally by adding networks at a time (like a person's list of contacts), not single records. CrossRef does this with articles: when publishers submit an article to CrossRef, they are encouraged to submit not just that article and its DOI, but the list of all references in the list of literature cited, identified where possible by DOIs. This means CrossRef is building a citation graph, so it can quickly demonstrate value to its members (through cited-by linking).
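
As a sketch of why depositing reference lists pays off, the snippet below inverts a set of "this article cites these identifiers" records into a cited-by index. The identifiers and the function name are hypothetical, and this is only the general idea, not CrossRef's implementation.

```python
from collections import defaultdict


def cited_by(reference_lists):
    """Invert {citing_id: [cited_id, ...]} into {cited_id: {citing_id, ...}}."""
    index = defaultdict(set)
    for citing, references in reference_lists.items():
        for cited in references:
            index[cited].add(citing)
    return dict(index)


# Hypothetical deposits: each article arrives with its list of cited identifiers.
deposits = {
    "doi:A": ["doi:X", "doi:Y"],
    "doi:B": ["doi:X"],
    "doi:C": ["doi:X", "doi:B"],
}

index = cited_by(deposits)
# Number of distinct citing articles per cited identifier.
counts = {cited: len(citing) for cited, citing in index.items()}
```

Each deposit adds a whole neighbourhood of links at once, which is the "adding networks at a time" effect. The same inversion would work for specimens cited in papers or used as sequence vouchers.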

So, we need to think of ways of demonstrating value, and growing the network of identifiers more rapidly than one identifier at a time. Otherwise, it is hard to see how it would gain critical mass. In the context of, say, specimens, I think an obvious way to do this is to have services that tell a natural history collection how many times its specimens have been cited in the primary literature, or have been used as vouchers for DNA sequences. We can then generate metrics of use (as well as start to trace the provenance of our data).


I've no idea what will come out of the TDWG Workshop, but my own view is that unless we tackle these issues, and have a clear sense of how they interrelate, then we won't make much progress. These things are intertwined, and locally optimal solutions ("hey, it's easy, I'll just slap a URL on everything") aren't enough ("OK, how exactly do I find your URL? What happens when it breaks?"). If we want to link stuff together as part of the infrastructure of biodiversity informatics, then we need to think strategically. The goal is not to solve the identifier problem; the goal is to build the biodiversity knowledge graph.

Thursday, October 02, 2014

BioStor and JournalMap: a geographic interface to articles from the Biodiversity Heritage Library

The recent jump from ~11000 to ~17000 articles in JournalMap is mostly due to JournalMap ingesting content from my BioStor database. BioStor extracts articles from the Biodiversity Heritage Library (BHL), and in turn these get fed back into BHL as "parts" (you can see these in the "Table of Contents" tab when viewing a scanned volume in BHL).

In addition to extracting articles, BioStor pulls out latitude and longitude pairs mentioned in the OCR text and creates little Google Maps for articles that have geotagged content. Working with Jason Karl (@jwkarl), JournalMap now talks to BioStor and grabs all its geotagged articles so that you can browse them in JournalMap. As a consequence, journals such as Proceedings of The Biological Society of Washington now appear on their map (this journal is the third most geotagged journal in JournalMap).
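
As a rough illustration of pulling latitude and longitude pairs out of text (this is not BioStor's actual code), the sketch below uses a deliberately simple regular expression for degree/minute/second coordinates followed by a hemisphere letter. The pattern, the function names, and the assumption that coordinates look like 6°48'S, 39°17'E are all mine; real OCR text is far messier.

```python
import re

# One coordinate: degrees, optional minutes, optional seconds, hemisphere letter.
COORD = r"(\d{1,3})\s*°\s*(?:(\d{1,2})\s*['′]\s*)?(?:(\d{1,2}(?:\.\d+)?)\s*[\"″]\s*)?([NSEW])"
# A pair: two coordinates separated by optional punctuation and whitespace.
PAIR = re.compile(COORD + r"\s*[,;]?\s*" + COORD)


def to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds to signed decimal degrees."""
    value = int(degrees) + int(minutes or 0) / 60 + float(seconds or 0) / 3600
    return -value if hemisphere in "SW" else value


def extract_points(text):
    """Return (latitude, longitude) pairs found in text, skipping implausible ones."""
    points = []
    for match in PAIR.finditer(text):
        g = match.groups()
        first = (to_decimal(*g[:4]), g[3])
        second = (to_decimal(*g[4:]), g[7])
        # Accept either order, but require one N/S value and one E/W value.
        lat = next((v for v, h in (first, second) if h in "NS"), None)
        lon = next((v for v, h in (first, second) if h in "EW"), None)
        if lat is None or lon is None:
            continue
        if abs(lat) <= 90 and abs(lon) <= 180:
            points.append((lat, lon))
    return points
```

The range check at the end is a cheap guard against OCR errors; it will not catch a plausible-looking but wrong coordinate, which still needs a human or a gazetteer.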

As an example of what you can do in JournalMap, here's a screenshot showing localities in Tanzania, and an article from BioStor being displayed:

JournalMap is an elegant interface to the biodiversity literature, and adding BioStor as a source is a nice example of how the Biodiversity Heritage Library's content is becoming more widely used. BioStor is only possible because BHL makes its content and metadata available for easy downloading. This is a lesson I wish other projects would learn. Instead of focussing on building flash-looking portals, make sure (a) you have lots of content, and (b) you make it easy for developers to get that content so they can do cool things with it. BHL does well in this regard; other projects, such as BHL-Europe, not so much.

Tuesday, September 23, 2014

Exploring the chameleon dataset: broken GBIF links and lack of georeferencing

Following on from the discussion of the African chameleon data, I've started to explore Angelique Hjarding's data in more detail. The data is available from figshare (doi:10.6084/m9.figshare.1141858), so I've grabbed a copy and put it on GitHub. Several things are immediately apparent.

  1. There is a lot of ungeoreferenced data. With a little work this could be geotagged and hence placed on a map.
  2. There are some errors with the georeferenced data (chameleons in South America or off the coast, a locality in Tanzania that is now in Ethiopia, etc.).
  3. Rather alarmingly, most of the URLs to GBIF records that Angelique gives in the dataset no longer resolve.

The last point is worrying, and reflects the fact that at present you can't trust GBIF occurrence URLs to be stable over time. Most of the specimens in Angelique's data are probably still in GBIF, but the GBIF occurrenceID (and hence URL) will have changed. This pretty much kills any notion of reproducibility, and it will require some fussing to be able to find the new URLs for these records.

That the GBIF occurrenceIDs are no longer valid also makes it very difficult to make use of any data cleaning I or anyone else attempts with this data. If I georeference some of the specimens, I can't simply tell GBIF that I've got improved data. Nor is it obvious how I would give this information to the original providers using, say, VertNet's GitHub repositories. All in all a mess, and a sad reflection on our inability to have persistent identifiers for occurrences.

To help explore the data I've created some GeoJSON files to get a sense of the distribution of the data. Here are the point localities; a few have clearly got issues.

I also drew some polygons around points for the same taxon, to get a sense of their distributions.

Taxa represented by fewer than three distinct localities are shown as place markers, the rest as polygons.
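
For anyone wanting to reproduce something like this, here is a minimal sketch of the rule just described: group localities by taxon, emit place markers when there are fewer than three distinct localities, and otherwise draw a polygon. I'm assuming a convex hull for "polygons around points" and an input of (taxon, latitude, longitude) tuples; the function names and sample records are hypothetical rather than the code used for the maps here.

```python
import json
from collections import defaultdict


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) < 3:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    def half(seq):
        chain = []
        for p in seq:
            while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
        return chain[:-1]

    return half(pts) + half(reversed(pts))


def features_for(records):
    """Build a GeoJSON FeatureCollection from (taxon, latitude, longitude) tuples."""
    by_taxon = defaultdict(set)
    for taxon, lat, lon in records:
        by_taxon[taxon].add((lon, lat))  # GeoJSON order is longitude, latitude

    features = []
    for taxon, localities in sorted(by_taxon.items()):
        hull = convex_hull(localities) if len(localities) >= 3 else []
        if len(hull) >= 3:
            ring = [list(p) for p in hull + hull[:1]]  # close the ring
            geometries = [{"type": "Polygon", "coordinates": [ring]}]
        else:
            # Too few (or collinear) localities: fall back to place markers.
            geometries = [
                {"type": "Point", "coordinates": list(p)} for p in sorted(localities)
            ]
        for geometry in geometries:
            features.append(
                {"type": "Feature", "properties": {"taxon": taxon}, "geometry": geometry}
            )
    return {"type": "FeatureCollection", "features": features}


# Hypothetical records (taxon, latitude, longitude), not taken from the dataset.
sample = [
    ("Taxon A", -3.0, 37.0), ("Taxon A", -3.5, 37.5), ("Taxon A", -4.0, 36.8),
    ("Taxon B", -6.8, 39.3), ("Taxon B", -6.8, 39.3),
]
geojson_text = json.dumps(features_for(sample), indent=2)
```

A convex hull will happily swallow a single badly georeferenced point, so an outlier in South America shows up as an absurdly large polygon, which is a handy way of spotting errors.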

I'll keep playing with this data as time allows, and try to get a sense of how hard it would be to go from what GBIF provides to what is actually going to be useful.

Monday, September 22, 2014

GBIF Science Committee Report slides #gb21

Just back from GB21, the GBIF Governing Board meeting (the first such meeting I've attended). It was in New Delhi, and this was also my first time in India, which is an amazing place. At some point I may blog about the experience: the heat, the sheer number of people, the juxtaposition of wealth and poverty, the traffic (chaotic in a wonderfully self-organising sort of way), seeing birds of prey wheel overhead around a hotel in a major city, followed by fruit bats skimming the trees in the evening, the joys of haggling with tuk-tuk drivers, and the wonder that is the Taj Mahal.

Lots to also think about regarding the meeting. A somewhat unsatisfactory conversation about licensing started on Twitter, so at some point I need to revisit this.

But for now, here are the slides from my summary of the GBIF Science Committee's activities. It discusses the forthcoming Ebbe Nielsen Challenge (details are still being worked on, so the slides are not the final word), the challenges of adding sequence data to GBIF, and the much-discussed case of the chameleons.

Thursday, August 28, 2014

BioNames database can be downloaded

My BioNames project has been going for over a year now, but I hadn't gotten around to providing bulk access to the data I've been collecting and cleaning. I've gone some way towards fixing this. You can now grab a snapshot of the BioNames database as a Darwin Core Archive here. This snapshot was generated on the 22nd August, so it is already a little out of date (BioNames is edited almost daily as I clean and annotate it when I should be doing other things).

The data dump doesn't capture all the information in BioNames, as I've tried to keep it simple, and Darwin Core is a bit of a pain to deal with. The actual database is in CouchDB, which is (mostly) an absolute joy to work with. I replicate the database to Cloudant, which means there's a copy "in the cloud". A number of my other CouchDB projects are also in Cloudant; in the case of Australian Faunal Directory and BOL DNA Barcode Map, the data is also served directly from Cloudant.

Monday, August 25, 2014

Geotagging stats for BioStor

Note to self for upcoming discussion with JournalMap.

As of Monday August 25th, BioStor has 106,617 articles comprising 1,484,050 BHL pages. From the full text for these articles, I have extracted 45,452 distinct localities (i.e., geotagged with latitude and longitude). 15,860 BHL pages in BioStor have at least one geotag; these pages belong to 5,675 BioStor articles.

In summary, BioStor has 5,675 full-text articles that are geotagged. The largest number of geotags for an article is 2,421, for Distribución geográfica de la fauna de anfibios del Uruguay (doi:10.5479/si.23317515.134.1).

The SQL for the queries is here.

Tuesday, August 19, 2014

Guest post: Response to the discussion on Red List assessments of East African chameleons

This is a guest post by Angelique Hjarding in response to discussion on this blog about the paper below.
Hjarding, A., Tolley, K. A., & Burgess, N. D. (2014, July 10). Red List assessments of East African chameleons: a case study of why we need experts. Oryx. Cambridge University Press (CUP). doi:10.1017/s0030605313001427
Thank you for highlighting our recent publication and for the very interesting comments. We wanted to take the opportunity to address some of the issues brought up in both your review and from reader comments.

One of the most important issues that has been raised is the sharing of cleaned and vetted datasets. It has been suggested that the datasets used in our study be uploaded to a repository that can be cited and shared. This is possible for data that was downloaded from GBIF as they have already done the legwork to obtain data sharing agreements with the contributing organizations. So as long as credit is properly given to the source of the data, publicly sharing data accessed through GBIF should be acceptable. At the time the manuscript was submitted for publication, we were unaware of sites such as http://figshare.com where the data could be stored and shared with no additional cost to the contributor. The dataset used in the study that used GBIF data has now been made available in this way.
Angelique Hjarding. (2014). Endemic Chameleons of Kenya and Tanzania. Figshare. doi:10.6084/m9.figshare.1141858

It starts to get tricky with doing the same for the expert vetted data. This dataset consists primarily of data gathered by the expert from museum records and published literature. So in this case it is not a question of why the expert doesn’t share. The question is why the museum data and any additional literature records are not on GBIF already. As has been pointed out in our analysis (and confirmed by Rod) most of these museums do not currently have data sharing agreements with GBIF. Therefore, the expert who compiled the data does not have the permission of the museums to share their data second hand. Bottom line, all of the data used in this study that was not accessed through GBIF is currently available from the sources directly. That is, for anyone who wants to take the time to contact the museums for permission to use their data for research and to compile it. We also do not believe there is blame on museums that have not yet shared their data with forums such as GBIF. Mobilisation of data is an enormous task, and near impossible if funds and staff are not available. With regards to the particular comment regarding the lack of data sharing by NHML and other museums, we need to recognise what the task at hand would mean, and rather address ways such a monumental, and valuable, collection could be mobilised. A further issue should be raised around literature records that are not necessarily encapsulated in museum collections, but are buried in old and obscure manuscripts. To our knowledge, there is no way to mobilise those records either, because they are not attached to a specimen. Further, because there are no specimens, extreme care must be taken if such records were to be mobilised in order to ensure quality control. Again, assistance of expert knowledge would be highly beneficial, yet these things take time and require funds.

Another issue that was raised is why didn’t we go directly to GBIF to fix the records? The point of our research was not to clean and update GBIF/museum data but to evaluate the effect of expert vetting and museum data mobilization in an applied conservation setting. As it has been pointed out, the lead author was working at GBIF during the course of the research. An effort was made to provide a checklist of the updated taxonomy to GBIF at the time, but there was no GBIF mechanism for providing updates. This appears to still be the case. In addition, two GBIF staff provided comments on the paper and were acknowledged for their input. We are happy to provide an updated taxonomy to help improve the data quality, should some submission tool for updates be made available.

Finally we would like to address the question, why use GBIF data if we know it needs some work before it can be used? We believe this is a very important debate for at least two reasons. First, when data is made public, we believe there are many researchers who work under the assumption that the data is ready for use with minimal further work. We believe they assume that the taxonomy is up to date; that the records are in the right place; and that the records provided relate to the name that is attached to those records. Many of the papers that have used GBIF data have undertaken broad scale macroecological analyses where, perhaps, the errors we have shown matter little. But some of these synthetic studies have also proposed that their results can be used for decision making by companies, which starts to raise concerns especially if the company wants to know the exact species that its activities could impact. As we have shown, for chameleons at least, such advice would be hard to provide using the raw GBIF data.

Second, we are aware that there is another group of researchers using GBIF data who "know that to use GBIF's data you need to do a certain amount of previous work and run some tests, and if the data does not pass the tests, you don't use it." We are not sure of the tests that are run, and it would be useful to have these spelled out for broader debate and potentially the development of some agreed protocols for data cleaning for various uses.

Our underlying reason for writing the paper was not to enter into debate of which data are best between GBIF and an expert compiled dataset. We are extremely pleased that GBIF data exist, and are freely available for the use of all. This certainly has to be part of the future of 'better data for better decisions', but we are concerned that we should not just accept that the data is the best we can get, but should instead look for ways to improve it, for all kinds of purposes. As such, we would like to suggest that the discussion focuses some energy on ways to address the shortcomings of the present system, but also that the community who would benefit from the data address ways to assist the dataholders to mobilise their information in terms of accessing the resources required to digitise and make data available, and maintain updated taxonomy for their holdings. In an era of declining funding for Museum-based taxonomy in many parts of the world this is certainly a challenge that needs to be addressed.

We welcome further discussion as this is a very important topic, not only for conservation but also in terms of improved access to biodiversity knowledge, which is critical for many reasons.

Angelique Hjarding http://orcid.org/0000-0002-9279-4893
Krystal Tolley
Neil Burgess