Tuesday, July 28, 2015

Modelling taxonomic names in databases

Quick notes on modelling taxonomic names in databases, as part of an ongoing discussion elsewhere about this topic.

Simple model

One model that is widely used (e.g., ITIS, WoRMS) and which is explicit in Darwin Core Archive is something like this:


We have a table for taxa and we don't distinguish between taxa and their names. the taxonomic hierarchy is represented by the parentID field, which points to your parent. If you don't have a (non NULL) value for parentID you are not an accepted taxon (i.e., you are a synonym), and the field acceptedID points to the accepted taxon. Simple, fits in a single database table (or, let's be honest, and Excel spreadsheet).

The tradeoff is that you conflate names and taxa, you can't easily describe name-only relationships (e.g., homonyms, nomenclatural synonyms) without inventing "taxa" for each name.

Separating names and taxa

The next model, which I've drawn rather clunky below as if you were doing this in a relational database, is based on the TDWG LSID vocabularies. One day someone will explain why the biodiversity informatics community basically ignored this work, despite the fact that all the key nomenclators use it.


In this model we separate out names as first-class objects with globally unique identifiers. The taxa table refers to the names table when it mentions a name. Any relationships between names are handled separately from taxa, so we can easily handle things like replacement names for homonyms, basionyms, etc. Not that we can also remove a lot of extraneous stuff from the taxa table. For example, if we decide that Poissonia heterantha is the accepted name for a taxon, we don't need to create taxa for Coursetia heterantha or Tephrosia heterantha, because by definition those names are synonyms of Poissonia heterantha.

The other great advantage of this model is that it enables us to take the work the nomenclators have done straight without having to first shoe-horn it into the Darwin Core format, which assumes that everything is a taxon.

Monday, July 27, 2015

The Biodiversity Data Journal is not machine readable

626ce1b38c2b42f77802e4e8c597820e 400x400In my (previous post ) I discussed the potential for the Biodiversity Data Journal (BDJ) to be a venue for nano (or near-nano publications). In this post I want to draw attention to what I think is a serious stumbling block, which is the lack of machine readable statements in the journal.

Given that the journal is probably the most progressive in the field (indeed, is suspect that there are few journals in any field as advanced in publishing technology as BDJ) this may seem an odd claim to make. The journal provides XML of its text, and typically provides data in Darwin Core Archive format, which is harvested by GBIF. The article XML is marked up to flag taxonomic names, localities, etc. Surely this is the very definition of machine readable?

The problem becomes apparent when you ask "what claims or assertions are the papers making?", and "how are those assertions reflected in the article XML and/or the Darwin Core Archive?".

For example, consider the following paper:

Gil-Santana, H., & Forero, D. (2015, June 16). Aristathlus imperatorius Bergroth, a newly recognized synonym of Reduvius iopterus Perty, with the new combination Aristathlus iopterus (Perty, 1834) (Hemiptera: Reduviidae: Harpactorinae) ​. BDJ. Pensoft Publishers. http://doi.org/10.3897/bdj.3.e5152

The title gives the key findings of this paper: Aristathlus imperatorius = Reduvius iopterus, and Reduvius iopterus = Aristathlus iopterus. Yet these statements are no where to be found in the Darwin Core Archive for the paper, which simply lists the name Aristathlus iopterus. The XML markup flags terms as names, but says nothing about the relationships between the names.

Here is another example:

Starkevich, P., & Podenas, S. (2014, December 30). New synonym of Tipula (Vestiplex) wahlgrenana Alexander, 1968 (Diptera: Tipulidae). BDJ. Pensoft Publishers. http://doi.org/10.3897/bdj.2.e4237
Indeed, I've yet to find an example of a paper in BDJ where a synonomy asserted in the text is reflected in the Dawrin Core Archive!

The issue here is that neither the XML markup nor the associated data files are capturing the semantics of the paper, in the sense of what the paper is actually saying. The XML and DwCA files capture (some) of the names, and localities mentioned, but not the (arguably) most crucial new pieces of information.

There is a disconnect between what the papers are saying (which a human reader can easily parse) and what the machine-readable files are saying, and this is worrying. Surely we should be ensuring that the Darwin Core Archives and/or XML markup are capturing the key facts and/or assertions made by the paper? Otherwise databases down stream will remain none the wiser about the new information the journal is publishing.

Nanopublications and annotation: a role for the Biodiversity Data Journal?

626ce1b38c2b42f77802e4e8c597820e 400x400

I stumbled across this intriguing paper:

Do, L., & Mobley, W. (2015, July 17). Single Figure Publications: Towards a novel alternative format for scholarly communication. F1000Research. F1000 Research, Ltd. http://doi.org/10.12688/f1000research.6742.1
The authors are arguing that there is scope for a unit of publication between a full-blown journal article (often not machine readable, but readable) and the nanopublication (a single, machine readable statement, not intended for people to read), namely the Single Figure Publications (SFP):

The SFP, consisting of a figure, the legend, the Material and Methods section, and an optional Results/Discussion section, reduces the unit of publication to a more tractable size. Importantly, it results in a markedly decreased time from data generation to publication. As such, SFPs represent a new means by which to communicate scientific research. As with the traditional journal article, the content of the SFPs is readily understandable by the scientist. Coupled with additional tools that aid in structuring content (e.g. describing in detail the methods using pre-defined steps from protocols), the SFP represents a “bottom-up” means by which scholars can structure the content of their findings in a modular and piece-wise fashion wedded to everyday laboratory life.
It seems to me that this is something that the Biodiversity Data Journal is potentially heading towards. Some of the papers in that journal are short, reporting say, new occurence records for a single species e.g.:

Ang, Y., Rohner, P., & Meier, R. (2015, June 26). Across the Baltic: a new record for an enigmatic black scavenger fly, Zuskamira inexpectata (Pont, 1987) (Sepsidae) in Finland. BDJ. Pensoft Publishers. http://doi.org/10.3897/bdj.3.e4308

Imagine if we have even shorter papers that are essentially a series of statements of fact, or assertions (linked to supporting evidence). These could potentially be papers that annotated and/or clarified data in an external database, such as GBIF. For example, let's imagine we find two names in GBIF that GBIF treats as being different taxa, but a recent publication asserts are actually synonyms. We could make that information machine readable (say, using Darwin Core Archive format), link it to the source(s) of the assertion (i.e., the DOI of the paper making the synonymy), then publish that as a paper. As the Darwin Core Archive is harvested by GBIF, GBIF then has access to that information, and when the next taxonomic indexing occurs it can make use of that information.

One reason for having these "micropublications" is that sometimes resolving an issue in a dataset can take some time. I've often found errors in databases and have ended up spending a couple of hours finding names, literature, etc. to figure out what is going on. As fun as that is, in a sense it's effort that is wasted if it's not made more widely available. But if I can wrap that couple of hours scholarship into a citable unit, publish it, and have it harvested and incorporated into, say, GBIF, then the whole exercise seems much more rewarding. I get credit for the work, and GBIF users get (hopefully) a tiny bit of improvement, and they can see the provenance of that improvement (i.e., it is evidence-based).

This seems like a simple mechanism for providing incentives for annotating databases. In some ways the Biodiversity Database Journal could be though of as doing this already, however as I'll discuss in the next blog post, there's an issue that is preventing it being as useful as it could be.

Thursday, July 23, 2015

Purposeful Games and the Biodiversity Heritage Library


These are some quick thoughts on the games on the BHL site, part of the Purposeful Gaming and BHL Project. As mentioned on Twitter, I had a quick play of the Beanstalk game and got bored pretty quickly. I should stress that I'm not a gamer (although my family includes at least one very serious gamer, and a lot of casual players). Personally, if I'm going to spend a large amount of time with a computer I want to be creating something, so gaming seems like a big time sink. Hence, I may not be the best person to review the BHL games. Anyhow...

It seems to me that there are a couple of ways games like this might work:

  1. You want to complete the game so you can do something you really want to do. This is the essence of solving CAPTCHAs, I'll solve your stupid puzzle so that I can buy my tickets.
  2. The game itself is engaging, and what you are asked to do is a natural part of the game's world. When you swipe candy, bits of candy may explode, or fall down (this is a good thing, apparently), or when you pull back a slingshot and release the bird, you get to break stuff).

The BHL games are trying to get you to do one activity (type in the text shown in a fragment of a BHL book) and this means, say, a tree grows bigger. To me this feels like a huge disconnect (cf. point 2 above), there is no connection between what I'm doing and the outcome.

Worse, BHL is an amazing corpus of text and images, and this is almost entirely hidden from me. If I see a cool looking word, or some old typeface, there's no way for me to dig deeper (what text did that come from?, what does that phrase mean?). I get no sense of where the words come from, or whether I'm doing anything actually useful. For things like ReCAPTCHA (where you helped OCR books) this doesn't matter because I don't care about the books, I want my tickets. But for BHL I do care (and BHL should want at least some of the players to care as well).

So, remembering that I'm not a gamer, here are some quick ideas for games.

Find that species

One reason BHL is so useful is it contains original taxonomic descriptions. Sometimes the OCR is too poor for the name to extracted from the description. Imagine a game where the player has a list of species (with cute pictures) and is told to go find them in the text. Imagine that we have a pretty good idea where they are (from bibliographic data we could, for example, know the page the name should occur on), the player hunts for the word on the page, and when they find it and mark it. BHL then gets corrected text and confirmation that the name occurs on that page. Players could select taxa (e.g., birds, turtles, mosses) that they like.

Find lat/longs

BHL text is full of lat/long pairs, often the OCR is not quite good enough to extract them. Imagine that we can process BHL to find things that look like lat/long pairs. Imsgine that we can read enough of the text to get a sense of where in the world the text refers to. Now, have a game where we pick a spot on a map and find things related to that spot. Say we get presented with OCR text that may refer to that locality, we fix it, and the map starts get populated. A bit like Yelp and Four Square, we could imagine badges for the most articles found about a place.

Find the letter/font

There are lots of cool symbols and fonts in BHL, someone might be interested collecting these. Simple things might be diphthongs such as æ. Older BHL texts are full of these, often misinterpreted. Other examples are male and female symbols. Perhaps we could have a game where we try and guess what symbol the OCR text actually matches - in other words, show the OCR text first, player tries to guess actual symbol, then the image appears, and then player types in actual symbol. Goal is to get good at predicting OCR errors.

Games like this would really benefit if the player could see (say, on the side) the complete text. Imagine that you correct a word, then you see it comes from a gorgeous plate of a bird. Imagine you could then correct any of the there words on that page.

Word eaters

Imagine the layer is presented with a page with text and, a bit like Minecraft's monsters, things appear which start to "eat" the words. You need to check as many words as possible before the text is eaten. Perhaps structure things in such a way that checked words form a barrier to the word-eating creatures and buy you some time, or like Minecraft, fixing a bad OCR word blasts a radius free of the word eaters. As an option (again, like Minecraft) turn off the eaters and just correct the words at your leisure.


Based on the UK game show, present a set of random letters (as images), player makes longest word they can, then check against dictionary, this tells you what letters they think the images represent.

Falling words

Have page fragments fall from the top of the screen, and have a key word displayed (say, "sternum", or enable player to type a word in) then display images of words whose OCR text resembles this (in other words, have a bunch of OCR text indexed using methods that allow for errors). As the word images fall, the player taps on an image that matches the word and they are collected. Maybe add features such as a timeline to show when the word was used (i.e., the date of the BHL text), give the meaning of the word, lightly chide players who enter words like "f**k" (that'd be me), etc.


Like comedy, I imagine that designing games is really, really hard. But the best games I've seen create a world that the player is immersed in and which makes sense within the rules of that world. Regardless of whether these ideas are any good, my concern is that the BHL games seem completely divorced from context, and the game play bears no relation to outcomes in the game.

Wednesday, July 22, 2015

Altmetrics, Disqus, GBIF, JSTOR, and annotating biodiversity data

JSTOR Logo RGB 60x76Browsing JSTOR's Global Plants database I was struck by the number of comments people have made on individual plant specimens. For example, for the Holotype of Scorodoxylum hartwegianum Nees (K000534285) there is a comment from Håkan Wittzell that the "Collection number should read 1269 according to Plantae Hartwegianae". In JSTOR the collection number is 1209.

Now, many (if not all) of these specimens will also be in GBIF. Indeed, K000534285 is in GBIF as http://www.gbif.org/occurrence/912442645, also with collection number 1209. A GBIF user will have no idea that there is some doubt about one item of metadata about this specimen.

So, an obvious thing to do would be to make the link between the JSTOR and GBIF records. Implementing this would need so fussing because (sigh) unlike DOIs for articles we don't have agreed upon identifiers for specimens. So we'd need to do some mapping between the specimen barcode K000534285, the JSTOR URL http://plants.jstor.org/stable/10.5555/al.ap.specimen.k000534285, and the GBIF record http://www.gbif.org/occurrence/912442645.

In addition to providing users with more information, it might also be useful in kickstarting annotation on the GBIF site. At the moment GBIF has no mechanism for annotating data, and if it did, then it would have to start from scratch. Imagine that a person visiting occurrence 912442645 sees that it has already attracted attention elsewhere (e.g., JSTOR). They might be encouraged to take part in that conversation (because at least one person cared enough to comment already). Likewise, we could feed annotations on the GBIF site to JSTOR.

A variation on this idea is to think of annotations such as those in the JSTOR database as being analogous to the tweets, blog posts, and bookmarking that altmetric tracks for academic papers. Imagine if we applied the same logic to GBIF and had a way to show users that a specimen has been commented on in JSTOR Plants? Thinking further down the track, we could image adding other sorts of "attention", such as citations by papers, vouchers for DNA sequences, etc.

It would be a fun project to see whether the Disqus API enabled us to create a tool that could match JSTOR Global Plants comments to GBIF occurrences.

Steve Baskauf on RDF and the "Rod Page Challenge"

Rdf w3c icon 128Steve Baskauf has concluded a thoughtful series of blog posts on RDF and biodiversity informatics with http://baskauf.blogspot.co.uk/2015/07/confessions-of-rdf-agnostic-part-7.html. In this post he discussed the "Rod Page Challenge", which was a series of grumpy posts I wrote (starting with this one) where I claimed RDF basically sucked, and to illustrate this I issued a challenge for people to do something interesting with some RDF I provided. Since this RDF didn't have a stable home I've put it on GitHub and it has a DOI http://dx.doi.org/10.5281/zenodo.20990 courtesy of GitHub's integration with Zenodo.

I argued that the RDF typically available was basically useless because it wasn't adequately linked (see Reflections on the TDWG RDF "Challenge"). Two of the RDF files I provided were created specifically created to tackle this problem (derived from my projects iPhylo Linkout http://dx.doi.org/10.1371/currents.RRN1228 and the precursor to BioNames http://dx.doi.org/10.7717/peerj.190). This marked pretty much the end of any interest I had in pursuing RDF.

Towards the end of Steve's post he writes:

At the close of my previous blog post, in addition to revisiting the Rod Page Challenge, I also promised to talk about what it would take to turn me from an RDF Agnostic into an RDF Believer. I will recap the main points about what I think it will take in order for the Rod Page Challenge to REALLY be met (i.e. for machines to make interesting inferences and provide humans with information about biodiversity that would not be obvious otherwise):

  1. Resource descriptions in RDF need to be rich in triples containing object properties that link to other IRI-identified resources.
  2. "Discovery" of IRI-identified resources is more likely to lead to interesting information when the linked IRIs are from Internet domains controlled by different providers.
  3. Materialized entailed triples do not necessarily lead to "learning" useful things. Materialized entailed triples are useful if they allow the construction of more clever or meaningful queries, or if they state relationships that would not be obvious to humans.

Steve's point 1 is essentially the point I was making with the challenge. At the time of the challenge, RDF from major biodiversity informatics projects was in silos, with few (if any) links to external resources (the kinds of things Steve refers to in his point 2). As a result, the promised benefits from RDF simply haven't materialised. The lesson I took from this is that we need rich, dense cross-links between data sources (the "biodiversity knowledge graph"), and that's one reason I've been obsessed with populating BioNames, which links animal names to the primary literature (I'm planning to extend this to plants as well). Turns out , creating lots of cross links is really hard work, much harder than simply pumping out a bunch of RDF and waiting for it to automagically coalesce into an all-connected knowledge graph.

I posed the challenge back in 2011, and since then I think the landscape has changed to the extent that I wonder if trying to "fix" RDF is really the way forward.

XML is dead

Anyone (sane) developing for the web and wanting to move data around is using JSON, XML is hideous and best avoided. Much of the early work on RDF used XML, which only made things even harder than they already were. JSON beats XML, to the extent that RDF itself now has a JSON serialisation, JSON-LD. But JSON-LD is about more than the semantic web (see JSON-LD and Why I Hate the Semantic Web), and has the great advantage that you can actually ignore all the RDF cruft (i.e., the namespaces) and simply treat the data as key-value pairs (yay!). Once you do that, then you can have fun with the data, especially with databases such as CouchDB ("fun" and "database" in the same sentence, I know!).

Key-value pairs, document stores, and graph databases

The NoSQL "movement" has thrown up all sorts of new ways to handle data and to think about databases. We can think of RDF as describing a graph, but it carries the burden of all the namespaces, vocabularies, and ontologies that come with it. Compare that with the fun (there's that word again) of graph databases such as Neo4J with its graph gists. The Neo4J folks have made a great job of publicising their approach, and making it easy and attractive to play with.

So, we're in a interesting time when there are a bunch of technologies available, and I think maybe it's time to ask whether the community's allegiance to RDF and the Semantic Web has been somewhat misplaced...

Thursday, June 25, 2015

Biodiversity Data Journal data lost on the way to GBIF and EOL

Two ongoing challenges in biodiversity informatics are getting data into a form that is usable, and linking that data across different projects platforms. A recent and interesting approach to this problem are "data journals" as exemplified by the Biodiversity Data Journal. I've been exploring some data from this journal that has been aggregated by GBIf and EOL, and have come across a few issues. In this post I'll firstly outline the standard format for moving data between biodiversity projects, the Darwin Core Archive, then illustrate some of the pitfalls.

Darwin Core Archive

Firstly a quick digression on the Darwin Core Archive format, which has a few gotchas for newcomers to the format (such as myself). The Darwin Core Archive supports a "star schema" like this.


At the centre of the star is a table containing data either about taxa or occurrences. We can have additional tables with other sorts of data, and we also have a meta.xml file which tells us what all the data columns are and how the different tables are related to the core table.

For example, if we have taxa as our core, then we can have a table like this were each taxon has a unique taxon_id:

taxon_idtaxon stuff

Now, imagine that we have a reference for each of these taxa (say it's the paper that originally described these species). Then we could add a unique identifier for that reference reference_id to the taxon table:

taxon_idreference_idtaxon stuff

Now, if we were building a relational database we could have a separate table for the references, and link the two table using the reference_id as a primary key for the references and as a foreign key in the taxon table, like this:

reference_idreference stuff

This means that we need only have the reference stored once, which means there's no redundancy. If we need to update the reference data, we only need to do it once.

However, this is not how Darwin Core Archive works. Because it's a star schema, we need to have a references table like this:

reference_idtaxon_idreference stuff

Note that we have added the taxon_id to link the reference to each taxon, and that the same reference occurs three times (once for each taxon it refers to), hence we have redundancy. Note also that if we don't include the taxon_id key then there's no way for a Darwin Core Archive reader to link the reference to the corresponding taxa (we'll come back to this below).

I've said that the reference are in their own table. In fact, we can have everything in one big table, and use the meta.xml table to tell a Darwin Core Archive reader to process that same table but extract different data each time (the Mammal Species of the World checklist http://doi.org/10.15468/csfquc is an example of this). Hence, we could extract taxon_id and taxon stuff for the taxa, then reference_id, reference stuff for the references.

taxon_idreference_idtaxon stuffreference stuff

The other thing to remember is that the meta.xml file is responsible for describing the data. It does this in two ways (1) it defines the type of data a given table contains (e.g., taxa, occurrence, image, etc.), and (2) it defines what each column in the data represents, using a controlled vocabulary.

The type of data each table contains is defined by a URI, and the list of these "registered extensions" is available from GBIF. The two "core" extensions are for taxa and occurrences, the two things GBIF primarily deals with, while the other extensions enable richer data to be added. Of course, a Darwin Core Archive consumer that doesn't understand these extensions can simply ignore them. Rather unfortunately, some extensions, such as the EOL media and references extensions overlap with the GBIF multimedia and references extensions. Hence, if you have, say images or bibliographic data, you have two extensions to choose from. If you choose EOL's then EOL will import your data, but GBIF won't. Furthermore, the extensions vary in richness. If you have bibliographic data then GBIF's vocabulary for references looks sparse and lacking many of the fields one might expect, whereas EOL's is quite rich.

Problems with Biodiversity Data Journal and GBIF

With that background, let's take a look at what happens to Biodiversity Data Journal (BDJ) data once it enters GBIF. For example, the species Eupolybothrus cavernicolus, described using "transcriptomic, DNA barcoding and micro-CT imaging data" (http://dx.doi.org/10.3897/BDJ.1.e1013). Data from this paper is in GBIF as both an occurrence dataset (http://doi.org/10.15468/zpz4ls) and checklist dataset (http://doi.org/10.15468/rpavbl).


The checklist dataset includes both media and references. The images don't appear in GBIF, but are visible in EOL (e.g., http://eol.org/data_objects/26558840 shown below:

36163 orig Because the type for the media is set to a type (http://eol.org/schema/media/Document) that only EOL recognises, GBIF doesn't harvest the images, and hence misses out on all this extra multimedia goodness.


The references in the BDJ dataset don't appear in either GBIF or EOL (see http://eol.org/pages/38177334/literature). Presumably they don't appear in GBIF because BDJ uses EOL's extension, but why don't they appear in EOL? Looking at the raw data, the references.csv file in the Darwin Core lacks the coreid field needed to link the references to the corresponding taxon (the fiels is defined in the meta.xml file, but there is no corresponding column in the references.csv file. Looking at other BDJ Darwin Core Archives this seems to be a common problem.


Strangely the BDJ paper shows a map with a point locality, but the same data in GBIF does not (see http://doi.org/10.15468/zpz4ls). Mapbdj

A look at the occurrences.csv shows that the file has verbatim latitude and longitude but not decimal versions of the coordinates, which is what GBIF uses to locate records on the map. So the BDJ data set isn't contributing any geographical data. Clearly a lot of BDJ data is georeferenced (see map), but not this example.


The centipede Eupolybothrus cavernicolus is not in GBIF's backbone classification. This is a common issue, especially with newly described taxa. GBIF does not have access to recent nomenclatural data, and so even though the BDJ data comes with a ZooBank LSID urn:lsid:zoobank.org:act:6F9A6F3C-687A-436A-9497-70596584678C for the name Eupolybothrus cavernicolus, GBIF itself doesn't know about and so if you do a default search on the name Eupolybothrus cavernicolus you get only the genus.


Here are the issues I uncovered after a little bit of messing about:

  1. BDJ Darwin Core Archives don't support extensions recognised by GBIF.
  2. BDJ references lack the coreid for the taxa/occurrences and hence are not ingested by Darwin Core readers.
  3. BDJ does not seem to parse and interpret verbatim coordinates when generating Darwin Core Archives.
  4. GBIF doesn't support the extensions output by BDJ.
  5. GBIF's references extension is woefully inadequate for handling bibliographic metadata.
  6. GBIF's list of taxonomic names is woefully out of date.

What both puzzles and frustrates me is that a much trumpeted collaboration between these projects has significant problems which seem to have gone undetected. It seems as if it is enough to have a pipeline between a data journal and a project, without actually testing whether that pipeline loses or misrepresents the data. In some cases, very little of the data in a BDJ archive actually makes it into GBIF, which is wasteful and rather defeats the point of having a data journal to database pipeline in the first place.

Wednesday, June 24, 2015

Thoughts on ReCon 15: DOIs, GitHub, ORCID, altmetric, and transitive credit

Man03gTw 400x400I spent last Friday and Saturday at (Research in the 21st Century: Data, Analytics and Impact, hashtag #ReCon_15) in Edinburgh. Friday 19th was conference day, followed by a hackday at CodeBase. There's a Storify archive of the tweets so you can get a sense of the meeting.

Sitting in the audience a few things struck me.

  1. No identifier wars, DOIs have won and are everywhere.
  2. GitHub is influencing the way we do science, but we've much still to learn.
  3. ORCIDs are gaining traction.
  4. Nobody really understands "impact".


GitHub is becoming more and more important, not only as a repository of scientific code and data, but as a useful model of sorts of things we need to be doing. Arron Smith gave a fascinating talk on GitHub. Apart from the obvious things such as version control, Arfon discussed the tools and mindset of open source programmers, and who that could be applied to scientific data. For example, software on GitHub is often automatically tested for bugs (and GitHub displays a badge saying whether things are OK). Imagine doing this for a data set, having it automatically checked for errors and/or internal consistency. Reproducibility is a big topic in science, but open source software has to be reproducible by default in the sense that it has to be able to be downloaded and compiled on a user's computer. This is just a couple of the things Arfon covered, see his slides for more.

Transitive Credit

One idea which particularly struck me was that of "transitive credit":

Katz, D. S. (2014, February 10). Transitive Credit as a Means to Address Social and Technological Concerns Stemming from Citation and Attribution of Digital Products. JORS. Ubiquity Press, Ltd. http://doi.org/10.5334/jors.be

From the above paper:

The idea of transitive credit is as follows: The credit map for product A, which is used by product B, feeds into the credit map for product B. For example, product A is a software package equally written by two authors and its credit map is that 50 percent of the credit for this should go the lead developer, 20 percent to the second developer, and 10 percent to the third developer. In addition, 5 percent should go to each of the four libraries that are needed to run the code. When this product is created and registered, this credit map is registered along with it. Product B is a paper that obtains new science results, and it depended on Product A. The person who registers the publication also registers its credit map, in this case 75 percent to her/himself, and 25 percent to the software code previous mentioned. Credit is now transitive, in that the lead software developer of the code can be given credit for 12.5 percent of the paper. If another paper is later written that extends the product B paper and gives 10% credit to that paper, the lead software package developer will also have 1.25% credit for the new paper.
The idea of being able to track credit across derived products is interesting, and is especially relevant to projects such as GBIF, where users can download large datasets that are themselves aggregations of data from numerous different providers (making it was to calculate the relative contributions of each provider). If we then track citations of that data (and citations of those citations) we could give data providers a better estimate of the actual impact of their data.


Euan Adie of altimetric talked about "impact", and remarked on an example of a paper being cited in a policy document and this being picked up by altimetric and seen by the authors of the paper, who had no idea that their work had influenced a policy document. This raises some intriguing possibilities, related to the idea of "transitive credit" above.

In building BioNames I've added the ability to show altimetric "donuts" and I'm struck by examples like this one (see also reference in BioNames):

JENKINS, P. D., & ROBINSON, M. F. (2002, June). Another variation on the gymnure theme: description of a new species of Hylomys (Lipotyphla, Erinaceidae, Galericinae). Bulletin of The Natural History Museum. Zoology Series. Cambridge University Press (CUP) doi:10.1017/S0968047002000018

This paper has no recent "buzz" (e.g., Twitter, Facebook, Mendeley) but is cited on three Wikipedia pages. So, this paper has impact, albeit in social media. Many papers like this will slip below the social media radar but will be used by various databases and may contribute to subsequent work. Perhaps we could expand alt metrics sources of information to include some of those databases. For example, if a paper has been aggregated/cited by a major databases (such as GBIF) then it would be nice to see that on the altimetric donut. For authors this gives them another example of the impact of their work, but for the databases it's also an opportunity to increase engagement (if people have relevant work that doesn't appear in the donut they can take steps to have that work included in the aggregation). Obviously there are issues about what databases to count as providing signal for alt metrics, but there's scope here to broaden and quantify our notion of impact.


The ReCon hackney was an pretty informal event held at CodeBase just down from Edinburgh Castle, and apparently the largest start-up incubator in the European tech scene. It was a pretty amazing place, and a great venue for a hackney. I spent the day looking at the ORCID API and seeing if I could create some mashups with Journal Map and my own BioNames. One goal was to see if we could generate a map of researcher's study sites starting with their ORCID, using ORCID's API to retrieve a list of their publications, then talking to the Journal Map API to get point localities for those papers. The code worked, but the results were a little disappointing because Jim Caryl and I were focussing on University of Glasgow researchers, and they had few papesri n Journal Map. The code, such as it is, is in GitHub.

My original idea was to focus on BioNames, and see how many authors of taxonomic papers had ORCIDs. Initial experiments seemed promising (see GitHub for code and data). Time was limited, so I got as far has building lists of DOIs from BioNames and discovering the associated ORCIDs. The next steps would be (a) providing ORCID login to BioNames, and using ORCID to help cluster author name strings in BioNames. Still much to do.

I've not been to many hackdays/hackathons, but I find them much more rewarding than simply sitting in a lecture theatre and listening to people talk. Combining both types of meeting is great, and I look forward to similar event sin the future.

Visualising Geophylogenies in Web Maps Using GeoJSON

Fig3 GoogleMaps CC BY no logo 300x205I've published a short note on my work on geophylogenies and GeoJSON in PLoS Currents Tree of Life:

Page R. Visualising Geophylogenies in Web Maps Using GeoJSON. PLOS Currents Tree of Life. 2015 Jun 23 . Edition 1. doi:10.1371/currents.tol.8f3c6526c49b136b98ec28e00b570a1e.
At the time of writing the DOI hasn't registered, so the direct link is here. There is a GitHub repository for the manuscript and code.

I chose PLoS Currents Tree of Life because it is (supposedly) quick and cheap. Unfortunately a perfect storm of delays in reviewing together with licensing issues resulted in the paper taking nearly three months to appear. The licensing issues were a headache. PLoS uses the Creative Commons CC-BY license for all its content. Unfortunately, the original submission included maps from Google Maps and Open Street Map (OSM), to show that the GeoJSON produced by my tool could work with either. Google Maps tile imagery is not freely available, so I had to replace that in order for PLoS to be able to publish my figures. At first I used simply replaced the tiles Google Maps displays with ones from OSM, but those tiles are CC-BY-SA, which is incompatible with PLoS's use of CC-BY. Argh! I got stroppy about this on Twitter:

Eventually I discovered maps from CartoDB that have CC-BY licenses, and so could be used in the PLoS Currents article. After replacing Google's and OSM tiles with these maps (and trimming off the "Google" logo) the figures were acceptable to PLoS. Increasingly I think Creative Commons has resulted in a mess of mutually incompatible licenses that make mashing up things hard. The idea was great ("skip the intermediaries" by declaring that your content can be used), but the outcome is messy and frustrating.

But, enough grumbling. The article is out, the code is in GitHib. Now to think about how to use it.

Tuesday, May 19, 2015

Text mining for museum specimen identifiers

Class EMuMedia phpThis post is a response to Ross Mounce's post Text mining for museum specimen identifiers. As Ross notes in that post, mining literature for specimen codes is something I've been interested in for a while (search for specimen codes on iPhylo), and @Aime Rankin (formerly an undergraduate student at Glasgow) did some work on this as well. It's great to see progress in this area.

Here are some thoughts on Ross's post (I'm posting here rather than as a comment on Ross's blog because this is going to be long).

What questions to ask?

Obviously there's a lot of scope for metrics, such as numbers of citations for individual specimens, and league tables for collections (see GBIF specimens in BioStor: who are the top ten museums with citable specimens?). As Ross notes, there's also scope for updating out of date museum metadata with information from the literature (e.g., Linking data from the NHM portal with content in BHL), but even more interesting is the potential to cross-link databases in a way that permits novel queries. For example, if we have a paper on a disease that includes data we can link to a georeferenced specimen, then we can enable spatial queries for diseases (e.g., BHL and GBIF as biomedical databases).

Materials for mining

From my perspective the obvious corpus to mine is the Biodiversity Heritage Library (BHL). Ross repeats the erroneous view that BHL is just "legacy" literature. Apart from the obvious point that everything not published right not is, by definition, legacy, BHL has a lot of modern content (including papers published in the last couple of years).

Furthermore, there are journals that cite Natural History Museum specimens, including "in house" journals (e.g., Bulletin of the British Museum (Natural History) Zoology and Bulletin of the Natural History Museum. Zoology series), as well as the Bulletin of the British Ornithologists' Club which has published lots of new bird names for which the type specimen is often in the NHM.

I guess one issue is accessibility. Ross notes that:

The PMC OA subset is fantastic & really facilitates this kind of research – I wish ALL of the biodiversity literature was aggregated like (some) of the open access biomedical literature is. You can literally just download a million papers, click, and go do your research. It facilitates rigorous research by allowing full machine access to full texts.
So, how we can make BHL content as accessible? For each article I've extracted from BHL and stored in BioStor you can get full text by simply appending ".text" to the BioStor URL, but this isn't quite the same as grabbing a big dump of text.

The other source of mining is GenBank, which has a lot of sequences that have NHM vouchers, but also a weird and wonderful array of ways of recording those specimens. This is one reason I'm building "Material examined", to cope with these codes. For example sequence KF281084 has voucher "TRING 1877111743" which more traditionally would be written as "BMNH 1877.11.17.43", which is "NHMUK 1877.11.17.43" in the NHM database. This is just one example of the horrors of matching specimen codes (for more see the code for Material examined).

One reason GenBank is useful is that the sequences are often linked to the literature, which means you get to make the link between specimen and literature without actually needing to mine the text itself (handy if access is problematic).

Bonus question: How should I publish this annotation data?

But if I wanted to publish something a little better & a little more formal, what kind of RDF vocabulary can I use to describe “occurs in” or “is mentioned in”. What would be the most useful format to publish this data in so that it can be re-used and extended to become part of the biodiversity knowledge graph and have lasting value?
Personally I'd avoid RDF because that way lies madness (or at least endless detours haggling about ontologies).

But making the output useful is an important question. Despite the fact that it is a bit clunky, I suspect Darwin Core Archives are the way to go. The core data is a CSV table, so it's easy to generate, and also easy to use. Lets say you analysed a particular corpus (e.g., PLoS ONE), you could then output the data in Darwin Core (making sure both specimen and publication had stable identifiers), then package it up and upload to Zenodo or Figshare and get a DOI. For bonus points, it would be great to see this data on GBIF, but this would require (a) mapping NHM specimen codes to GBIF ids (the NHM has this), and (b) GBIF being able to recognise that the data you're adding is not new specimens but rather annotations of existing specimens.

Things to think about

Here are a couple of additional things to think about.

Specimen finding as a service

In the same way that we have taxonomic name-finding services, it would be great if we had a specimen code-finding service. I have code that I use in BioStor, but it would be great to have something that is robust, stable, and generalisable across multiple specimen codes. My tool Material examined focusses on parsing a single string rather than parsing a block of text, but adding that functionality is an obvious thing to do.

Markup as output

One concern I have with work that involves mining text is that we hardly ever store the intermediate step of text + located elements. Instead we get to see sumamry output (e.g., this page has these three scientific names, and these 10 specimen codes). As Terry Catapano (@catapanoth) once wisely pointed out "indexing is markup", in that if you find a substring in some text, you have in effect marked up the text. Can we preserved the marked up text so that we go back and look at it and improve our text mining methods, or make that markup available to others to build upon it? There are all sorts of things which could be built upon this information, for example, imaging if the results where given to BHL so that people could search by specimen code.