Wednesday, February 03, 2016

Bootstrapping the biodiversity knowledge graph with JSON-LD

In a recent Twitter conversation including David Shorthouse and myself (and other poor souls who got dragged in) we discussed how to demonstrate that adopting JSON-LD as a simple linked-data friendly format might help bootstrap the long-awaited "biodiversity knowledge graph" (see below for some suggestions for keeping JSON-LD simple). David suggests partnering with "Three small, early adopting projects". I disagree.

I think we need to approach this problem not from the perspective of who would like to try this approach, but of what it would take to be really useful. If we look at the structure of the biodiversity knowledge graph and at who has what identifiers, we can gain some insight into what the next steps are.

Sinks

A fundamental problem is that many data providers are essentially dead ends in terms of building a network. They have data, but no connections, and so they are "sinks". A data browser goes there and can't go any further; it has to retrace its steps and go somewhere else. For example, your typical museum might serve up its collection data like this: [diagram: museum record] The museum has its own identifiers (e.g., a URL) and that's the only identifier in the data. Nomenclators are much the same: [diagram: nomenclator record] You get a nomenclator-specific identifier such as an LSID, but no other identifiers. Once again, if you are a data-crawler traversing the web of data, this is a dead end. (This is why I've become obsessed with linking nomenclators to the primary literature: I want to stop them being dead ends.)

Connected sources

Then there are sources which have at least one external identifier, that is, an identifier that they themselves don't control. For example, CrossRef manages DOIs for articles, but if you get metadata for a DOI you also get an ISSN (identifying a journal): [diagram: CrossRef record] Now we have a connection to an external source of data, and so we can traverse the data graph further, which means we can start asking questions (such as how many articles in this journal have DOIs?).

Another connected source is ORCID: [diagram: ORCID record] This gives us identifiers for authors (the ORCID) linked to article identifiers such as DOIs and PubMed ids (PMIDs). Follow the DOI to CrossRef and we can link people to articles to journals.

Another connected source is the NCBI: [diagram: NCBI records] NCBI has several internal identifiers (PMIDs, GenBank accession numbers, tax_ids), all of which lead to rich resources, and it's possible to get a lot of information by staying within NCBI's own silo, but there are external links such as DOIs (usually found attached to PubMed articles or to GenBank accessions) and links to external records such as DNA barcodes in BOLD.

Let's join CrossRef, ORCID, and NCBI together: [diagram: CrossRef, ORCID, and NCBI combined] Now we have a bipartite graph linking sources with identifiers. We can imagine playing a game where we try and connect different entities by moving through the graph. For example, we could take an author identified by an ORCID, follow a DOI to a PMID, a PMID to a set of accession numbers, and then be able to list all the sequences that an author has published. We could also step back and ask questions about which identifiers are the most useful in terms of making connections between different sources, and which sources provide the most cross links.
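Just to make the game concrete, here's a toy sketch in Javascript (the sketch is mine, the lists of identifiers are simplified from the diagrams above, and none of this is meant as a schema): treat sources and identifier types as nodes in a bipartite graph, then find a path between two identifier types.

```javascript
// Toy bipartite graph: which identifier types does each source expose?
// (Simplified from the diagrams above; purely illustrative.)
const sources = {
  CrossRef: ["DOI", "ISSN"],
  ORCID:    ["ORCID", "DOI", "PMID"],
  NCBI:     ["PMID", "DOI", "ACCESSION", "TAXID", "BOLD"]
};

// Build an undirected graph whose nodes are sources (prefixed "source:")
// and identifier types.
const neighbours = {};
function addEdge(a, b) {
  (neighbours[a] = neighbours[a] || new Set()).add(b);
  (neighbours[b] = neighbours[b] || new Set()).add(a);
}
for (const [source, ids] of Object.entries(sources)) {
  for (const id of ids) addEdge("source:" + source, id);
}

// Breadth-first search: how do we get from one identifier type to another,
// and which sources do we pass through on the way?
function path(from, to) {
  const queue = [[from]];
  const seen = new Set([from]);
  while (queue.length > 0) {
    const current = queue.shift();
    const node = current[current.length - 1];
    if (node === to) return current;
    for (const next of neighbours[node] || []) {
      if (!seen.has(next)) {
        seen.add(next);
        queue.push([...current, next]);
      }
    }
  }
  return null; // no connection: a "sink"
}

// ORCID -> ? -> GenBank accession, i.e. "all the sequences an author has published"
console.log(path("ORCID", "ACCESSION"));
// e.g. [ 'ORCID', 'source:ORCID', 'DOI', 'source:NCBI', 'ACCESSION' ]
```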

Filling in the gaps

Now, the graph above is obviously incomplete. I've restricted it to a few of the key services that I'm familiar with and that make use of external identifiers. One of the big obstacles to fleshing out the biodiversity knowledge graph is the frequent lack of reuse of identifiers. It's not enough to pump out data in linked-data form, you need to build the links. And not just "same as" style links connecting multiple identifiers for the same thing; you need to connect identifiers for different things. Until we tackle that, linked data approaches will not deliver much in the way of value. Hence we need sources that provide genuinely linked data by reusing existing, external identifiers. If we have a name in a nomenclator with a citation string, it needs to link that dumb literature string to a DOI. If we have a taxon concept, it needs to link the name of that taxon to a name in a nomenclator. If we have a specimen that has been sequenced, it needs to give the accession number of that sequence.

From this perspective, the choice of which kinds of data to explore with JSON-LD and a linked data graph should be driven by how connected those sources are: the more connected the source, the more interesting the questions we can ask. Sadly, the vast majority of biodiversity data providers don't provide the kind of connected data we need, which means we continually pay lip service to linked data without feeding it the kind of data it needs in order to grow.

* Notes on JSON-LD

It is relatively easy to write horrible JSON-LD, so I think it would be useful to strive to make it as simple and as human-readable as possible (ignoring the @context block, which is always going to be awful; this is the price we pay for simplicity elsewhere). To this end I think we should do at least the following:

  1. No URLs as identifiers. URLs are ugly, and they are not reliable as identifiers. Providers can change URLs, even for persistent identifiers. The DOI prefix has changed from http://dx.doi.org/ to the preferred http://doi.org/, and the rise of HTTPS (prompted by concerns about security, among other issues, see Google's motives, part 2) is going to break a lot of older URLs (see Web Security - "HTTPS Everywhere" harmful). These changes keep happening, so let's try to shield ourselves from them by using standard prefixes for identifiers (such as "DOI" for a DOI). Put all the alternative URL resolver prefixes in the @context block and use prefixes such as those standardised by http://identifiers.org (which basically formalises what many people in bioinformatics have been doing for a while).
  2. No prefixes for keys. In other words, no CURIEs. Don't write "dwc:CatalogueNumber", just write "CatalogueNumber". I don't want to be told where the term comes from (that's what @context is for). If you have a namespace clash (i.e., two terms that are the same unless you include the namespace) then IMHO you're doing it wrong. Either you're using more than one vocabulary for the same thing (why?), or you're not modelling the data at the appropriate level of granularity. Either way, let's avoid clutter and keep things simple and readable (see the sketch below).
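To make those two rules concrete, here's a hedged sketch of the sort of JSON-LD I have in mind, written as a Javascript literal. The context mappings, terms, and values are purely illustrative (not a proposed vocabulary), and the identifiers.org-style prefixes are just one way of doing it.

```javascript
// Illustrative only: plain keys (no CURIEs), prefix-style identifiers (no URLs),
// and all the ugliness confined to @context.
const specimen = {
  "@context": {
    "@vocab": "http://rs.tdwg.org/dwc/terms/",   // treat bare keys as Darwin Core terms
    "DOI":    "http://identifiers.org/doi/",     // identifiers.org-style resolver prefixes
    "ORCID":  "http://identifiers.org/orcid/",
    "references": { "@type": "@id" },
    "recordedBy": { "@type": "@id" }
  },
  "catalogNumber":  "ABC 12345",                 // made-up values
  "scientificName": "Aframomum lutarium",
  "references":     "DOI:10.3897/zookeys.550.9293",
  "recordedBy":     "ORCID:0000-0002-1825-0097"
};
```

A JSON-LD processor can expand "DOI:10.3897/..." to a full URI using the context, but a human (or a quick and dirty script) can read the document without caring.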

Monday, January 25, 2016

iSpecies is back: mashing up species data

A decade ago (OMG, that can't be right, an actual decade ago) I created "iSpecies", a simple little tool to mashup a variety of data from GBIF, NCBI, Yahoo, Wikipedia, and Google Scholar to create a search engine for species. It was written in PHP, relied on some degree of *cough* web scraping to get its data, and was a bit of a toy (although that didn't stop me complaining that it could do more than EOL at the time). Eventually I got sick of dealing with Google Scholar constantly changing its HTML and blocking IP addresses to stop people harvesting data (I once managed to get my entire campus blocked), and with services disappearing such as Yahoo's image search, so I pulled the plug on it.

A short course I run on "phyloinformatics" starts this week and one of the examples I show is a crude Javascript-based mashup. It struck me that I could tweak that and recreate a simple version of iSpecies, and that's exactly what I've done: http://ispecies.org.

[Screenshot of the new iSpecies]

It's nothing fancy, it just takes a species name, searches GBIF, EOL, CrossRef, and Open Tree of Life, grabs some data, and puts it together on a web page. There are lots of limitations (e.g., it only fetches the first 300 localities in GBIF, requires scientific names, and the tree viewer is pretty awful) but it was pretty simple to put together. It's entirely client-side: the code is all in the HTML file plus a few Javascript libraries (the code is on GitHub: https://github.com/rdmpage/ispecies).
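In case it's useful to see just how little code this sort of thing needs, here's a stripped-down sketch of the idea (this is not the actual iSpecies source; the GBIF and CrossRef endpoints are their public JSON APIs as I understand them, and EOL and Open Tree are left out for brevity):

```javascript
// A stripped-down sketch of the iSpecies idea: match a name in GBIF,
// grab some occurrences, and find related literature in CrossRef.
async function lookup(name) {
  // Match the name against the GBIF backbone to get a taxon key
  const match = await fetch(
    "https://api.gbif.org/v1/species/match?name=" + encodeURIComponent(name)
  ).then(r => r.json());

  // Grab a page of occurrence records for that taxon
  const occurrences = await fetch(
    "https://api.gbif.org/v1/occurrence/search?limit=20&taxonKey=" + match.usageKey
  ).then(r => r.json());

  // Ask CrossRef for articles mentioning the name
  const literature = await fetch(
    "https://api.crossref.org/works?rows=5&query=" + encodeURIComponent(name)
  ).then(r => r.json());

  return {
    taxon: match,
    points: occurrences.results
      .filter(o => o.decimalLatitude != null)
      .map(o => [o.decimalLatitude, o.decimalLongitude]),
    papers: literature.message.items.map(w => (w.title || [])[0])
  };
}

lookup("Alopias vulpinus").then(result => console.log(result));
```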

Fun as this was, there's a bigger problem with iSpecies, and that's that it is a "mashup". I'm simply grabbing data from different sources and redisplaying it. What I really want is what has been described as a "meshup" (awful term, don't use it), that is, I want to combine the data so that it is more than the sum of its parts. For example, some of the data could be cross linked (especially if we add a few more sources and drill down a bit). Some of the papers discovered by CrossRef may include original descriptions, or may be the source of some of the points plotted on the GBIF map. Some may include the phylogenies used to build the Open Tree of Life tree. In order to build a data meshup instead of a web mashup we need to operate at the level of data rather than just human-readable web pages. That is the next thing I'd like to work on, and in many ways it shouldn't be a big leap. The new iSpecies was fairly easy to create because we now have a bunch of web services that all speak JSON. It's a small step from JSON to JSON-LD (especially if the JSON-LD is constructed with reuse in mind). So while it's nice to see iSpecies back, there's a much more interesting next step to think about.
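As a hint of how small that step can be, here's a sketch: take the kind of JSON record an API already returns and add an @context saying what the keys mean (the mapping below is illustrative):

```javascript
// A record of the sort a JSON API might already return (values invented)
const record = {
  scientificName:   "Alopias vulpinus",
  decimalLatitude:  -33.9,
  decimalLongitude: 151.2
};

// The "small step": say what the keys mean, and it becomes JSON-LD
const linked = {
  "@context": {
    "@vocab": "http://rs.tdwg.org/dwc/terms/"   // e.g. read the keys as Darwin Core terms
  },
  ...record
};
```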

Wednesday, January 13, 2016

Surfacing the deep data of taxonomy

My paper "Surfacing the deep data of taxonomy" (based on a presentation I gave in 2011) has appeared in print as part to a special issue of Zookeys:

Page, R. (2016, January 7). Surfacing the deep data of taxonomy. ZooKeys. Pensoft Publishers. http://doi.org/10.3897/zookeys.550.9293
The manuscript was written shortly after the talk, but as is the nature of edited volumes it's taken a while to appear.

My tweet about the paper sparked some interesting comments from David Shorthouse.

This is an appealing vision, because it seems unlikely that having multiple, small communities clustered around taxa will ever have the impact that taxonomists might like to have. Perhaps if we switch to focussing on objects (sequences, specimens, papers), notions of identity (e.g., DOIs, ORCIDs), and alternative measures of impact we can increase the visibility and perceived importance of the field. In this context, the recent paper "Wikiometrics: A Wikipedia Based Ranking System" http://arxiv.org/abs/1601.01058 looks interesting. A big consideration will be how connected the network linking taxonomists, papers, sequences, specimens, and names turns out to be. If it's anything like the network of readers in Mendeley then we may face some challenges in community building around such a network.

Thursday, December 17, 2015

Will JSON, NoSQL, and graph databases save the Semantic Web?

OK, so the title is pure click bait, but here's the thing. It seems to me that the Semantic Web as classically conceived (RDF/XML, SPARQL, triple stores) has had relatively little impact outside academia, whereas other technologies such as JSON, NoSQL (e.g., MongoDB, CouchDB) and graph databases (e.g., Neo4J) have got a lot of developer mindshare.

In biodiversity informatics the Semantic Web has been around for a while. We've been pumping out millions of RDF documents (mostly served by LSIDs) since 2005 and, to a first approximation, nothing has happened. I've repeatedly blogged about why I think this is (see this post for a summary).

I was an early fan of RDF and the Semantic Web, but soon decided that it was far more hassle than it was worth. The obsession with ontologies, the problems of globally unique identifiers based on HTTP (httpRange-14, anyone?), the need to get a lot of ducks in a row all made it a colossal pain. Then I discovered the NoSQL document database CouchDB, which is a JSON store that features map-reduce views rather than on-the-fly queries. To somebody with a relational database background this is a bit of a headfuck:

[image: fault tolerance]

But CouchDB has a great interface, can be replicated to the cloud, and is FUN (how many times can you say that about a database?). So I started playing with CouchDB for small projects, then used it to build BioNames, and more recently moved BioStor to CouchDB hosted both locally and in the cloud.
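For anyone who hasn't met CouchDB views, here's a minimal sketch of the map-reduce style (the document fields are invented, and in a real design document the map and reduce functions are stored as strings; emit and sum are provided by CouchDB). It's the sort of view that would answer a question like "how many articles in this journal have a DOI?":

```javascript
// Map: run once over every document, emitting key/value pairs into an index.
// Here we assume documents shaped like { journal: "...", doi: "..." }.
const map = function (doc) {
  if (doc.journal && doc.doi) {
    emit(doc.journal, 1);   // key = journal, value = 1
  }
};

// Reduce: collapse the values for each key, giving articles-with-DOIs per journal.
// (CouchDB's built-in "_count" reduce does the same job.)
const reduce = function (keys, values, rereduce) {
  return sum(values);
};
```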

Then there are graph databases such as Neo4J, which has some really cool things such as GraphGists, a playground where you can create interactive graphs and query them (here's an example I created). Once again, this is FUN.

Another big trend over the last decade is the flight from XML and its hideous complexities (albeit coupled with great power) to the simplicity of JSON (part of the rise of JavaScript). JSON makes it very easy to pass around data in simple key-value documents (with more complexity such as lists if you need them). Pretty much any modern API will serve you data in JSON.

So, what happened to RDF? Well, along with a plethora of formats (XML, triples, quads, etc., etc.) it adopted JSON in the form of JSON-LD (see JSON-LD and Why I Hate the Semantic Web for background). JSON-LD lets you have data in JSON (which both people and machines find easy to understand) and all the complexity/clarity of having the data clearly labelled using controlled vocabularies such as Dublin Core and schema.org. This complexity is shunted off into a "@context" variable where it can in many cases be safely ignored.

But what I find really interesting is that instead of JSON-LD being a way to get developers interested in the rest of the Semantic Web stack (e.g. HTTP URIs as identifiers, SPARQL, and triple stores), it seems that what it is really going to do is enable well-described, structured data to get access to all the cool things being developed around JSON. For example, we have document databases such as CouchDB, which speaks HTTP and JSON, and search servers such as ElasticSearch, which make it easy to work with large datasets. There are also some cool things happening with graph databases and Javascript, such as Hexastore (see also Weiss, C., Karras, P., & Bernstein, A. (2008, August 1). Hexastore. Proc. VLDB Endow. VLDB Endowment. http://doi.org/10.14778/1453856.1453965, PDF here), where we create the six possible indexes of the classic RDF [subject, predicate, object] triple (the sort of thing that can also be done in CouchDB). Hence we can have graph databases implemented in a web browser!
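To make the hexastore idea concrete, here's a toy version in a few lines of Javascript (the triples are invented; in CouchDB you could emit the same six keys from a map function):

```javascript
// The hexastore trick: index each [subject, predicate, object] triple under all
// six orderings, so any query pattern becomes a prefix lookup. Here the "store"
// is just a sorted array of key strings; the triples are illustrative.
function hexKeys([s, p, o]) {
  return [
    ["spo", s, p, o], ["sop", s, o, p],
    ["pso", p, s, o], ["pos", p, o, s],
    ["osp", o, s, p], ["ops", o, p, s]
  ].map(parts => parts.join("|"));
}

const triples = [
  ["doi:10.3897/zookeys.550.9293", "creator",     "orcid:0000-0002-1825-0097"],
  ["doi:10.3897/zookeys.550.9293", "publishedIn", "journal:ZooKeys"]
];

const index = triples.flatMap(hexKeys).sort();

// "Everything we know about this DOI" is a prefix scan over the spo ordering
const prefix = "spo|doi:10.3897/zookeys.550.9293|";
console.log(index.filter(key => key.startsWith(prefix)));
```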

So, when we see large-scale "Semantic Web" applications that actually exist and solve real problems, we may well be more likely to see technologies other than the classic Semantic Web stack. As an example, see the following paper:

Szekely, P., Knoblock, C. A., Slepicka, J., Philpot, A., Singh, A., Yin, C., … Ferreira, L. (2015). Building and Using a Knowledge Graph to Combat Human Trafficking. The Semantic Web - ISWC 2015. Springer Science + Business Media. http://doi.org/10.1007/978-3-319-25010-6_12

There's a free PDF here, and a talk online. The researchers behind this project did extensive text mining, data cleaning and linking, creating a massive collection of JSON-LD documents. Rather than use a triple store and SPARQL, they indexed the JSON-LD using ElasticSearch (notice that they generated graphs for each of the entities they care about, in a sense denormalising the data).
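The general pattern (sketched below with made-up index and document names, and emphatically not their actual pipeline) is simply to treat each JSON-LD document as ordinary JSON, hand it to ElasticSearch over HTTP, and query it back:

```javascript
// Assumes a local ElasticSearch at the default port; index and document are invented.
const doc = {
  "@context": { "@vocab": "http://schema.org/" },
  "@id":   "urn:example:article-1",
  "@type": "ScholarlyArticle",
  "name":  "Surfacing the deep data of taxonomy"
};

async function demo() {
  // Index the JSON-LD document: to ElasticSearch it's just JSON
  await fetch("http://localhost:9200/entities/_doc/article-1?refresh=true", {
    method: "PUT",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(doc)
  });

  // ...and get it back with an ordinary full-text query
  const result = await fetch("http://localhost:9200/entities/_search", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query: { match: { name: "taxonomy" } } })
  }).then(r => r.json());

  console.log(result.hits.hits);
}

demo();
```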

I think this is likely to be the direction many large-scale projects are going to be going. Use the Semantic Web ideas of explicit vocabularies with HTTP URIs for definitions, encode the data in JSON-LD so it's readable by developers (no developers, no projects), then use some of the powerful (and fun) technologies that have been developed around semi-structured data. And if you have JSON-LD, then you get SEO for free by embedding that JSON-LD in your web pages.
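For example, the same description could be dropped into a web page as a script tag of type "application/ld+json", which is what the search engines look for. A sketch (the document itself is made up; in practice it would be generated from the page's own data):

```javascript
// Made-up schema.org description of an article
const jsonld = {
  "@context": "http://schema.org",
  "@type": "ScholarlyArticle",
  "name":  "Surfacing the deep data of taxonomy",
  "author": { "@type": "Person", "name": "Roderic Page" }
};

// Embed it in the page so crawlers can pick it up
const script = document.createElement("script");
script.type = "application/ld+json";
script.textContent = JSON.stringify(jsonld);
document.head.appendChild(script);
```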

In summary, if biodiversity informatics wants to play with the Semantic Web/linked data then it seems obvious that some combination of JSON-LD with NoSQL, graph databases, and search tools like ElasticSearch are the way to go.

Wednesday, December 09, 2015

Visualising the difference between two taxonomic classifications

It's a nice feeling when work that one did ages ago seems relevant again. Markus Döring has been working on a new backbone classification of all the species which occur in taxonomic checklists harvested by GBIF. After building a new classification the obvious question arises: "how does this compare to the previous GBIF classification?" A simple question; answering it, however, is a little tricky. It's relatively easy to compare two text files -- this functionality appears in places such as Wikipedia and GitHub -- but comparing trees is a little trickier. Ordering in trees is less meaningful than in text files, which have a single linear order. In other words, as text strings "(a,b,c)" and "(c,b,a)" are different, but as trees they are the same.

Classifications can be modelled as a particular kind of tree where (unlike, say, phylogenies) every node has a unique label. For example, the tips may be species and the internal nodes may be higher taxa such as genera, families, etc. So, what we need is a way of comparing two rooted, labelled trees and finding the differences. Turns out, this is exactly what Gabriel Valiente and I worked on in this paper doi:10.1186/1471-2105-6-208. The code for that paper (available on GitHub) computes an "edit script" that gives a set of operations to convert one fully labelled tree into another. So I brushed up my rusty C++ skills (I'm using "skills" loosely here) and wrote some code to take two trees and the edit script, and create a simple web page that shows the two trees and their differences. Below is a screen shot showing a comparison between the classification of whales in Mammal Species of the World and one from GBIF (you can see a live version here).
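To give a flavour of what the comparison involves, here's a toy version (this is nothing like the algorithm in the paper, and the example trees are drastically simplified), exploiting the fact that every node has a unique label:

```javascript
// Toy comparison of two fully labelled trees: classify labels as deleted,
// inserted, or moved. Because every node has a unique label, each tree can be
// given as a simple child -> parent map.
function diff(treeA, treeB) {
  const result = { deleted: [], inserted: [], moved: [] };
  for (const node of Object.keys(treeA)) {
    if (!(node in treeB)) result.deleted.push(node);
    else if (treeA[node] !== treeB[node]) result.moved.push(node);
  }
  for (const node of Object.keys(treeB)) {
    if (!(node in treeA)) result.inserted.push(node);
  }
  return result;
}

const msw  = { Cetacea: "Mammalia", Delphinidae: "Cetacea", Orcinus: "Delphinidae" };
const gbif = { Cetacea: "Mammalia", Delphinidae: "Cetacea", Orcinus: "Delphinidae",
               Basilosauridae: "Cetacea" };   // e.g. a fossil family absent from MSW

console.log(diff(msw, gbif));
// { deleted: [], inserted: [ 'Basilosauridae' ], moved: [] }
```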

[Screenshot: comparing the whale classifications from Mammal Species of the World and GBIF]

The display uses colours to show whether a node has been deleted from the first tree, inserted into the second tree, or moved to a different position. Clicking on a node in one tree scrolls the corresponding node in the other tree (if it exists) into view. Most of the differences between the two trees are due to the absence of fossils from Mammal Species of the World, but there are other issues, such as GBIF ignoring tribes, and a few taxa that are duplicated due to spelling typos, etc.

Tuesday, December 01, 2015

Frontiers of biodiversity informatics and modelling species distributions #GBIFfrontiers @AMNH videos

For those of you who, like me, weren't at the "Frontiers Of Biodiversity Informatics and Modelling Species Distributions" held at the AMNH in New York, here are the videos of the talks and panel discussion, which the organisers have kindly put up on Vimeo with the following description:

The Center for Biodiversity and Conservation (CBC) partnered with the Global Biodiversity Information Facility (GBIF) to host a special "Symposium and Panel Discussion: Frontiers Of Biodiversity Informatics and Modelling Species Distributions" at the American Museum of Natural History on November 4, 2015.

The event kicked off a working meeting of the GBIF Task Group on Data Fitness for Use in Distribution Modelling at the City College of New York on November 5-6. GBIF convened the Task Group to assess the state of the art in the field, to connect with the worldwide scientific and modelling communities, and to share a vision of how GBIF can support them in the coming decade.

The event successfully convened a broad, global audience of students and scientists to exchange ideas and visions on emerging frontiers of biodiversity informatics. Using inputs from the symposium and from a web survey of experts, the Data Fitness task group will prepare a report, which will be open for consultation and feedback at GBIF.org and on the GBIF Community Site in December 2015.

Guest post: 10 years of global biodiversity databases: are we there yet?

This guest post by Tony Rees explores some of the themes from his recent talk 10 years of Global Biodiversity Databases: Are We There Yet?.

A couple of months ago I received an invitation to address the upcoming 2015 meeting of the Malacological Society of Australasia (Australian and New Zealand mollusc persons for the uninitiated) on some topic associated with biodiversity databases, and I decided that a decadal review might be an interesting exercise, both for my potential audience (perhaps) and for my own interest (definitely). Well, the talk is delivered and the slides are available on the web for viewing if interested, and Rod has kindly invited me to present some of its findings here, and possibly stimulate some ongoing discussion since a lot of my interests overlap his own quite closely. I was also somewhat influenced in my choice of title by a previous presentation of Rod's from some 5 years back, "Why aren't we there yet?" which provides a slightly pessimistic counterpoint to my own perhaps more optimistic summary.

I decided to construct the talk around 5 areas: compilations of taxonomic names and associated valid/accepted taxa; links to the literature (original citations, descriptions, more); machine-addressable lists of taxon traits; compilations of georeferenced species data points such as OBIS and GBIF; and synoptic efforts in the environmental niche modelling area (all or many species, so as to be able to produce global biodiversity as well as single-species maps). Without recapping the entire content of my talk (which you can find on SlideShare), I thought I would share with readers of this blog some of the more interesting conclusions, many of which are not readily available elsewhere, at least not without some effort to chase them down and/or make educated guesses.

In the area of taxonomic names, for animals (sensu lato) ION has gone up from 1.8m to 5.2m names (2.8m to 3.5m indexed documents) from all ranks (synonyms not distinguished) over the cited period 2005-2015, while Catalogue of Life has gone up from 0.5m species names + ?? synonyms to 1.6m species names + 1.3m synonyms over the same period; for fossils, BioNames database is making some progress in linking ION names to external resources on the web but, at less than 100k such links, is still relatively small scale and without more than a single-operator level of resourcing. A couple of other "open access" biological literature indexing activities are still at a modest level (e.g. 250k-350k citations, as against an estimated biological literature of perhaps 20m items) at present, and showing few signs of current active development (unless I have missed them of course).

Comprehensive databases of taxon traits (in machine-addressable form) appear to have started with the author’s own "IRMNG" genus- and species-level compendium, which was initially tailored to OBIS needs for simply differentiating organisms into extant vs. fossil, marine vs. nonmarine. More comprehensive indexes exist for specific groups, and recently Encyclopedia of Life has established "TraitBank", which is making some headway, although some of the "traits" such as geographic distribution (a bounding box from either GBIF or OBIS) and "number of GenBank sequences" stretch the concept of trait a little (just my two cents' worth, of course), and the newly created linkage to Google searches is to be applauded.

With regard to aggregating georeferenced species data (specimens and observations), both OBIS (marine taxa only) and GBIF (all taxa) have made quite a lot of progress over the past ten years, OBIS increasing its data holdings ninefold from 5.6m to 44.9m (from 38 to 1,900+ data providers) and GBIF more than tenfold from 45m to 577m records over the same period, from 300+ to over 15k providers. While these figures look healthy there are still many data gaps in holdings e.g. by location sampled, year/season, ocean depth, distance to land etc. and it is probably a fair question to ask what is the real "end point" for such projects, i.e. somewhere between "a record for every species" and "a record for every individual of every species", perhaps...

Global / synoptic niche modelling projects known to the author basically comprise Lifemapper for terrestrial species and AquaMaps for marine taxa (plus some freshwater). Lifemapper claims "data for over 100,000 species" but it is unclear whether this corresponds to the number of completed range maps available at this time, while AquaMaps has maps for over 22,000 species (fishes, marine mammals and invertebrates, with an emphasis on fishes) each of which has a point data map, a native range map clipped to where the species is believed to occur, an "all suitable habitat map" (the same unclipped) and a "year 2100 map" showing projected range changes under one global warming scenario. Mapping parameters can also be adjusted by the user using an interactive "create your own map" function, and stacking all completed maps together produces plots of computed ocean biodiversity plus the ability to undertake web-based "what [probably] lives here" queries for either all species or for particular groups. Between these two projects (which admittedly use different modelling methodologies but both should produce useful results as a first pass) the state of synoptic taxon modelling actually appears quite good, especially since there are ongoing workshops e.g. the recent AMNH/GBIF workshop Biodiversity Informatics and Modelling Species Distributions at which further progress and stumbling blocks can be discussed.

So, some questions arising:

  • Who might produce the best "single source" compendium of expert-reviewed species lists, for all taxa, extant and fossil, and how might this happen? (my guess: a consortium of Catalogue of Life + PaleoBioDB at some future point)
  • Will this contain links to the literature, at least citations but preferably as online digital documents where available? (CoL presently no, PaleoBioDB has human-readable citations only at present)
  • Will EOL increasingly claim the "TraitBank" space, and do a good enough job of it? (also bearing in mind that EOL is still an aggregator, not an original content creator, i.e. somebody still has to create it elsewhere)
  • Will OBIS and/or GBIF ever be "complete", and how will we know when we’ve got there (or, how complete is enough for what users might require)?
  • Same for niche modelling/predicted species maps: will all taxa eventually be covered, and will the results be (generally) reliable and useable (and at what scale); or, what more needs to be done to increase map quality and reliability.

Opinions, other insights welcome!

Wednesday, November 18, 2015

Comments on "Widespread mistaken identity in tropical plant collections"

Zoë A. Goodwin (@Drypetes) and colleagues have published a paper with a title guaranteed to get noticed:

Goodwin, Z. A., Harris, D. J., Filer, D., Wood, J. R. I., & Scotland, R. W. (2015, November). Widespread mistaken identity in tropical plant collections. Current Biology. Elsevier BV. http://doi.org/10.1016/j.cub.2015.10.002

Their paper argues that "more than half of all tropical plant collections may be wrongly named." This is clearly a worrying conclusion with major implications for aggregators such as GBIF that get the bulk of their data (excluding birds) from museums and herbaria.

I'm quite prepared to accept that there are going to be considerable problems with herbarium and museum labels, but there are aspects of this study that are deeply frustrating.

Where's the data?

The authors don't provide any data! This is difficult to understand, especially as they downloaded data from GBIF, which provides DOIs for each and every download. Why don't the authors cite those DOIs (which would enable others to grab the same data, and also ultimately provide a way to give credit to the original data providers)? The authors obtained data from multiple herbaria, matched specimens that were the same, and compared their taxonomic names. This is a potentially very useful data set, but the authors don't provide it. Anybody wanting to explore the problem immediately hits a brick wall.

Unpublished taxonomy

The first group of plants the authors looked at is Aframomum, and they often refer to a recent monograph of this genus, which is cited as "Harris, D.J., and Wortley, A.H. (In Press). Monograph of Aframomum (Zingiberaceae). Syst. Bot. Monogr.". As far as I can tell, this hasn't been published. This not only makes it hard for the reader to investigate further, it means the authors mention a name in the paper that doesn't seem to have been published:
In 2014 the plant was recognized as a new species, Aframomum lutarium D.J.Harris & Wortley, by Harris & Wortley as part of the revision of the genus Aframomum
I confess ignorance of the Botanical Code of Nomenclature, but in zoology this is a no-no.

What specimen is shown in Fig. 1?

Figure 1 shows a herbarium specimen, but there's no identifier or link for the specimen. Is this specimen available online? Can I see it in GBIF? Can I see its history and explore further? If not, why not? If it's not available online, why not pick one that is?

What is "wrong"?

The authors state:
Examination of the 560 Ipomoea names associated with 49,500 specimens in GBIF (Figure S1A) revealed a large proportion of the names to be nomenclatural and taxonomic synonyms (40%), invalid, erroneous or unrecognised names (16%, ‘invalid’ in Figure S1A). In addition, 11% of the specimens in GBIF were unidentified to species.
Are synonyms wrong? If it's a nomenclatural synonym, then it's effectively the same name. If it's a taxonomic synonym, then is that "wrong"? Identifications occur at a given time, and our notion of what constitutes a taxon can change over time. It's one thing to say a specimen has been assigned to a taxon which we now regard as a synonym of another, quite another to say that a specimen has been wrongly identified. What are "invalid, erroneous or unrecognised names"? Are these typos, or genuinely erroneous names? Once again, if the authors provided the data we could investigate. But they haven't, so we can't examine whether their definition of "wrong" is reasonable.

I'm all for people flagging problems (after all, I've made something of a career of it), but surely one reason for flagging problems is so that they can be recognised and fixed. By basing their results on an unpublished monograph, and by not providing any data in support of their conclusions, the authors prevent those of us interested in fixing problems from being able to drill down and understand the nature of the problem. If the authors had both published the monograph and provided the data they would have done the community a real service. Instead we are left with a study with a click-bait title that will get lots of attention, but which doesn't provide any way for people to make progress on the problem the paper identifies.