Tuesday, December 01, 2015

Guest post: 10 years of global biodiversity databases: are we there yet?

This guest post by Tony Rees explores some of the themes from his recent talk 10 years of Global Biodiversity Databases: Are We There Yet?.

A couple of months ago I received an invitation to address the upcoming 2015 meeting of the Malacological Society of Australasia (Australian and New Zealand mollusc persons for the uninitiated) on some topic associated with biodiversity databases, and I decided that a decadal review might be an interesting exercise, both for my potential audience (perhaps) and for my own interest (definitely). Well, the talk is delivered and the slides are available on the web for viewing if interested, and Rod has kindly invited me to present some of its findings here, and possibly stimulate some ongoing discussion since a lot of my interests overlap his own quite closely. I was also somewhat influenced in my choice of title by a previous presentation of Rod's from some 5 years back, "Why aren't we there yet?" which provides a slightly pessimistic counterpoint to my own perhaps more optimistic summary.

I decided to construct the talk around 5 areas: compilations of taxonomic names and associated valid/accepted taxa; links to the literature (original citations, descriptions, more); machine-addressable lists of taxon traits; compilations of georeferenced species data points such as OBIS and GBIF; and synoptic efforts in the environmental niche modelling area (all or many species, so as to be able to produce global biodiversity as well as single-species maps). Without recapping the entire content of my talk (which you can find on SlideShare), I thought I would share with readers of this blog some of the more interesting conclusions, many of which are not readily available elsewhere, at least not without some effort to chase them down and/or make educated guesses.

In the area of taxonomic names, for animals (sensu lato) ION has grown from 1.8m to 5.2m names (and from 2.8m to 3.5m indexed documents) across all ranks (synonyms not distinguished) over the cited period 2005-2015, while Catalogue of Life has gone from 0.5m species names plus an unknown number of synonyms to 1.6m species names plus 1.3m synonyms over the same period. For fossils, the BioNames database is making some progress in linking ION names to external resources on the web but, at less than 100k such links, is still relatively small in scale and has no more than a single-operator level of resourcing. A couple of other "open access" biological literature indexing activities are still at a modest level (e.g. 250k-350k citations, as against an estimated biological literature of perhaps 20m items) and show few signs of current active development (unless I have missed them, of course).

Comprehensive databases of taxon traits (in machine-addressable form) appear to have started with the author's own "IRMNG" genus- and species-level compendium, which was initially tailored to OBIS's need simply to differentiate organisms as extant vs. fossil and marine vs. nonmarine. More comprehensive indexes exist for specific groups, and recently the Encyclopedia of Life has established "TraitBank", which is making some headway, although some of its "traits", such as geographic distribution (a bounding box from either GBIF or OBIS) and "number of GenBank sequences", stretch the concept of a trait a little (just my two cents' worth, of course); the newly created linkage to Google searches is to be applauded.

With regard to aggregating georeferenced species data (specimens and observations), both OBIS (marine taxa only) and GBIF (all taxa) have made a lot of progress over the past ten years: OBIS has increased its data holdings roughly eightfold, from 5.6m to 44.9m records (and from 38 to 1,900+ data providers), while GBIF has grown more than tenfold, from 45m to 577m records and from 300+ to over 15k providers, over the same period. While these figures look healthy, there are still many gaps in the holdings, e.g. by location sampled, year/season, ocean depth, and distance to land, and it is probably a fair question to ask what the real "end point" for such projects is, i.e. somewhere between "a record for every species" and "a record for every individual of every species", perhaps...

Global/synoptic niche modelling projects known to the author essentially comprise Lifemapper for terrestrial species and AquaMaps for marine taxa (plus some freshwater). Lifemapper claims "data for over 100,000 species", though it is unclear whether this corresponds to the number of completed range maps currently available. AquaMaps has maps for over 22,000 species (fishes, marine mammals and invertebrates, with an emphasis on fishes), each of which has a point-data map, a native range map clipped to where the species is believed to occur, an "all suitable habitat" map (the same, unclipped), and a "year 2100" map showing projected range changes under one global warming scenario. Mapping parameters can also be adjusted by the user via an interactive "create your own map" function, and stacking all completed maps together produces plots of computed ocean biodiversity, plus the ability to run web-based "what [probably] lives here" queries for either all species or particular groups. Between these two projects (which admittedly use different modelling methodologies, but both should produce useful results as a first pass) the state of synoptic taxon modelling actually looks quite good, especially since there are ongoing workshops, e.g. the recent AMNH/GBIF workshop Biodiversity Informatics and Modelling Species Distributions, at which further progress and stumbling blocks can be discussed.

So, some questions arising:

  • Who might produce the best "single source" compendium of expert-reviewed species lists, for all taxa, extant and fossil, and how might this happen? (My guess: a consortium of Catalogue of Life + PaleoBioDB at some future point.)
  • Will this contain links to the literature, at least citations but preferably as online digital documents where available? (CoL presently no, PaleoBioDB has human-readable citations only at present)
  • Will EOL increasingly claim the "TraitBank" space, and do a good enough job of it? (also bearing in mind that EOL is still an aggregator, not an original content creator, i.e. somebody still has to create it elsewhere)
  • Will OBIS and/or GBIF ever be "complete", and how will we know when we’ve got there (or, how complete is enough for what users might require)?
  • The same for niche modelling/predicted species maps: will all taxa eventually be covered, and will the results be (generally) reliable and usable (and at what scale)? Or, what more needs to be done to increase map quality and reliability?

Opinions, other insights welcome!

Wednesday, November 18, 2015

Comments on "Widespread mistaken identity in tropical plant collections"

Zoë A. Goodwin (@Drypetes) and colleagues have published a paper with a title guaranteed to get noticed:

Goodwin, Z. A., Harris, D. J., Filer, D., Wood, J. R. I., & Scotland, R. W. (2015, November). Widespread mistaken identity in tropical plant collections. Current Biology. Elsevier BV. http://doi.org/10.1016/j.cub.2015.10.002

Their paper argues that "more than half of all tropical plant collections may be wrongly named." This is clearly a worrying conclusion with major implications for aggregators such as GBIF that get the bulk of their data (excluding birds) from museums and herbaria.

I'm quite prepared to accept that there are going to be considerable problems with herbarium and museum labels, but there are aspects of this study that are deeply frustrating.

Where's the data?

The authors don't provide any data! This is difficult to understand, especially as they downloaded data from GBIF, which provides DOIs for each and every download. Why don't the authors cite those DOIs, which would enable others to grab the same data and would ultimately provide a way to credit the original data providers? The authors obtained data from multiple herbaria, matched specimens that were the same, and compared their taxonomic names. This is a potentially very useful data set, but the authors don't provide it. Anybody wanting to explore the problem immediately hits a brick wall.

Unpublished taxonomy

The first group of plants the authors looked at is Aframomum, and they often refer to a recent monograph of this genus, cited as "Harris, D.J., and Wortley, A.H. (In Press). Monograph of Aframomum (Zingiberaceae). Syst. Bot. Monogr.". As far as I can tell, this hasn't been published. This not only makes it hard for the reader to investigate further, it means the authors mention a name in the paper that doesn't seem to have been published:
In 2014 the plant was recognized as a new species, Aframomum lutarium D.J.Harris & Wortley, by Harris & Wortley as part of the revision of the genus Aframomum
I confess ignorance of the Botanical Code of Nomenclature, but in zoology this is a no-no.

What specimen is shown in Fig. 1?

Figure 1 shows a herbarium specimen, but there's no identifier or link for the specimen. Is this specimen available online? Can I see it in GBIF? Can I see its history and explore further? If not, why not? If it's not available online, why not pick one that is?

What is "wrong"?

The authors state:
Examination of the 560 Ipomoea names associated with 49,500 specimens in GBIF (Figure S1A) revealed a large proportion of the names to be nomenclatural and taxonomic synonyms (40%), invalid, erroneous or unrecognised names (16%, ‘invalid’ in Figure S1A). In addition, 11% of the specimens in GBIF were unidentified to species.
Are synonyms wrong? If it's a nomenclatural synonym, then it's effectively the same name. If it's a taxonomic synonym, is that "wrong"? Identifications are made at a given time, and our notion of what constitutes a taxon can change over time. It's one thing to say a specimen has been assigned to a taxon that we now regard as a synonym of another, quite another to say that a specimen has been wrongly identified. And what are "invalid, erroneous or unrecognised names"? Are these typos, or genuinely erroneous names? Once again, if the authors provided the data we could investigate. But they haven't, so we can't examine whether their definition of "wrong" is reasonable.
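Since the data aren't provided, here is the kind of check anyone could run for themselves: a minimal sketch that classifies a name as accepted, synonym, or unmatched against the GBIF backbone taxonomy using the api.gbif.org species-match service. The example names are arbitrary, and the response fields are as I understand GBIF's API documentation.

```python
# Sketch: classify a name against the GBIF backbone taxonomy using the
# /v1/species/match endpoint. Fields (matchType, synonym) follow GBIF's
# API documentation; the example names are arbitrary.
import requests

def classify_name(name):
    r = requests.get(
        "https://api.gbif.org/v1/species/match",
        params={"name": name},
    )
    r.raise_for_status()
    match = r.json()
    if match.get("matchType") == "NONE":
        return "unmatched"   # a candidate "invalid/unrecognised" name
    if match.get("synonym"):
        return "synonym"     # nomenclatural or taxonomic synonym
    return "accepted"

for name in ["Ipomoea batatas", "Ipomoea alba", "Ipomoea nonsense"]:
    print(name, "->", classify_name(name))
```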

I'm all for people flagging problems (after all, I've made something of a career of it), but surely one reason for flagging problems is so that they can be recognised and fixed. By basing their results on an unpublished monograph, and by not providing any data in support of their conclusions, the authors prevent those of us interested in fixing problems from being able to drill down and understand the nature of the problem. If the authors had both published the monograph and provided the data, they would have done the community a real service. Instead we are left with a study with a clickbait title that will get lots of attention, but which doesn't provide any way for people to make progress on the problem the paper identifies.

Wednesday, September 23, 2015

Visualising big phylogenies (yet again)

Inspired in part by the release of the draft tree of life (doi:10.1073/pnas.1423041112) by the Open Tree of Life project, I've been revisiting (yet again) ways to visualise very big phylogenies (see Very large phylogeny viewer for my last attempt).

My latest experiment uses Google Maps to render a large tree. Google Maps uses "tiles" to create a zoomable interface, so we need to create tiles for the phylogeny at different zoom levels. To create this visualisation I did the following:

  1. The phylogeny is laid out in a 256 x 256 grid.
  2. The position of each line in the drawing is stored in a MySQL database as a spatial element (in this case a LINESTRING).
  3. When the Google Maps interface needs to display a tile at a given zoom level and location, a spatial SQL query retrieves the lines that intersect the bounds of the tile, then draws them using SVG.
Hence, the tiles are drawn on the fly rather than stored as images on disk (see the sketch after the list below). This is a crude proof of concept so far; there are a few issues to tackle to make it usable:
  1. At the moment there are no labels. I would need a way to compute what labels to show at each zoom level ("level of detail"). In other words, at low zoom levels we want just a few higher taxa to be picked out, whereas as we zoom in we want more and more taxa to be labelled, until at the highest zoom levels the tips are individually labelled.
  2. Ideally each tile would require roughly the same amount of effort to draw. However, at the moment the code is very crude and simply draws every line that intersects a tile. For example, for zoom level 0 the entire tree fits on a single tile, so I draw the entire tree. This is not going to scale to very large trees, so I need to be able to "cull" a lot of the lines and draw only those that will be visible.
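To make the tile-drawing step concrete, here is a minimal sketch of the approach, assuming a MySQL table tree_lines(id, geom LINESTRING) that holds the tree drawing in 256 x 256 "world" coordinates. The table and column names are hypothetical, and the spatial function names are those of MySQL 5.7 (MBRIntersects, ST_GeomFromText, ST_AsText).

```python
# Sketch: render one 256x256 Google Maps tile as SVG from lines stored
# in MySQL. Tile (x, y) at zoom z covers a square of the 256x256 world.
import pymysql

TILE = 256      # tile size in pixels
WORLD = 256.0   # the tree is laid out on a 256 x 256 world grid

def tile_svg(conn, z, x, y):
    span = WORLD / (2 ** z)              # world units covered by one tile
    x0, y0 = x * span, y * span
    scale = TILE / span                  # world units -> tile pixels
    wkt = (f"POLYGON(({x0} {y0},{x0 + span} {y0},{x0 + span} {y0 + span},"
           f"{x0} {y0 + span},{x0} {y0}))")
    sql = """SELECT ST_AsText(geom) FROM tree_lines
             WHERE MBRIntersects(geom, ST_GeomFromText(%s))"""
    paths = []
    with conn.cursor() as cur:
        cur.execute(sql, (wkt,))
        for (linestring,) in cur.fetchall():
            # "LINESTRING(x1 y1,x2 y2,...)" -> SVG polyline points
            coords = linestring[len("LINESTRING("):-1].split(",")
            pts = " ".join(
                f"{(float(px) - x0) * scale:.1f},{(float(py) - y0) * scale:.1f}"
                for px, py in (c.split() for c in coords))
            paths.append(f'<polyline points="{pts}" stroke="black" fill="none"/>')
    return (f'<svg xmlns="http://www.w3.org/2000/svg" width="{TILE}" '
            f'height="{TILE}">' + "".join(paths) + "</svg>")
```

Culling for level of detail could then be just an extra condition in the WHERE clause (e.g. only draw lines longer than some minimum length at a given zoom).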
In the past I've steered away from Google Maps-style interfaces because the image is zoomed along both the x and y axes, which is not necessarily ideal. But the tradeoff is that I don't have to do any work handling user interactions; I just need to focus on efficiently rendering the image tiles.

All very crude, but I think this approach has potential, especially if the "level of detail" issue can be tackled.

Friday, September 18, 2015

Towards an interactive web-based phylogeny editor (à la MacClade)

Currently in classes where I teach the basics of tree building, we still fire up ancient iMacs, load up MacClade, and let the students have a play. Typically we give them the same data set and have a class competition to see which group can get the shortest tree by manually rearranging the branches. It’s fun, but the computers are old, and what’s nostalgic for me seems alien to the iPhone generation.

One thing I've always wanted is a simple MacClade-like tree editor for the Web, where the aim is not so much character analysis as teaching the basics of tree building. Something with the ease of use of Phylo (basically Candy Crush for sequence alignments).


The challenge is to keep things as simple as possible. One idea is to have a column of taxa, where you can drag individual taxa up and down to rearrange the tree.

Interactive tree design notes

Imagine each row has the characters and their states. Unlike the Phylo game, where the goal is to slide the amino acids until you get a good alignment, here we want to move the taxa to improve the tree (e.g., based on its parsimony score).

The problem is that we need to be able to generate all possible rearrangements for a given number of taxa. In the example above, if we move taxon C, there are five possible positions it could go on the remaining subtree:

But if we simply shuffle the order of the taxa we can't generate all the trees. However, if we remember that we also have the internal nodes, there is a simple way to generate them. When we draw a tree, each row corresponds to a node, and the gap between each pair of adjacent leaves (the taxa A, B, D) corresponds to an internal node. So we can divide the drawing up into "hit zones": if you drag the taxon we're adding ("C") onto the zone centred on a leaf, we add the taxon below that leaf; if you drag it onto a zone between two leaves, we attach it at the corresponding internal node. From the user's point of view they are still simply sliding taxa up and down, but in doing so we can create each of the possible trees.
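A minimal sketch of the hit-zone idea (the row height and zone boundaries are arbitrary choices for illustration):

```python
# Sketch: map the y position of a dragged taxon to a "hit zone".
# Dropping on the middle of a leaf's row attaches the new taxon below
# that leaf; dropping in the band between two rows attaches it at the
# internal node between the neighbouring leaves.
ROW_HEIGHT = 40  # pixels per leaf row (arbitrary)

def hit_zone(drop_y, leaves):
    """Return ('leaf', i) or ('internal', i) for a drop at pixel drop_y."""
    row = drop_y / ROW_HEIGHT
    i = int(row)
    frac = row - i
    if i >= len(leaves):                 # below the last leaf
        return ("leaf", len(leaves) - 1)
    if frac < 0.25 and i > 0:
        return ("internal", i - 1)       # gap between leaves i-1 and i
    if frac > 0.75 and i < len(leaves) - 1:
        return ("internal", i)           # gap between leaves i and i+1
    return ("leaf", i)                   # the leaf row itself

print(hit_zone(55, ["A", "B", "D"]))     # ('leaf', 1): attach below B
print(hit_zone(35, ["A", "B", "D"]))     # ('internal', 0): between A and B
```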

We could implement this in a Web browser with some JavaScript to handle the user moving the taxa, draw the corresponding phylogeny to the left, and quickly update the (say, parsimony) score of the tree, so that the user gets immediate feedback as to whether the rearrangement they've made improves the tree.
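For the scoring step, Fitch parsimony is simple enough to recompute from scratch after every rearrangement. A sketch for a single character, with the tree as nested tuples (a toy representation of my own, not from any particular library):

```python
# Sketch: Fitch parsimony score for one character on a binary tree given
# as nested 2-tuples, e.g. ((("A", "C"), "B"), "D"). states maps each
# leaf name to its character state.
def fitch(tree, states):
    """Return (state_set, changes) for the subtree under Fitch parsimony."""
    if isinstance(tree, str):                  # a leaf
        return {states[tree]}, 0
    left, right = tree
    lset, lcost = fitch(left, states)
    rset, rcost = fitch(right, states)
    common = lset & rset
    if common:                                 # no change needed here
        return common, lcost + rcost
    return lset | rset, lcost + rcost + 1      # one extra state change

tree = ((("A", "C"), "B"), "D")
states = {"A": "0", "B": "0", "C": "1", "D": "1"}
print(fitch(tree, states)[1])                  # score for this tree: 2
```

Summing this over all characters gives the tree length the students are trying to minimise.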

I think this could be a fun teaching tool, and if it supported touch then students could use their phones and tablets to get a sense of how tree building works.

Thursday, September 17, 2015

On having multiple DOI registration agencies for the same journal

On Friday I discovered that BHL has started issuing CrossRef DOIs for articles, starting with the journal Revue Suisse de Zoologie. The metadata for these articles comes from BioStor. After a WTF and WWIC moment, I tweeted about this, and something of a Twitter storm (and email storm) ensued.

To be clear, I'm very happy that BHL is finally assigning article-level DOIs, and that it is doing this via CrossRef. Readers of this blog may recall an earlier discussion about the relative merits of different types of DOIs, especially in the context of identifiers for articles. The bulk of the academic literature has DOIs issued by CrossRef, and these come with lots of nice services that make them a joy to use if you are a data aggregator like me. There are other DOI registration agencies minting DOIs for articles, such as Airiti Library in Taiwan (e.g., doi:10.6165/tai.1998.43(2).150) and ISTIC (中文DOI) in China (e.g., doi:10.3969/j.issn.1000-7083.2014.05.020). (Pro tip: if you want to find out the registration agency for a DOI, simply append it to http://doi.crossref.org/doiRA/, e.g. http://doi.crossref.org/doiRA/10.6165/tai.1998.43(2).150.) These agencies provide stable identifiers, but not the services needed to match existing bibliographic data to the corresponding DOI (as I discovered to my cost while working with IPNI).
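The pro tip above is easy to script. A minimal sketch (as far as I know, the doiRA service returns a JSON list with an "RA" field for each DOI):

```python
# Sketch: look up the registration agency for a DOI via the
# doi.crossref.org/doiRA/ service mentioned above.
import requests

def registration_agency(doi):
    r = requests.get("https://doi.crossref.org/doiRA/" + doi)
    r.raise_for_status()
    return r.json()[0].get("RA", "unknown")

print(registration_agency("10.6165/tai.1998.43(2).150"))            # Airiti
print(registration_agency("10.3969/j.issn.1000-7083.2014.05.020"))  # ISTIC
```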

However, now things get a little messy. From 2015, PDFs for Revue Suisse de Zoologie are being uploaded to Zenodo and are getting DataCite DOIs there (e.g., doi:10.5281/zenodo.30012). This means that the most recent articles in this journal will not have CrossRef DOIs. From my perspective this is a disappointing move. It removes the journal from the CrossRef ecosystem at a time when uptake of CrossRef DOIs by taxonomic journals is at an all-time high (both ZooKeys and Zootaxa have CrossRef DOIs), and when BHL is starting to issue CrossRef DOIs for the "legacy" literature (bear in mind that "legacy" in this context can mean articles published last year).

I've rehearsed elsewhere the reasons why I think CrossRef DOIs are best, but the key points are that articles become much easier to discover (e.g., using http://search.crossref.org), and are automatically first-class citizens of the academic literature. However, not everybody buys these arguments.

Maybe a way forward is to treat the two types of DOI as identifying two different things. The CrossRef DOI identifies the article, not a particular representation. The Zenodo DOI (or any DataCite DOI) for a PDF identifies that representation (i.e., the PDF), not the article.

Having CrossRef and Zenodo/DataCite DOIs coexist

This would enable CrossRef and Zenodo DOIs to coexist, providing we have some way of describing the relationship between the two kinds of DOI (e.g., CrossRef DOI - hasRepresentation -> Zenodo DOI).
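As a sketch of what such a description might look like (the predicate "hasRepresentation" is the one suggested above; the article DOI here is hypothetical, while the Zenodo DOI is the real one cited earlier):

```python
# Sketch: the CrossRef DOI identifies the article, the DataCite (Zenodo)
# DOI identifies one representation of it (the PDF).
article = {
    "doi": "10.1234/example.article",        # hypothetical CrossRef DOI
    "type": "journal-article",
    "hasRepresentation": [
        {"doi": "10.5281/zenodo.30012",      # DataCite DOI for the PDF
         "format": "application/pdf"},
    ],
}
```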

This would give those who want the biodiversity literature to be part of the wider CrossRef community the freedom to mint CrossRef DOIs. It gives those articles the benefits that come with CrossRef DOIs (findability, being included in lists of literature cited, citation statistics, customer support when DOIs break, altmetrics, etc.).

It would also enable those who want to ensure stable access to the contents of the biodiversity literature to use archives such as Zenodo, and have the benefits of those DOIs (stability, altmetrics, free file storage and free DOIs).

Having multiple DOIs for the same thing is, I'd argue, at the very least unhelpful. But if we tease apart the notion of what we are identifying, maybe they can coexist. Otherwise I think we are in danger of making choices that, while they seem locally optimal (e.g., free storage and minting of DOIs), may in the long run cause problems and run counter to the goal of making the taxonomic literature as findable as the wider literature.

Friday, September 11, 2015

Possible project: natural language queries, or answering "how many species are there?"

Google knows how many species there are. More significantly, it knows what I mean when I type in "how many species are there". Wouldn't it be nice to be able to do this with biodiversity databases? For example, how many species of insect are found in Fiji? How would you answer this question? I guess you'd Google it, looking for a paper. Or you'd look in vain on GBIF, and then end up hacking some API queries to process data and come up with an estimate. Why can't we just ask?

On the face of it, natural language queries are hard, but there's been a lot of work done in this area. Furthermore, there's a nice connection with the idea of knowledge graphs. One approach to natural language parsing is to convert a natural language query into a path in a knowledge graph (or, if you're Facebook, the social graph). Facebook has some nice posts describing how their graph search works (e.g., Under the Hood: Building out the infrastructure for Graph Search), and there's a paper describing some of the infrastructure (e.g., "Unicorn: a system for searching the social graph" doi:10.14778/2536222.2536239, get the PDF here).

Natural language queries can seem potentially unbounded, in the sense that the user could type in anything. But there are ways to constrain this, and ways to anticipate what the user is after. For example, Google suggests what you may be after, which gives us clues as to the sort of questions we'd need answers for. It would be a fun exercise to use Google suggest to discover what questions people are asking about biodiversity, then determine what it would take to be able to answer them.
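As a sketch, Google's query suggestions can be harvested with the unofficial "suggestqueries" endpoint (undocumented, so it may change or disappear without notice):

```python
# Sketch: see what people ask Google about biodiversity by harvesting
# autocomplete suggestions for a prefix.
import requests

def suggestions(prefix):
    r = requests.get(
        "https://suggestqueries.google.com/complete/search",
        params={"client": "firefox", "q": prefix},
    )
    r.raise_for_status()
    return r.json()[1]   # response is [prefix, [suggestion, ...], ...]

for q in suggestions("how many species of"):
    print(q)
```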

Google's suggestions are all very sensible questions that existing biodiversity databases would struggle to answer.

There's a nice presentation by Kenny Bastani in which he tackles the problem of bounding the set of possible questions by first generating the questions for which he has answers, then caching those so that the user can select from them (using, for example, a type-ahead interface).

Hence, we could generate species counts for all major and/or charismatic taxa for each country, habitat type, or other meaningful category, and then generate the corresponding queries (e.g., "how many species of birds are there in Fiji", where "birds" and "Fiji" are the terms we replace for each query), as sketched below.
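A sketch of that generate-and-cache approach (the counts here are made-up placeholders; in practice they would be precomputed from GBIF or similar):

```python
# Sketch: pre-generate question strings and answers from cached counts,
# so a type-ahead interface can offer only questions we can answer.
species_counts = {   # hypothetical numbers, for illustration only
    ("birds", "Fiji"): 80,
    ("insects", "Fiji"): 3500,
}

questions = {
    f"how many species of {taxon} are there in {place}": count
    for (taxon, place), count in species_counts.items()
}

query = "how many species of birds are there in Fiji"
print(questions.get(query, "no cached answer"))   # -> 80
```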

One reason this topic appeals to me is that it is intimately linked to the idea of a biodiversity knowledge graph, in that the answers to a number of questions in biodiversity can be framed as paths in that graph. So, if we build the graph we should also be asking about ways to query it. In particular, how do we answer the most basic questions about the information we are aggregating in myriad databases?

Monday, September 07, 2015

Wikidata, Wikipedia, and #wikisci

Last week I attended the Wikipedia Science Conference (hashtag: #wikisci) at the Wellcome Trust in London. It was an interesting two days of talks and discussion. Below are a few random notes on topics that caught my eye.

What is Wikidata?

A recurring theme was the emergence of Wikidata, although it never really seemed clear what role Wikidata saw for itself. On the one hand, it seems to have a clear purpose:
Wikidata acts as central storage for the structured data of its Wikimedia sister projects including Wikipedia, Wikivoyage, Wikisource, and others.
At other times there was a sense that Wikidata wanted to take any and all data, which it doesn't really seem geared up to do. The English-language Wikipedia has nearly 5 million articles, but there are lots of scientific databases that dwarf that in size (we have at least that many taxonomic names, for example). So, when Dario Taraborelli suggests building a repository of all citations with Wikidata, does he really mean ALL citations in the academic literature? CrossRef alone has 75M DOIs, whereas Wikidata currently has 14.8M pages, so we are talking about greatly expanding the size of Wikidata with just one type of data.

The sense I get is that Wikidata will have an important role in (a) structuring data in Wikipedia, and (b) providing tools for people to map their data to the equivalent topics in Wikipedia. Both are very useful goals. What I find less obvious is whether (and if so, how) Wikidata aims to be a more global database of facts.

How do you Snapchat? You just Snapchat

As a relative outsider to the Wikipedia community, and having had a sometimes troubled experience with Wikipedia, it struck me how opaque things are if you are an outsider. I suspect this is true of most communities: if you are a member, things seem obvious; if you're not, it takes time to find out how things are done. Wikipedia is a community with nobody in charge, which is a strength, but can also be frustrating. The answer to pretty much any question about how to add data to Wikidata, how to add data types, etc., was "ask the community". I'm reminded of the American complaint about the European Union: "if you pick up the phone to call Europe, who do you call?". In order to engage you have to invest time in discovering the relevant part of the community, and then learn to engage with it. This can be time-consuming, and is a different approach to either having to satisfy the requirements of gatekeepers, or a decentralised approach where you can simply upload whatever you want.


Everything is a stream

It seems that everything is becoming a stream. Once the volume of activity reaches a certain point, people stop talking about downloading static datasets and talk instead of consuming a stream of data (much like the Twitter firehose). The volume of Wikipedia edits means that scientists studying the growth of Wikipedia are now consuming streams. Geoffrey Bilder of CrossRef showed some interesting visualisations of real-time streams of DOIs being cited as users edited Wikipedia pages (http://events.labs.crossref.org/events/types/WikipediaCitation), and Peter Murray-Rust of ContentMine seemed to imply that ContentMine is going to generate streams of facts (rather than, say, a queryable database of facts). Once we get to the stage of having large, transient volumes of data, all sorts of issues about reanalysis and reproducibility arise.

CrossRef and evidence

One of the other striking visualisations CrossRef has is the DOI Chronograph, which displays the number of CrossRef DOI resolutions by the domain of the hosting web site. In other words, if you are on a Wikipedia page and click on a DOI for an article, that is recorded as a DOI resolution from the domain "wikipedia.org". For the period 1 October 2010 to 1 May 2015 Wikipedia was the source of 6.8 million clicks on DOIs, see http://chronograph.labs.crossref.org/domains/wikipedia.org. One way to interpret this is as a measure of how many people are accessing the primary literature, the underlying evidence, for assertions made on Wikipedia pages. We can compare this with the results for, say, biodiversity informatics projects. For example, EOL had just 585(!) DOI clicks for the period 15 October 2010 to 30 April 2015. There are all sorts of reasons for the difference between these two sites, such as Wikipedia having vastly more traffic than EOL, but I think it also reflects the fact that many Wikipedia articles are richly referenced with citations to the primary literature, whereas projects like EOL are very poorly linked to that literature. Indeed, most biodiversity databases are divorced from the evidence behind the data they display.

Diversity and a revolution led by greybeards

"Diversity" is one of those words that has become politicised, and attempts to promote "diversity" can get awkward ("let's hear from the women", that homogeneous category of non-men). But the aspect of diversity that struck me was age-related. In discussions that involved fellow academics, invariably they looked a lot like me - old(ish), relatively well-established and secure in their jobs (or post-job). This is a revolution led not by the young, but by the greybeards. That's a worry. Perhaps it's a reflection of the pressures on young or early-stage scientists to get papers into high-impact factor journals, get grants, and generally play the existing game, whereas exploring new modes of publishing, output, impact, and engagement have real risks and few tangible rewards if you haven't yet established yourself in academia.