Monday, April 21, 2014

Is collecting specimens necessary?

Some interesting threads in TAXACOM today (yes, really). The following article has appeared in Science:
Minteer, B. A., Collins, J. P., Love, K. E., & Puschendorf, R. (2014, April 18). Avoiding (Re)extinction. Science. American Association for the Advancement of Science (AAAS). doi:10.1126/science.1250953 (paywall)
The authors argue that "The availability of adequate alternative methods of documentation, including high-resolution photography, audio recording, and nonlethal sampling, provide an opportunity to revisit and reconsider field collection practices and policies."

This has brought a swift response from Kevin Winker, (Re)affirming the specimen gold standard, in which he argues that physical specimens are vital for much of the science that museum collections support.

At the same time, David Schindel has posted on Minimum standards for e-voucher documentation, that is, DNA samples for which no physical voucher exists (e.g., because the organism is a member of an endangered species, or is still alive).

Wednesday, April 16, 2014

Breaking the Biodiversity Heritage Library

The Biodiversity Heritage Library (BHL) has recently introduced a feature that I strongly dislike. The post describing this feature (Inspiring discovery through free access to biodiversity knowledge...) states:

Now BHL is expanding the data model for its portal to be able to accommodate references to content in other well-known repositories. This is highly beneficial to end users as it allows them to search for articles, alongside books and journals, within a single search interface instead of having to search each of these siloes separately.

What this means is that, whereas in the past a search in BHL would only turn up content actually in BHL, now that search may return results from other sources. What's not to like? Well, for me this breaks the fundamental BHL experience that I've come to rely on, namely:

If I find something in BHL I can read it there and then

With the new feature, the search results may include links to other sources. Sometimes these are useful, but sometimes they are anything but. Once you start including external links in your search results, you have limited control over what those links point to. For example, if I search BHL for the journal Revista Chilena de Historia Natural I get two hits. Cool! If I click on one hit I can read a fairly limited set of scanned volumes in BHL; if I click on the other hit I'm taken to a page at the Digital Library of the Real Jardín Botánico of Madrid. This is a great resource, but the experience is a little jarring. Worse, for this journal the Real Jardín Botánico doesn't actually have any content; instead the "View Book" link takes me to SciELO in Chile, where I can see a list of recent volumes of this journal.

In this case, BHL is basically a link farm that doesn't give me direct access to content, but instead sends me on a series of hops around the Internet until I find something (and I could have gotten there more quickly via Google).

What is wrong with this?


There are two reasons I dislike what BHL have done. The first is that it breaks the experience of search-then-read within a consistent user interface. Now I am presented with different reading experiences, or, indeed, no reading at all, just links to where I might find something to read.

More subtly, it undermines a nice feature of BHL, namely searching by taxonomic names. The content BHL has scanned has also been indexed by taxonomic name, so often I find what I'm looking for not by using bibliographic details (journal name, volume, etc.), which are often a bit messy, but by searching on a name. External content has not been indexed by name, so it can't be found in this way. Whereas before I could be reasonably confident that if BHL had something on a name I could find it by searching for that name (barring OCR errors), now BHL may well have what I'm looking for (in an external source) but can't show me, because it hasn't been indexed.

From my perspective, the things I've come to rely on have been broken by this new feature (and I haven't even begun to talk about how this breaks things I rely on to harvest BHL for article metadata, which I then put into BioStor, which in turn gets fed back into BHL).

What should BHL have done?


To be clear, I'm not arguing against BHL being "able to accommodate references to content in other well-known repositories". Indeed, I wish they'd go further and incorporate content from BHL-Europe, whose portal is, frankly, a mess. Rather, my argument is that they should not have done this within the existing BHL portal. Doing so dilutes the fundamental experience of that portal ("if I find it I can read it").

Here's what I would do instead:
  1. Keep the current BHL portal as it was, with only content actually scanned and indexed by BHL.
  2. Create a new site that indexes all relevant content (e.g., BHL, BHL-Europe, and other repositories).
  3. Model this new portal on something like CrossRef's wonderful metadata search. That is, throw all the metadata into a NoSQL database, add a decent search engine, and provide users with a simple, fast tool.
  4. The portal should clearly distinguish hits that are to BHL content (e.g., by showing thumbnails) and hits that are to external links (and please filter links to links!). A sketch of what a record in such an index might look like follows this list.
  5. Add taxonomic names to the index (you have these for BHL content; adding them for external content is pretty easy).
  6. Even more useful, start indexing full-text content, maybe starting with articles ("parts"). At the moment Google is doing a better job of indexing BHL content (indirectly, via indexing the Internet Archive) than BHL does.
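To make the indexing idea concrete, here is a minimal sketch (as a Javascript object; the field names are my invention, not an actual BHL schema) of what a record in such a unified index might look like:

// Hypothetical record in a unified BHL search index (field names invented).
// "source" lets the portal style BHL hits differently from external links,
// and "names" supports the taxonomic name searches discussed above.
var record = {
  title: "Revista Chilena de Historia Natural",
  source: "bhl",                                  // "bhl" or "external"
  url: "http://www.biodiversitylibrary.org/...",  // where to read or follow
  thumbnail: "http://...",                        // only for scanned BHL content
  names: ["Pinnotheres"],                         // taxonomic names found in the text
  fulltext: "..."                                 // OCR text, for full-text search
};

A search engine built over documents like this could then filter on source ("only show me things I can read right now") as well as on taxonomic names.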

Creating a new tool would also give BHL the freedom to explore some new approaches without annoying users like me who have come to rely on the current portal working in a certain way. Otherwise BHL risks "feature creep", however well motivated.

Thursday, April 10, 2014

User interface to edit a point location

Following on from earlier posts on annotating biodiversity data (Rethinking annotating biodiversity data and More on annotating biodiversity data: beyond sticky notes and wikis) I've started playing with user interfaces for editing data.

For example, here's a simple interface to edit the location of a specimen or observation (inspired by the iNaturalist observation editor). You can play with this below or on bl.ocks.org, and the source code is on GitHub: https://gist.github.com/rdmpage/9951904.
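To give a flavour of the approach, here is a minimal sketch of the same idea using Leaflet (this is not the code in the gist, which you should consult for the real implementation): a draggable marker with a circle showing the uncertainty around the point.

// Minimal sketch of a point-location editor (assumes Leaflet is loaded)
var map = L.map('map').setView([-36.85, 174.76], 10);
L.tileLayer('http://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png').addTo(map);

var marker = L.marker([-36.85, 174.76], { draggable: true }).addTo(map);
var circle = L.circle(marker.getLatLng(), 5000).addTo(map); // 5 km uncertainty

// As the user drags the marker, keep the uncertainty circle centred on it,
// then record the new coordinates when the drag finishes
marker.on('drag', function () {
  circle.setLatLng(marker.getLatLng());
});
marker.on('dragend', function () {
  var pos = marker.getLatLng();
  console.log('New location:', pos.lat, pos.lng);
});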



Tuesday, April 08, 2014

The Experimenter’s Museum: GenBank, Natural History, and the Moral Economies of Biomedicine

An undergraduate student (Aime Rankin) doing a project with me on citation and impact of museum collections came across a paper I hadn't seen before:
Strasser, B. J. (2011, March). The Experimenter’s Museum: GenBank, Natural History, and the Moral Economies of Biomedicine. Isis. University of Chicago Press. doi:10.1086/658657


Unfortunately the paper is behind a paywall, but here's the abstract (you can also get a PDF here):

Today, the production of knowledge in the experimental life sciences relies crucially on the use of biological data collections, such as DNA sequence databases. These collections, in both their creation and their current use, are embedded in the experimentalist tradition. At the same time, however, they exemplify the natural historical tradition, based on collecting and comparing natural facts. This essay focuses on the issues attending the establishment in 1982 of GenBank, the largest and most frequently accessed collection of experimental knowledge in the world. The debates leading to its creation—about the collection and distribution of data, the attribution of credit and authorship, and the proprietary nature of knowledge—illuminate the different moral economies at work in the life sciences in the late twentieth century. They offer perspective on the recent rise of public access publishing and data sharing in science. More broadly, this essay challenges the big picture according to which the rise of experimentalism led to the decline of natural history in the twentieth century. It argues that both traditions have been articulated into a new way of producing knowledge that has become a key practice in science at the beginning of the twenty-first century.

It's well worth a read. It argues that sequence databases such as GenBank are essentially the equivalent of the great natural history museums of the 19th century. There are several ironies here. One is that some early advocates of molecular biology cast it as a modern, experimental science, as opposed to mere natural history. However, once the amount of molecular data became too great for individuals to easily manage, and once it became clear that many of the questions being asked required a comparative approach, the need for a centralised database of sequences (the "experimenter's museum" of the title of the paper) became increasingly urgent. Another irony is that the clash between molecular and morphological taxonomy overlooks these striking similarities in history (collecting ever-increasing amounts of data eventually requiring centralisation).

Bruno Strasser's article also discusses the politics behind setting up GenBank, including the inevitable challenge of securing funding, and the concerns of many individual scientists about the loss of control over their data. A final irony is that, having gone through this process once with the formation of the big museums in the 19th century, we are going through it again with the wrangling over aggregating the digitised versions of the content of those museums.

Update: See also
Strasser, B. J. (2008, October 24). GENETICS: GenBank--Natural History in the 21st Century? Science. American Association for the Advancement of Science (AAAS). doi:10.1126/science.1163399
(via Guanyang Zhang).

Friday, April 04, 2014

More on annotating biodiversity data: beyond sticky notes and wikis

Following on from the previous post Rethinking annotating biodiversity data, here are some more thoughts on annotating biodiversity data.

Annotations as sticky notes


I get the sense that most people think of annotations as "sticky notes" that someone puts on data. In other words, the data is owned by somebody, and anyone who isn't the owner gets to make comments, which the owner is free to use or ignore as they see fit. With this model, the focus is on how the owner deals with the annotations, and how they manage the fact that their data may have changed since the annotations were made.

This model has limitations. For a start, it privileges the "owner", and puts annotators at their mercy. For example, I posted an issue regarding a record in the Museum of Comparative Zoology Herpetology database (see https://github.com/mcz-vertnet/mcz-subset-for-vertnet/issues/1). VertNet has adopted GitHub to manage annotations of collection data, which is nice, but it only works if there's someone at the other end ready to engage with people like me who are making annotations. I suspect this is mostly not going to be the case, so why would I bother annotating the data? Yes, I know that VertNet has only just set this up, but that's missing the point. Supporting this model requires customer support, and who has the resources for that? If I don't get the sense that someone is going to deal with my annotation, why bother?

So, the issues here are that the owner gets all the rights, the annotators have none, and in practice the owners might not be in a position to make use of the annotations anyway.

Wikis


OK, if the owner/annotator model doesn't seem attractive, what about wikis? Let's put the data on a wiki and let folks edit it, that'll work, right? There's a lot to be said in favour of wikis, but there's a disadvantage to the basic wiki model. On a wiki, there is one page for an item, and everyone gets to edit that same page. The hope is that a consensus will emerge, but if it doesn't then you get edit wars (e.g., When taxonomists wage war in Wikipedia). If you've made an edit, or put your data on a wiki, anyone can overwrite it. Sure, you can roll back to an earlier version, but so can anyone else.

Wikis bring tools for community editing, but overturn ownership completely, so the data owner, or indeed any individual annotator, has no control over what happens to their contributions. Why would an expert contribute if someone else can undo all their hard work?

Social data


So, if sticky notes and wikis aren't the solution, what is? I've been looking at Fluidinfo recently. There's an interview here, and a book here. The company has gone quiet lately (apparently focussing on enterprise customers), but what matters here is the underlying idea, namely "social data".

Fluidinfo's model is that it is a database of objects (representing things or concepts), and anyone can add data to those objects (they are "openly writable"). The key is that every tag is linked to the user, and by default you can only add, edit, or delete your own tags. This means that if a data provider adds, say a bibliographic reference to the database, I can edit it by adding tags, but I can't edit the data provider's tags. To make this a bit more concrete, suppose we have a record for the article with the DOI 10.1163/187631293X00262. We can represent the metadata from CrossRef like this:

{
"_id": "10.1163/187631293X00262",
"crossref/doi" : "10.1163/187631293X00262",
"crossref/title" : "A taxonomic review of the pondskater...",
"crossref/journal" : "Insect Systematics & Evolution",
"crossref/issn" : [ "1399-560X", "1876-312X"]
}

Note the use of the namespace "crossref" in the tags. This is data that, notionally, CrossRef "owns" and can edit, and nobody else. Now, as I've discussed earlier (Orwellian metadata: making journals disappear) some publishers have an annoying habit of retrospectively renaming journals. This article was published in Entomologica Scandinavica, which has since been renamed Insect Systematics & Evolution, and CrossRef gives the latter as the journal name for this article. But most citations to the article will use the old journal name. Under the social data model, I can add this information (the two "rdmpage" tags below):

{
"_id": "10.1163/187631293X00262",
"crossref/doi" : "10.1163/187631293X00262",
"crossref/title" : "A taxonomic review of the pondskater...",
"crossref/journal" : "Insect Systematics & Evolution",
"crossref/issn" : ["1399-560X", "1876-312X"],
"rdmpage/journal" : "Entomologica Scandinavica","rdmpage/issn" : ["0013-8711" ]
}

My tags have the namespace "rdmpage", so they are "mine". I haven't overwritten the "crossref" tags. Somebody else could add their own tags, and of course, CrossRef could update their tags if they wish. We can all edit this object, we don't need permission to do so, and we can rest assured that our own edits won't be overwritten by somebody else.

This model can be quite liberating. If you are a data provider/owner, you don't have to worry about people trampling over your data, because you (and any users of your data) can simply ignore tags not in your namespace ("ignore those 'rdmpage' tags, that Rod Page chap is clearly a nutter"). Annotators are freed from their reliance on data providers doing anything with the annotations they have created. I don't care whether CrossRef decides to revert the journal name Insect Systematics & Evolution to Entomologica Scandinavica for earlier articles (or not); I can just use the "rdmpage/journal" tag (if it exists) to get what I think is the appropriate journal name. My annotations are immediately usable. Because everyone gets to edit in their own namespace, we don't need to form a consensus, so we don't need the version control features of wikis to enable rollbacks, and there are (almost) no more edit wars.
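To see why annotations are "immediately usable", here is a tiny sketch of how a consumer of this object might read it, preferring my tags where they exist and falling back to CrossRef's otherwise (the object layout follows the examples above; the helper function is mine):

// Prefer tags from one namespace, falling back to another
var article = {
  "crossref/journal" : "Insect Systematics & Evolution",
  "rdmpage/journal" : "Entomologica Scandinavica"
};

function getTag(obj, tag, preferred, fallback) {
  var key = preferred + '/' + tag;
  return (key in obj) ? obj[key] : obj[fallback + '/' + tag];
}

getTag(article, 'journal', 'rdmpage', 'crossref');
// → "Entomologica Scandinavica" (or CrossRef's value if I haven't annotated it)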

Implementation


A key feature of the Fluidinfo social data model is that the data is stored in a single, globally accessible place. Hence we need a global annotation store. Fluidinfo itself doesn't seem to have a publicly accessible database, I guess in part because managing one is a major undertaking (think Freebase). Despite Nicholas Tollervey's post (FluidDB is not CouchDB (and FluidDB's secret sauce)), I think CouchDB is exactly the way I'd want to implement this (it's here, it works, and it scales). The "secret sauce" is essentially application logic (every key has a namespace corresponding to a given user).
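For what it's worth, here is a sketch of what that application logic could look like in CouchDB: a validate_doc_update function that rejects any change to keys outside the user's own namespace. This is my guess at how Fluidinfo-style permissions might be enforced, not how Fluidinfo actually does it:

// CouchDB validate_doc_update: a user may only add, change, or delete
// keys in their own namespace (e.g. user "rdmpage" owns "rdmpage/*")
function (newDoc, oldDoc, userCtx) {
  var key;
  for (key in newDoc) {
    if (key.charAt(0) === '_') continue;            // skip CouchDB internals
    // naive value comparison; fine for a sketch
    if ((!oldDoc || oldDoc[key] !== newDoc[key]) &&
        key.split('/')[0] !== userCtx.name) {
      throw({ forbidden: 'You can only edit tags in your own namespace' });
    }
  }
  for (key in (oldDoc || {})) {
    if (key.charAt(0) === '_') continue;
    if (!(key in newDoc) && key.split('/')[0] !== userCtx.name) {
      throw({ forbidden: 'You cannot delete tags from another namespace' });
    }
  }
}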

The more I think about this model the more I like it. It could greatly simplify the task of annotating biodiversity data, and avoid what I fear are going to be the twin dead ends of sticky note annotation and wikis.

Monday, March 31, 2014

Rethinking annotating biodiversity data

TL;DR By using bookmarklets and a central annotation store, we can build a system to annotate any biodiversity database, and display those annotations on those databases.

A couple of weeks ago I was at a GBIF meeting in Copenhagen, and there was a discussion about adding a new feature to the GBIF portal. The conversation went something like this:

Advisor: "We really need this feature, now!"

Developer: "OK, but which of these other things you've told us we need to do should we stop doing, so we can add this new feature?"

Resources are limited, and adding new features to a project can be difficult. This got me thinking about the issue of annotating data in GBIF and other biodiversity projects. There have been a number of recent papers on annotating biodiversity data, such as:

Morris, R. A., Dou, L., Hanken, J., Kelly, M., Lowery, D. B., Ludäscher, B., Macklin, J. A., et al. (2013, November 4). Semantic Annotation of Mutable Data. (I. N. Sarkar, Ed.) PLoS ONE. Public Library of Science (PLoS). doi:10.1371/journal.pone.0076093
Tschöpe, O., Macklin, J. A., Morris, R. A., Suhrbier, L., & Berendsohn, W. G. (2013, December 20). Annotating biodiversity data via the Internet. Taxon. International Association for Plant Taxonomy (IAPT). doi:10.12705/626.4

It seems to me that these potentially suffer from the assumption that data aggregators such as GBIF, and data providers such as natural history collections, have sufficient resources in place to (a) implement such systems, and (b) process the annotations made by the community and update their records. What if neither assumption holds true?

Everyone is busy


Any system which requires a project to add another feature is going to have to compete with other priorities. I ran into this with my BioNames project, which was partly funded by EOL. BioNames links taxonomic names for animals (obtained from ION) to the primary literature; for example, Pinnotheres atrinicola was published in the following paper:

Page, R. D. M. (1983). Description of a new species of Pinnotheres, and redescription of P. novaezelandiae (Brachyura: Pinnotheridae). New Zealand Journal of Zoology, 10(2), 151–162. doi:10.1080/03014223.1983.10423904

Ideally, all the links between names and publications that I'd assembled in BioNames would have been added to EOL, so that (wherever possible) users of EOL could see the original description of a taxon in EOL. But this didn't happen. In order to get BioNames into EOL I had to export the data in Darwin Core format, which is poorly suited to this kind of data. It also became clear that BioNames and EOL had rather different data models when it came to taxa, names, and publications. This meant it was going to be a challenge providing the data in a way that was usable by EOL. Plus, EOL was pretty busy doing other things such as developing TraitBank™ (yes, that's a "™" after TraitBank). So, I never did get BioNames content into EOL.

But there's another way to do this.

The Web means never having to ask for permission


It occurred to me (around about the time that I was at the pro-iBiosphere hackathon at Leiden) that there's another way to tackle this, a way which uses bookmarklets. Bookmarklets are little snippets of Javascript that can be stored as bookmarks in your web browser, and they can add extra functionality to an existing web page. You may well have come across these already, such as Save to Mendeley, or Altmetric it.

How does this help us with annotation? Well, with a little programming, you can add features that you think are "missing" from a web page, and you don't need to ask anyone's permission to do it. So, I could negotiate with EOL about how to get data from BioNames into EOL, or I can simply do this:

[Screenshot: bookmarklet popup showing the original publication on an EOL taxon page]

What I've done here is create a bookmarklet that recognises that you are looking at an EOL page; it then calls the BioNames API and displays the original publication of the taxon shown on the page (in this case, Pinnotheres atrinicola). So, I've added the information from BioNames to the EOL page, without needing EOL to do anything.
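Stripped to its essentials, such a bookmarklet is only a few lines of Javascript. Here is a sketch (the API URL and response fields are invented placeholders, not the real BioNames API; grab the actual bookmarklet from the link below):

javascript:(function () {
  // 1. Are we on an EOL taxon page? If so, grab the taxon id from the URL
  var m = window.location.href.match(/eol\.org\/pages\/(\d+)/);
  if (!m) { alert('Not an EOL taxon page'); return; }

  // 2. Call an external API via JSONP (bookmarklets run inside the host page)
  window.showPublication = function (data) {
    var div = document.createElement('div');
    div.style.cssText = 'position:fixed;top:10px;right:10px;background:#fff;'
      + 'border:1px solid #ccc;padding:1em;z-index:10000';
    div.innerHTML = data.publication;   // e.g. the original description
    document.body.appendChild(div);
  };
  var script = document.createElement('script');
  script.src = 'http://example.org/api/eol/' + m[1] + '?callback=showPublication';
  document.body.appendChild(script);
})();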

But it gets better. We can do this with pretty much any web page. The example above displays the original publication of a taxon name, but imagine we are looking at the publisher's page for that article (you can see it here: http://dx.doi.org/10.1080/03014223.1983.10423904). Wouldn't it be nice if the publisher knew that this paper described a new species of crab? We could negotiate with the publisher about how to give them that information, and how they could display it, or we can just add it:

[Screenshot: bookmarklet popup listing new names on a publisher's article page]

This time the bookmarklet recognises that the web page has a DOI, then asks BioNames whether any names have been published in the paper with that DOI; if it finds any, they are displayed in the popup.
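Finding the DOI is the only page-specific part, and many publishers embed it in a standard meta tag, so detection can be as simple as this sketch (with a regex fallback over the page text):

// Find a DOI on the current page: try the widely used citation_doi
// meta tag first, then scan the page text for a DOI-shaped string
function findDOI() {
  var meta = document.querySelector('meta[name="citation_doi"]');
  if (meta) return meta.getAttribute('content');
  var m = document.body.textContent.match(/\b10\.\d{4,9}\/[^\s"<>]+/);
  return m ? m[0] : null;
}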

Bookmarklets enable you to enhance a web page with any information you like. This makes them ideal for displaying annotations on a page. If you want to try yourself, you can grab the bookmarklet from here.

Making annotations visible


Bookmarklets can be used to solve one part of the annotation problem, namely showing existing annotations. I have lots of examples of errors in datasets: I blog about some of these, I store some in Evernote for future reference, some end up in unfinished manuscripts, and so on. The problem is that these annotations are of little use to anyone else, because if you go to GBIF you don't see my annotations (or, indeed, anyone else's). But we can use a bookmarklet to display these, without having to pester GBIF themselves to add this feature! Imagine a bookmarklet that you could click to see whether anyone has queried the identification, or the location, of a specimen.

Where do the annotations come from?


Of course, all this presupposes that we have annotations to start with. I think there are at least two classes of annotations. The first, most obvious annotations are ones that change or add attributes to an object. For example, adding latitude and longitude coordinates to a specimen. These are annotations we would want to display just on the corresponding page in the source database (e.g., displaying a map in the annotation popup on GBIF for a record we've georeferenced).

The second class comprises cross-links between data sets. For example, linking a species in EOL to the DOI of the publication that first described that species. Or linking a specimen in GBIF to the sequences in GenBank that were obtained from that specimen. These annotations are different in that we might want to display them on multiple web pages (e.g., pages served by both a biodiversity database and an academic publisher). From this perspective, a database like BioNames is essentially a big store of annotations.

But we need more than this: we need to be able to annotate any class of data that is relevant to biodiversity. We need to be able to edit erroneous GBIF records, flag GenBank sequences that have been misidentified, document taxonomic names that are entirely spurious, and so on. And we need to make these annotations available via APIs so that anyone can access them. To me, it seems obvious that we need a single, centralised annotation store.

A global annotation store


One way to implement an annotation store would be to create a wiki-style database that the community could edit. This database gets populated with data that can then be edited, refined, and discussed. For example, imagine a GBIF user spots an occurrence that is clearly wrong (a frog in the middle of the ocean). They could have a bookmarklet that they click on, and it displays any existing annotations of that record. If there aren't any, let's imagine there is a link to the annotation store. Clicking on that creates a record for that occurrence, which the user then edits. Perhaps they discover that the latitude and longitude have been swapped, so they swap them back, and save the record. The next person to go to that page in GBIF clicks on their bookmarklet and discovers that there is a potential issue with that record (the popup displayed by the bookmarklet will have a "warning symbol", and an updated map).
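Under the hood, an annotation could be a simple document keyed by the thing being annotated. Here is a hypothetical record for the swapped-coordinates example (the schema, identifier, and values are all invented for illustration):

// Hypothetical annotation record for a GBIF occurrence with swapped coordinates
var annotation = {
  target: 'gbif/occurrence/123456789',   // invented identifier for the record
  author: 'rdmpage',
  created: '2014-03-31',
  issue: 'Latitude and longitude appear to be swapped (frog in the ocean)',
  original:  { decimalLatitude: 174.76, decimalLongitude: -36.85 },
  corrected: { decimalLatitude: -36.85, decimalLongitude: 174.76 },
  evidence: 'Corrected values fall on land, within the species\' known range'
};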

Some annotations will be simple, some may require some analysis. For example, a claim that a GenBank sequence has been misidentified would be stronger if it was backed up by a BLAST analysis demonstrating that the sequence clustered with taxa that you would not expect, given its putative identification.

We can also annotate in bulk, and upload these annotations directly to the annotation store. For example, we could map GBIF taxa to taxonomic name identifiers from nomenclators such as ION, ZooBank, IPNI, Index Fungorum, etc., then map those identifiers to the primary literature, and upload all of that data to the annotation store, making it available to anyone visiting GBIF (or, indeed, the nomenclators). We could BLAST DNA barcode sequences and suggest potential identifications. We could add lists of publications that cite museum specimen codes, and display those on the GBIF page that corresponds to each code. There is almost no limit to the richness of annotations we could add to existing web pages.
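Mechanically, bulk annotation is then just a loop that builds records like the one above and POSTs them to the store's API (the endpoint here is hypothetical):

// Bulk-upload annotations to a (hypothetical) central annotation store
var annotations = [ /* records like the example above */ ];

annotations.forEach(function (a) {
  fetch('http://example.org/api/annotations', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(a)
  }).then(function (response) {
    if (!response.ok) console.error('Failed to store annotation for', a.target);
  });
});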

Filtered push


One aspect of annotation that I've glossed over is how the annotations get back to the primary data providers. There has been some work on this (see the papers cited at the start), but in a sense I don't think this is the most pressing problem (in part because I suspect most providers are in no position to undertake the kind of data editing and cleaning required). My concern is at the other end of the process. Users of biodiversity data are frequently presented with data that is demonstrably erroneous, and it inconveniences them, as well as hurting the reputation of aggregators such as GBIF, or databases such as GenBank. Anyone doing an analysis of these sorts of data will spend some time cleaning and correcting them; we desperately need mechanisms to capture these annotations and make them available to other users. The extent to which these annotations filter back to the primary data providers is, in my view, a less pressing issue.

That said, a central annotation store would have lots of advantages for primary providers. It's one place to go to get annotations. The fate of a user's edits could help develop metrics of reliability of annotations, and so on.

Summary


The reason I find this approach attractive is that it frees us from having to wait for projects like GBIF and GenBank to support annotations. We don't need to wait; we can simply do it ourselves, right now. We can add overlays that augment existing data (e.g., adding original publications to EOL web pages), or flag errors. Take the example bookmarklet from here for a spin and see what it can do. It's very crude, but I think it gives an indication of the potential of this approach.

So, "all" we need is a centralised, editable, database of annotations that we can hook the bookmarklet into. Simples.

Thursday, March 13, 2014

Publishing biodiversity data directly from GitHub to GBIF

[Image: Google Earth view of the uploaded occurrences]
Today I managed to publish some data from a GitHub repository directly to GBIF. Within a few minutes (and with Tim Robertson on hand via Skype to debug a few glitches) the data was automatically indexed by GBIF and its maps updated. You can see the data I uploaded here.

The data I uploaded came from this paper:

Shapiro, L. H., Strazanac, J. S., & Roderick, G. K. (2006, October). Molecular phylogeny of Banza (Orthoptera: Tettigoniidae), the endemic katydids of the Hawaiian Archipelago. Molecular Phylogenetics and Evolution. Elsevier BV. doi:10.1016/j.ympev.2006.04.006
This is the data I used to build the geophylogeny for Banza using Google Earth. Prior to uploading this data, GBIF had no georeferenced localities for these katydids; now it has 21 occurrences.

[Screenshot: the new dataset in the GBIF portal]

How it works

I give details of how I did this in the GitHub repository for the data. In brief, I took data from the appendix in the Shapiro et al. paper and created a Darwin Core Archive in a repository in GitHub. Mostly this involved messing with Excel to format the data. I used GBIF's registry API to create a dataset record, pointed it at the GitHub repository, and let GBIF do the rest. There were a few little hiccups, such as needing to tweak the meta.xml file that describes the data, and GBIF's assumption that specimens are identified by the infamous "Darwin Core Triplet", which meant I had to invent one for each occurrence; but other than that it was pretty straightforward.
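Roughly, the registration step looks like the sketch below. The GBIF registry does live at api.gbif.org/v1, but I've simplified the payloads, and real calls need HTTP basic auth plus valid organisation and installation keys:

// Sketch: register a dataset with the GBIF registry API, then point it
// at a Darwin Core Archive hosted on GitHub so GBIF's crawler can fetch it
var GBIF = 'http://api.gbif.org/v1';

// 1. Create the dataset record; the registry responds with the dataset's UUID
fetch(GBIF + '/dataset', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    type: 'OCCURRENCE',
    title: 'Georeferenced specimens of Banza (Orthoptera: Tettigoniidae)',
    publishingOrganizationKey: '<organisation-uuid>',
    installationKey: '<installation-uuid>'
  })
}).then(function (response) { return response.json(); })
  .then(function (datasetKey) {
    // 2. Add an endpoint telling GBIF's crawler where the archive lives
    return fetch(GBIF + '/dataset/' + datasetKey + '/endpoint', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        type: 'DWC_ARCHIVE',
        url: 'https://github.com/rdmpage/.../dwca.zip' // archive in the repo
      })
    });
  });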

I've talked about using GitHub to help clean up Darwin Core Archives from GBIF, and VertNet are using GitHub as an issue tracker, but what I've done here differs in one crucial way. I'm not just grabbing a file from GBIF and showing that it is broken (with no way to get those fixes to GBIF), nor am I posting bug reports for data hosted elsewhere and hoping that someone will fix it (like VertNet); what I'm doing here is putting data on GitHub and having GBIF harvest that data directly from GitHub. This means I can edit the data, rebuild the Darwin Core Archive file, push it to GitHub, and GBIF will reindex it and update the data on the GBIF portal.

Beyond nodes

GBIF's default publishing model is a federated one. Data providers in countries (such as museums and herbaria) digitise their data and make it available to national aggregators ("nodes"), which typically host a portal with information about the biodiversity of that nation (the Atlas of Living Australia is perhaps the most impressive example). These nodes then make the data available to GBIF, which provides a global portal to the world's biodiversity data (as opposed to national-level access provided by nodes).

This works well if you assume that most biodiversity data is held by national natural history collections, but this is debatable. There are other projects, some of them large and not necessarily "national", that have valuable data. These projects can join GBIF and publish their data. But what about all the data that is held in other databases (perhaps not conventionally thought of as biodiversity databases), or the huge amount of information in the published literature? How does that get into GBIF? People like me data mine the literature for information on specimens and localities, such as this map of localities mentioned in articles in BioStor. How do we get that data into GBIF?

[Image: map of localities mentioned in articles in BioStor]

Data publishing

Being able to publish data directly to GBIF makes putting the effort into publishing data seem less onerous, because I can see it appear in GBIF within minutes. Twenty-one records of katydids is clearly a drop in the ocean, but there is potentially vastly more data waiting to be mined. Managing the data on GitHub also makes the whole process of data cleaning and editing transparent. As ever, there are a couple of things that still need to be tackled.

It's who you know

I've been able to do this because I have links with GBIF, and they have made the (hopefully reasonable) assumption that I'm not going to publish just any old crap to GBIF. I still had to get "endorsed" by the UK node (being the chair of the GBIF Science Committee probably helped), and I'm lucky that Tim Robertson was online at the time and guided me through the process. None of this is terribly scalable. It would be nice if we had a way to open up GBIF to direct publishing, but with a review process built in (even if it's post-review, so that data may have to be pulled if it becomes clear it's problematic). Perhaps this could be managed via GitHub; for example, data could be uploaded and managed there, and GBIF could then choose to pull that repository, and the data would appear on GBIF. Another model is something like the Biodiversity Data Journal, but that doesn't (as far as I know) have a direct feed into GBIF.

Whichever approach we take, we need a simple, frictionless way to get data into GBIF, especially if we want to tackle the obvious geographic and taxonomic biases in the data GBIF currently has.

DOIs please

It would be great if I could get a DOI for this dataset. I had toyed with putting it on Figshare, which would give me a DOI, but that just puts an additional layer between GitHub and GBIF. Ideally, instead of (or as well as) the UUID I get from GBIF to identify the dataset, I'd get a DOI that others can cite, and which would appear on my ORCID profile. I'd also want a way to link the data DOI to the DOI for the source paper (doi:10.1016/j.ympev.2006.04.006), so that citations of the data can pass some of that "link love" to the original authors. So, GBIF needs to mint DOIs for datasets.

Summary

The ability to upload data to GitHub and then have that harvested by GBIF is really exciting. We get great tools for managing changes in data, with a simple publication process (OK, simple if you know Tim, and can speak REST to the GBIF API). We are getting closer to easy publishing and, just as importantly, easy editing and correcting of data.