Thursday, May 26, 2016

Thoughts on Wikipedia, Wikidata, and the Biodiversity Heritage Library

Given that Wikipedia, Wikidata, and the Biodiversity Heritage Library (BHL) all share the goal of making information free, open, and accessible, there seems to be a lot of potential for useful collaboration. Below I sketch out some ideas.

BHL as a source of references for Wikipedia

Wikipedia likes to have sources cited to support claims in its articles. BHL has a lot of articles that could be cited by Wikipedia articles. By adding these links, Wikipedia users get access to further details on the topic of interest. BHL also benefits from greater visibility resulting from visits from Wikipedia readers.

In the short term BHL could search Wikipedia for articles that could benefit from links to BHL (see below). In the long term as more and more BHL articles get DOIs this will become redundant as Wikipedia authors will discover articles via CrossRef.

There are various ways to search Wikipedia to get a sense of what links could be added. For example, you can search the Wikipedia API for pages that link to a particular web domain (see https://www.mediawiki.org/wiki/API:Lists/All#Exturlusage). Here's a search for articles linking to biostor.org https://en.wikipedia.org/w/api.php?action=query&list=exturlusage&euquery=biostor.org&eulimit=20.
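
As a rough sketch of how that query could be scripted, the snippet below builds the same kind of `exturlusage` URL and groups the results by page. The helper names are my own, and the sample response is hand-written (made-up titles and URLs) in the shape the API returns, so nothing here touches the network.

```python
from urllib.parse import urlencode

WIKIPEDIA_API = "https://en.wikipedia.org/w/api.php"


def exturlusage_url(domain, limit=20):
    """Build an API URL listing pages that link to the given web domain."""
    params = {
        "action": "query",
        "list": "exturlusage",
        "euquery": domain,
        "eulimit": limit,
        "format": "json",
    }
    return WIKIPEDIA_API + "?" + urlencode(params)


def linking_pages(response):
    """Group external URLs by the title of the Wikipedia page that uses them."""
    pages = {}
    for hit in response.get("query", {}).get("exturlusage", []):
        pages.setdefault(hit["title"], []).append(hit["url"])
    return pages


# Hand-written response in the shape the API returns (titles and URLs are made up).
sample = {"query": {"exturlusage": [
    {"pageid": 1, "ns": 0, "title": "Example taxon", "url": "http://biostor.org/reference/1"},
    {"pageid": 1, "ns": 0, "title": "Example taxon", "url": "http://biostor.org/reference/2"},
]}}

print(exturlusage_url("biostor.org"))
print(linking_pages(sample))
```

Pages with several BioStor or BHL links would be obvious first candidates for checking whether their other citations could be linked too.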

A quick inspection suggests that many of these links could be improved (for example, some have outdated links to PDFs and not to the article), so we can locate Wikipedia articles that could be edited. It is likely that Wikipedia articles that have one link to BHL or BioStor may have other citations that could be linked.

Wikipedia as a source of content

One of the big challenges facing BHL is extracting articles from its content. My own BioStor is one approach to tackling this problem. BioStor takes citation details for articles and attempts to locate them in BHL - the limiting factor is access to good-quality citation data. Wikipedia is potentially an untapped source of citation data. Each page that uses the "Cite" template could be mined for citations, which in turn could be used to locate articles. Wikipedia pages using the Cite template can be found via the API, e.g. https://en.wikipedia.org/w/api.php?action=query&list=embeddedin&eititle=Template:Cite&eilimit=20&format=json. Alternatively, we could mine particular types of pages (e.g., those on taxa or taxonomists), or mine Wikispecies (which doesn't use the same citation formatting as Wikipedia).
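
To give a feel for what mining those templates might involve, here is a deliberately naive sketch that pulls the fields out of `{{cite ...}}` templates in a page's wikitext. The function name is hypothetical, and the regular expression assumes no nested templates or pipes inside field values, which real pages often break; a proper wikitext parser would be needed in practice. The sample uses the Jenkins and Robinson paper cited elsewhere on this blog.

```python
import re

# Matches {{cite journal | ... }} and similar; does not handle nested templates.
CITE_RE = re.compile(r"\{\{\s*cite\s+(\w+)\s*\|([^{}]*)\}\}", re.IGNORECASE)


def extract_citations(wikitext):
    """Return one dict of field name -> value per cite template found."""
    citations = []
    for match in CITE_RE.finditer(wikitext):
        fields = {"type": match.group(1).lower()}
        for part in match.group(2).split("|"):
            key, sep, value = part.partition("=")
            if sep and value.strip():
                fields[key.strip().lower()] = value.strip()
        citations.append(fields)
    return citations


sample = (
    "A new species was described in 2002.<ref>{{cite journal |last1=Jenkins "
    "|first1=P. D. |last2=Robinson |first2=M. F. |year=2002 "
    "|title=Another variation on the gymnure theme "
    "|journal=Bulletin of The Natural History Museum. Zoology Series "
    "|doi=10.1017/S0968047002000018}}</ref>"
)

for citation in extract_citations(sample):
    print(citation["journal"], citation["year"], citation.get("doi"))
```

The journal, year, and title fields are the sort of citation details that BioStor could then use to try to locate the article in BHL.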

Wikidata as a data store

If Wikidata aims to be a repository of all structured data relevant to Wikipedia, then this includes bibliographic citations (see WikiCite 2016), hence many articles in BHL will end up in Wikidata. This has some interesting implications, because Wikidata can model data with more fidelity than many other sources of bibliographic information. For example, it supports multiple languages as well as multiple representations of the same language - the journal Acta Herpetologica Sinica https://www.wikidata.org/wiki/Q24159308 in Wikidata has not only the Chinese title (兩棲爬行動物學報) but the pinyin transliteration "Liangqi baxing dongwu yanjiu". Rather than attempt to replicate a community-editable database, Wikidata could be the place to manage article and journal-related metadata.

Disambiguating people

As we move from "strings to things" we need to associate names for things with identifiers for those things. I've touched on this already in Possible project: mapping authors to Wikipedia entries using lists of published works. Ideally each author in BHL would be associated with a globally unique identifier, such as ORCID or ISNI. Contributors to Wikipedia and Wikidata have been collecting these for individuals with Wikipedia articles. If those Wikipedia pages have links to BHL content then we can semi-automate the process of linking people to identifiers.

Caveats

There are a couple of potential "gotchas" concerning Wikipedia and BHL. The licenses used for content are different: BHL is typically CC-BY-NC whereas Wikipedia is CC-BY-SA. The "non commercial" restriction used by BHL is a deal-breaker for sharing content such as page images with Wikicommons.

Wikipedia and Wikidata are communities, and I've often found this makes it challenging to find out how to get things done. Who do you contact to make a decision about some new feature you'd like to add? It's not at all obvious (unless you're a part of that community). Existing communities with accepted practices can be resistant to change, or may not be convinced that what you'd like to do is a benefit. For example, I think it would be great to have a Wikipedia page for each journal. Not everyone agrees with this, and one can expend a lot of energy debating the pros and cons. The last time I got seriously engaged with Wikipedia I ended up getting so frustrated I went off in a huff and built my own wiki. This is where a "Wikipedian in residence" might be helpful.

Notes on current and future projects

I'll be taking a break shortly, so I thought I'd try to gather some thoughts on a few projects/ideas that I'm working on. These are essentially extended notes to myself to jog my memory when I return to these topics.

BOLD data into GBIF

Following on from work on getting mosquito data into GBIF I've been looking at DNA barcoding data. BOLD data is mostly absent from GBIF. The publicly available data can be downloaded, and is in a form that could be easily ingested by GBIF. One problem is that the data is incomplete, and sometimes out of date. BOLD's data dumps and BOLD's API use different formats (sigh), and the API returns additional data such as images of voucher specimens. Most data in the data dumps are not identified to species, so they will have limited utility for most GBIF users.

One approach would be to take the data dumps as the basic data, then use the API to enhance that data, such as adding image links. If the API returns a species-level identification for a barcode then that could be added as an identification using the corresponding Darwin Core extension. In this way we could treat the data as an evolving entity, which it is as our knowledge of it improves. For a related example see Leafcutter bee new to science with specimen data on Canadensys where Canadensys records two different identifications of some bee specimens as research showed that some specimens represented a new species.
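
A minimal sketch of that merge, assuming the dump and the API have each been reduced to simple dictionaries. The input field names (`processid`, `taxon`, `species`), the function name, and the example record are all made up for illustration and do not reflect BOLD's actual formats; `occurrenceID` and `scientificName` are Darwin Core terms.

```python
def merge_identifications(dump_record, api_record):
    """Keep the dump's identification and append a differing one from the API."""
    identifications = [
        {"scientificName": dump_record["taxon"], "source": "data dump"}
    ]
    api_name = api_record.get("species")
    if api_name and api_name != dump_record["taxon"]:
        # A newer identification is added alongside the old one, not over it.
        identifications.append({"scientificName": api_name, "source": "API"})
    return {
        "occurrenceID": dump_record["processid"],
        "identifications": identifications,
    }


# Invented example: the dump only knows the order, the API knows the species.
dump = {"processid": "EXAMPLE001-16", "taxon": "Lepidoptera"}
api = {"processid": "EXAMPLE001-16", "species": "Danaus plexippus"}

print(merge_identifications(dump, api))
```

Keeping every identification, rather than overwriting the old one, is what lets the record show how our knowledge of a specimen has changed.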

This work reflects my concern that GBIF is missing a lot of data outside its normal sources. The mechanism for getting data into GBIF is pretty bureaucratic and could do with reforming (or, at least provision of other ways to add data).

BOLD by itself

I've touched on this before (Notes on next steps for the million DNA barcodes map). I'd really like to do something better with the way we display and interact with DNA barcode data. This will need some thought on calculating and visualising massive phylogenies, and spatial queries that return subtrees. I can't help thinking that there's scope for some very cool things in this area. If nothing else, we can do interesting things without getting involved in some of the pain of taxonomic names.

Big trees

Viewing big trees is still something of an obsession. I still think this hasn't been solved in a way that helps us learn about the tree and the entities in that tree. I currently think that a core problem to solve is how to cluster or "fold" a tree in a sensible way to highlight the major groups. I did something rather crude here; other approaches include "Constructing Overview + Detail Dendrogram-Matrix Views" (doi:10.1109/TVCG.2009.130, PDF here).

Graph databases and the biodiversity knowledge graph

I'm working on a project to build a "biodiversity knowledge graph" (see doi:10.3897/rio.2.e8767). In some ways this is recreating my entry in Elsevier's Grand Challenge "Knowledge Enhancement in the Life Sciences" (see also hdl:10101/npre.2009.3173.1 and doi:10.1016/j.websem.2010.03.004).

Currently I'm playing with Neo4J to build the graph from JSON documents stored in CouchDB. Learning Neo4J is taking a little time, especially as I'm creating nodes and edges on the fly and want to avoid creating more than one node for the same thing. In a world of multiple identifiers this gets tricky, but I think there's a reasonable way to do this (see the graph gist Handling multiple identifiers). Since I'm harvesting data I'm ending up building a web crawler, so I need to think about queues, and ways to ensure that data added at different times gets properly linked.
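
One generic way to think about the "one node per thing" problem, independent of Neo4J and not the approach in the gist, is a union-find structure over identifiers: whenever two identifiers are asserted to refer to the same thing, they are merged into one group, and the group's representative is used as the node key. The class name and the identifiers below are invented for illustration.

```python
class IdentifierIndex:
    """Union-find over identifiers so equivalent ones share a single node key."""

    def __init__(self):
        self._parent = {}

    def _find(self, identifier):
        self._parent.setdefault(identifier, identifier)
        root = identifier
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point everything on the path straight at the root.
        while self._parent[identifier] != root:
            next_id = self._parent[identifier]
            self._parent[identifier] = root
            identifier = next_id
        return root

    def same_as(self, a, b):
        """Record that identifiers a and b refer to the same thing."""
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def node_key(self, identifier):
        """Return the key of the single node that should hold this identifier."""
        return self._find(identifier)


index = IdentifierIndex()
# Made-up identifiers for one article, harvested at different times.
index.same_as("doi:10.1234/example", "pmid:00000000")
index.same_as("pmid:00000000", "handle:0000/example")

print(index.node_key("handle:0000/example") == index.node_key("doi:10.1234/example"))
```

Because merges can arrive in any order, this also copes with data added at different times: a later crawl that links two previously separate identifiers simply joins their groups.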

Wikipedia and Wikidata

I occasionally play with Wikipedia and Wikidata, although this is often an exercise in frustration as editing Wikipedia tends to result in edit wars ("we don't do things that way"). Communities tend to be conservative. I'll write up some notes about ways Wikipedia and Wikidata can be useful, especially in the context of the Biodiversity Heritage Library (see also Possible project: mapping authors to Wikipedia entries using lists of published works).

All the names

The database of all taxonomic names remains as elusive as ever. Our field should be deeply embarrassed by this; it's just ridiculous.

My own efforts in this area involve (a) obtaining lists of names, by whatever means available, and (b) augmenting them to include links to the primary literature. I've made some of this work publicly accessible (e.g., BioNames). I'd like all name databases to make their data open, but most are resistant to the idea (some aggressively so).

One approach to this is to simply ignore the whimpering and make the data available. Another is to consider recreating the data. We have name finding algorithms, and more of the literature is becoming available, either completely open (e.g., BHL) or accessible to mining (see Content Mine). At some point we will be able to recreate the taxonomic name databases from scratch, making the obstinate data providers no longer relevant.

First descriptions

Names, by themselves, are not terribly useful. But the information that hangs off them is. It occurs to me that projects like BioNames (and other things I've been working on such as IPNI names) aren't directly tackling this. Yes, it's nice to have a bibliographic citation/identifier for the original description of a name (or subsequent name changes), but what we'd really like is to be able to (a) see that description and (b) have it accessible to machines. So one thing I plan to add to BioNames is to automate going from name to the page with the actual description, and display this information. For many names BioNames knows the page number of the description, and hence its location within the publication. So we need to simply pull out that page (allowing for edge cases where the mapping between digital and physical pages might not be straightforward) and display it (together with text). If we have XML we can also try and locate the descriptions within the text (for some experiments using XSLT see https://github.com/rdmpage/ipni-names/tree/master/pmc ). There's lots of scope for simple text mining here, such as specimen codes (such as type specimens) and associated taxonomic names (synonyms, ecologically associated organisms, etc.).
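
The page lookup could be sketched along these lines, assuming each scanned page carries the printed page number where one was recognised and `None` otherwise. This is a hypothetical helper, not BioNames code: it returns exact matches when they exist, and otherwise guesses unlabelled scans by counting forwards or backwards from labelled ones.

```python
def locate_page(labels, wanted):
    """Return indices of scans likely to show printed page `wanted`.

    labels[i] is the printed page number of scan i, or None if unknown.
    """
    exact = [i for i, label in enumerate(labels) if label == wanted]
    if exact:
        return exact
    # No scan carries the number: extrapolate from every labelled scan and
    # keep the guesses that land on an unlabelled scan.
    candidates = set()
    for i, label in enumerate(labels):
        if label is None:
            continue
        guess = i + (wanted - label)
        if 0 <= guess < len(labels) and labels[guess] is None:
            candidates.add(guess)
    return sorted(candidates)


# Two unnumbered front-matter scans, then pages 1, 2, an unlabelled page, 4, 5.
labels = [None, None, 1, 2, None, 4, 5]

print(locate_page(labels, 4))
print(locate_page(labels, 3))
```

When unnumbered plates or inserts sit between numbered pages the guesses can disagree, in which case the function returns several candidates for a person (or a text match) to choose between.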

Dockerize all the things!

Lots of scope to explore using containers to provide services. Docker Hub provides Elastic Search and Neo4J, and Bitnami can run Elastic Search and CouchDB on Google's cloud. Hence we can play with various tools without having to install them. The other side is creating containers to package various tools (Global Names is doing this), or using containers to package up particular datasets and the tools needed to explore them. So much to learn in this area.

Wednesday, May 11, 2016

Scott Federhen

Awoke this morning to the sad news (via Scott Miller) that Scott Federhen of the NCBI had died. Anyone using the NCBI taxonomy is a beneficiary of Scott's work on bringing together taxonomy and genomic data.

Scott contributed both directly and indirectly to this blog. I reported on some of his work linking taxa in NCBI to sequences from type material (NCBI taxonomy database now shows type material), Scott commented on "dark taxa" and DNA barcoding (e.g., Dark taxa even darker: NCBI pulls (some) DNA barcodes from GenBank (updated)), and was an author on a guest post on "Putting GenBank Data on the Map" (in response to http://dx.doi.org/10.1126/science.341.6152.1341-a). He was also very helpful when I wanted to make links between the NCBI taxonomy and Wikipedia in my iPhylo Linkout project (see http://dx.doi.org/10.1371/currents.RRN1228).

I didn't know Scott well, but always enjoyed chatting to him at meetings (most recently the 6th International Barcode of Life Conference at Guelph). He wasn't shy about putting forth his views, or sharing his enthusiasm for ideas. Indeed, last time we met he was handing out paper copies of his preprint "Replication is Recursion; or, Lambda: the Biological Imperative" (available on bioRχiv http://dx.doi.org/10.1101/018804). He then followed up by sending me a t-shirt with the "replication is recursion" logo printed on one side and some penguins on the other (if I remember correctly this was designed by a member of his family). I delight in baffling students by wearing it sometimes when I lecture.

A number of people in the bioinformatics and biodiversity informatics communities are in shock this morning; this is obviously as nothing compared to what his family must be going through.

Tuesday, May 10, 2016

Notes on next steps for the million DNA barcodes map

Some notes to self about future directions for the "million DNA barcodes map" http://iphylo.org/~rpage/bold-map/.

At the moment we have an interactive map that we can pan and zoom, and click on a marker to get a list of one or more barcodes at the location. We can also filter by major taxonomic group. Here are some ideas on what could be next.

Search

At the moment search is simply browsing the map. It would be handy to be able to enter a taxon or a barcode identifier and go to the corresponding markers on the map.

What is this?

If we have a single DNA barcode I immediately want to know "what is this?" A picture may help, and I may look up the scientific name in BioNames, but perhaps the most obvious thing to do is get a phylogeny for that barcode and similar sequences. These could then be displayed on the map using the technique I described in Visualising Geophylogenies in Web Maps Using GeoJSON (see also http://dx.doi.org/10.1371/currents.tol.8f3c6526c49b136b98ec28e00b570a1e).

So, ideally we would:

  1. Display information about that barcode (e.g., taxonomic identification where known).
  2. Display the local phylogeny of barcodes that contains this barcode.
  3. Display that phylogeny on the map.

Hence we need to be able to generate a local phylogeny of barcodes, either on the fly (retrieve similar sequences then build tree) or using a precomputed global barcode phylogeny from which we pull out the local subtree.

What is there?

A question that the map doesn't really answer is "what is the diversity of a given area?". Yes there are lots of dots, and you can click on them, but what would be nice is the ability to draw a polygon on the map (like this) and get a summary of the phylogenetic diversity of barcodes within that area.

For example, imagine drawing a polygon around Little Barrier Island in New Zealand. Can we effectively retrieve the data published by Drummond et al. (Evaluating a multigene environmental DNA approach for biodiversity assessment DOI:10.1186/s13742-015-0086-1)?

To support "what is there?" queries we need to be able to:

  1. Draw an arbitrary spatial region on the map and retrieve the set of sequences found within that region.
  2. Retrieve the phylogeny for that set of sequences.

Once again, we either need to be able to build a phylogeny for an arbitrary set of sequences on the fly, or extract a subtree. If a global tree is available, we could compute the length of the subtree, and also compute a visual layout fairly easily (essentially with time proportional to the number of sequences).
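
To make the "length of the subtree" idea concrete, here is a small sketch that assumes the global tree is stored as a child-to-parent map with a branch length per child (both the representation and the toy tree are my own assumptions). It sums the branches on the paths joining the selected barcodes, ignoring everything above their most recent common ancestor. This simple version walks from every leaf to the root, so it costs roughly leaves times tree depth rather than being strictly linear in the number of sequences.

```python
def subtree_length(parent, length, leaves):
    """Total branch length of the subtree spanning the given leaves.

    parent maps each non-root node to its parent; length maps each non-root
    node to the length of the branch above it.
    """
    leaves = set(leaves)
    counts = {}
    for leaf in leaves:
        node = leaf
        while node in parent:  # the root has no entry in parent
            counts[node] = counts.get(node, 0) + 1
            node = parent[node]
    # A branch used by every selected leaf lies above their common ancestor,
    # so only branches used by some but not all leaves are counted.
    return sum(length[node] for node, used in counts.items() if used < len(leaves))


# Toy tree: R has children A and B; A has leaves a1 and a2; B has leaf b1.
parent = {"A": "R", "B": "R", "a1": "A", "a2": "A", "b1": "B"}
length = {"A": 1.0, "B": 2.0, "a1": 0.5, "a2": 0.5, "b1": 1.5}

print(subtree_length(parent, length, ["a1", "a2"]))
print(subtree_length(parent, length, ["a1", "b1"]))
```

The set of leaves would come from the spatial query (the barcodes falling inside the drawn polygon), and the resulting number is one simple summary of the phylogenetic diversity of that area.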

We'd also need to decide on the best way to visualise the phylogeny for the set of sequences. Perhaps something like Krona, or something more traditional.

Summary

There doesn't seem to be any way of getting away from the need for a global phylogeny of COI DNA barcodes if I want to extend the functionality of the map.

State of open knowledge about the World's Plants

Kew has released a new report today, entitled the State of the World's Plants, complete with its own web site https://stateoftheworldsplants.com. Its aim:

...by bringing the available information together into one document, we hope to raise the profile of plants among the global community and to highlight not only what we do know about threats, status and uses, but also what we don’t. This will help us to decide where more research effort and policy focus is required to preserve and enhance the essential role of plants in underpinning all aspects of human wellbeing.

This is, of course, a laudable goal, and a lot of work has gone into this report, and yet there are some things about the report that I find very frustrating.

  1. PDF but no ePub. It's nice to have an interactive web site as well as a glossy PDF, but why restrict yourself to a PDF? Why not an ePub so people can view it and rescale fonts for their device, etc.? Why not provide the original text in a form people can translate? The report states that much of the newly discovered plant biodiversity is found in Brazil and China, so why not make it easier to support automatic translation into Portuguese and Chinese?
  2. Why no DOI for the report? If this is such an important document, why doesn't it have a DOI so it can be easily cited?
  3. Why no DOIs for cited literature? The report cites 219 references. Very few of them are accompanied by a DOI, yet most of the references have them. Why not include the DOI so readers can click on that and go straight to the literature? Surely you want to encourage readers to engage with the subject by reading more? The whole point of having digital documents online is that they can link to other documents.
  4. No open access taxonomy. Sadly the examples of exciting new plant species discovered are all in closed access publications, including The Gilbertiodendron species complex (Leguminosae: Caesalpinioideae), Central Africa DOI:10.1007/s12225-015-9579-4 published in Kew's own journal Kew Bulletin. This article costs $39.95 / €34.95 / £29.95 to read. Why do taxonomists continue to publish their research, often about taxa in the developing world, behind paywalls?
  5. Why is the data not open? Much of the section on "Describing the world’s plants" uses data from Kew's database IPNI. This database is not open, so how does the reader verify the numbers in the report? Or, more importantly, how does the reader explore the data further and ask questions not asked in the report?

These may seem like small issues given the subject of the report (the perilous state of much of the planet's biodiversity), but if we are to take seriously the goal of "help[ing] us to decide where more research effort and policy focus is required to preserve and enhance the essential role of plants in underpinning all aspects of human wellbeing" then I suggest that open access to knowledge about plant diversity is a key part of that goal.

Over a decade ago Tom Moritz wrote of the need for a "biodiversity commons": DOI:10.1045/june2002-moritz

Provision of free, universal access to biodiversity information is a practical imperative for the international conservation community — this goal should be accomplished by promotion of the Public Domain and by development of a sustainable Biodiversity Information Commons adapting emergent legal and technical mechanisms to provide a free, secure and persistent environment for access to and use of biodiversity information and data. - "Building the Biodiversity Commons" DOI:10.1045/june2002-moritz

The report itself alludes to the importance of "opening up of global datasets with long-time series (such as maps of forest loss)", and yet botany has been slow to do this for much of its data (see Why are botanists locking away their data in JSTOR Plant Science?). We need data on plant taxonomy, systematics, traits, sequences, and distribution to be open and freely available to all, not closed behind paywalls or limited access APIs. Indeed, Donat Agosti has equated copyright to biopiracy (Biodiversity data are out of local taxonomists' reach DOI:10.1038/439392a).

It would be nice to think that Kew, as well as leading the way in summarising the state of the world's plants, would also be leading the way in making that knowledge about those plants open to all.

Wednesday, April 27, 2016

Possible project: Biodiversity dashboard

Despite the well deserved scepticism about dashboards voiced by Shannon Mattern @shannonmattern (see Mission Control: A History of the Urban Dashboard, I discovered this by reading Ignore the Bat Caves and Marketplaces: lets talk about Zoning by Leigh Dodds @ldodds) I'm intrigued by the idea of a dashboard for biodiversity. We could have several different kinds of information, displayed in a single place.

Immediate information

There are sites such as Global Forest Watch Fires that track events that affect biodiversity and which are happening right now. Some of this data can be harvested (e.g., from the NASA Fire Information for Resource Management System) to show real-time forest fires. Below is an image for the last 24 hours:

We could also have Twitter feeds of these sorts of events.

Historical trends

We could have longer-term trends, such as changes in forest cover, or changes in abundance of species over time.

Trends in information

We could have feeds that show us how our knowledge is changing. For example, we could have a map of data from the newest datasets uploaded to GBIF, the latest DNA barcodes, etc.

As an example, @wikiredlist tweets every time an article about a species from the IUCN Red List is edited on the English language Wikipedia.

Imagine several such streams, both as lists and as maps. As another example, a while ago I created a visualisation of new species discoveries:

Summary

I'm aware of the irony of drawing inspiration from a critique of dashboards, but I still think there is value in having an overview of global biodiversity. But we shouldn't lose sight of the fact that such views will be biased and constrained, and in many cases it will be much easier to visualise what is going on (or, at least, what our chosen sources reveal) than to effect change on those trends that we find most alarming.

Thursday, April 21, 2016

Searching GBIF by drawing on a map

One of my frustrations with the GBIF portal is that it is hard to drill down and search in a specific area. You have to zoom in and then click for a list of occurrences in the current bounding box of the map. You can't, for example, draw a polygon such as the boundary of a protected area and search within that area.

As a quick and dirty hack I've thrown together a demo of how it would be possible to search GBIF by drawing the search area on a map. Once a shape is drawn, we call GBIF's API to retrieve the first 300 occurrences from that area. The code is here, and below is a live demo (see also http://bl.ocks.org/rdmpage/43073981694598fecab725a16e890d3b).

This demo uses Leaflet.draw to draw shapes, and Wicket to convert the GeoJSON shape to the WKT format required by GBIF's API. I was inspired by the Leaflet.draw plugin with options set demo by d3noob, and used it as a starting point.
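
The same two steps (convert the drawn shape to WKT, then query the occurrence search API) can be sketched outside the browser. This is not the demo's code: the function names are my own, and the hand-rolled converter only handles simple GeoJSON polygons, where the demo relies on Wicket. GBIF's documentation should be checked for any requirements on ring orientation.

```python
from urllib.parse import urlencode

GBIF_SEARCH = "https://api.gbif.org/v1/occurrence/search"


def polygon_to_wkt(geometry):
    """Convert a GeoJSON Polygon geometry to a WKT POLYGON string."""
    if geometry.get("type") != "Polygon":
        raise ValueError("only GeoJSON Polygon geometries are handled")
    rings = []
    for ring in geometry["coordinates"]:
        # GeoJSON positions are [longitude, latitude, ...]; WKT wants "x y".
        points = ", ".join(f"{position[0]} {position[1]}" for position in ring)
        rings.append("(" + points + ")")
    return "POLYGON(" + ", ".join(rings) + ")"


def occurrence_search_url(geometry, limit=300):
    """Build a GBIF occurrence search URL restricted to the drawn polygon."""
    params = {"geometry": polygon_to_wkt(geometry), "limit": limit}
    return GBIF_SEARCH + "?" + urlencode(params)


# A small triangle; the first and last positions close the ring.
shape = {
    "type": "Polygon",
    "coordinates": [[[30, 10], [40, 40], [20, 40], [30, 10]]],
}

print(polygon_to_wkt(shape))
print(occurrence_search_url(shape))
```

Fetching that URL returns JSON with a `results` array of occurrences, which is what the demo plots on the map.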

Friday, April 15, 2016

GBIF and impact: CrossRef, FundRef, and Altmetric

For anyone doing research or involved in scientific infrastructure, demonstrating the "impact" of those activities is becoming increasingly important. This has fostered a growth industry in "alt metrics", tools to track how research gathers attention outside academia (of course, we can argue whether attention is the same as impact).

For an organisation such as GBIF there's a clear need to show that it has impact on the field of biodiversity (and beyond), especially to its funders (which are ultimately national governments). To do this GBIF needs to track how its data is used by the research communities, both to do science and to inform policy. This is hard to do, especially if there's a limited culture of data citation. It occurs to me that another way to tackle this problem is to invert it by looking not at the impact of GBIF, but at GBIF as a source of impact.

For a moment let's replace GBIF with Wikipedia. We can ask "what is the impact of Wikipedia on the research community?" For example, Wikipedia is the 8th largest referrer of DOIs, which means that Wikipedia is a major source of traffic to academic publishing sites. All those Wikipedia pages which cite the primary literature are driving traffic to those articles.

Conversely, if we regard Wikipedia as important we can use citations of articles in Wikipedia pages as a measure of a researcher's impact. For example, according to Impact story I am "Wikitastic" as 11 Wikipedia pages cite articles that I am an author of (authorship is discovered by using my ORCID 0000-0002-7101-9767).

Likewise, altmetric tracks citations on Wikipedia, so that a paper like the one below may have minimal social media impact but has the gray donut ring signifying that it's been cited on Wikipedia.

JENKINS, P. D., & ROBINSON, M. F. (2002, June). Another variation on the gymnure theme: description of a new species of Hylomys (Lipotyphla, Erinaceidae, Galericinae). Bulletin of The Natural History Museum. Zoology Series. Cambridge University Press (CUP) doi:10.1017/S0968047002000018

Hence, we can look at Wikipedia in two different ways. The first is to ask "what is the impact of Wikipedia?", the second is to assume that Wikipedia has impact, and then use that as one measure of the impact of researchers (how "Wikitastic" you are).

So, let's go back to GBIF. Imagine we leave aside the question of whether GBIF has impact and imagine that we can use GBIF as a measure of impact ("GBIFtastic", sorry, that was unforgivable).

Example 1: From DOI to FundRef to GBIF

In a previous post I discussed the lack of mosquito data in GBIF and how I plugged this gap by using open data cited by a paper in eLife. This paper has the DOI 10.7554/elife.08347 and if I plug that into CrossRef's search engine I can get back some information on the funders of that paper:

Research funded by Sir Richard Southwood Graduate Scholarship | Rhodes Scholarships | National Institutes of Health (RAPIDD program, R01-AI069341, R01-AI091980, R01-GM08322, N01-A1-25489) | Wellcome Trust (#095066, Vecnet, #099872) | National Aeronautics and Space Administration (#NNX15AF36G) | Biotechnology and Biological Sciences Research Council | Bill and Melinda Gates Foundation (#OPP1053338, #OPP52250) | Studienstiftung des Deutschen Volkes | Directorate-General for Research and Innovation (#21803) | European Centre for Disease Prevention and Control (ECDC/09/018)

Now, this gives me a connection between funding agencies, a paper they funded, and the data in GBIF. For example, the Bill and Melinda Gates Foundation (doi:10.13039/100000865) funded doi:10.7554/elife.08347 which generated data in GBIF doi:10.15468/7apj8n.

I suspect that the Bill and Melinda Gates Foundation don't know that they've funded data gathering that has ended up in GBIF, but I suspect they'd be interested. Especially if that could be quantified (even better if we can demonstrate reuse). The process of linking funders to data can be largely automated, especially as more and more papers are now automatically linked to funder information. The link between publications and data in GBIF can be harder to establish, but at least one publisher (Pensoft) has established a direct feed from publication to GBIF.
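
As an illustration of how that linking might be automated, the sketch below turns the funder metadata for a paper into simple (subject, relation, object) statements. It assumes a CrossRef-style record in which the paper's metadata carries a `funder` list with `name` and, where known, `DOI`; the abbreviated record here is hand-written from the details quoted in this post, and the function name and relation labels are my own.

```python
def funding_links(paper_metadata, dataset_doi):
    """Return (subject, relation, object) links from funders to paper to dataset."""
    paper = "doi:" + paper_metadata["DOI"].lower()
    links = []
    for funder in paper_metadata.get("funder", []):
        # Prefer the funder's own DOI; fall back to its name if none is given.
        funder_id = "doi:" + funder["DOI"] if "DOI" in funder else funder["name"]
        links.append((funder_id, "funded", paper))
    links.append((paper, "generated", "doi:" + dataset_doi))
    return links


# Abbreviated, hand-written record based on the funders listed above.
paper_metadata = {
    "DOI": "10.7554/eLife.08347",
    "funder": [
        {
            "name": "Bill and Melinda Gates Foundation",
            "DOI": "10.13039/100000865",
            "award": ["OPP1053338", "OPP52250"],
        },
        {"name": "Rhodes Scholarships"},
    ],
}

for link in funding_links(paper_metadata, "10.15468/7apj8n"):
    print(link)
```

Run over every dataset that GBIF can tie to a publication, statements like these would give a list of funders whose grants have, knowingly or not, produced data in GBIF.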

So, what if GBIF could computationally discover the funders of the data it holds, and could then communicate that to the funders? I think there's scope here for funders to take an interest in GBIF and its role in expanding the reuse (and hence impact) of data that funders have paid for. Demonstrating to governments that national funding agencies are supporting research that generates data that ends up in GBIF may help make the case that GBIF is worth supporting.

Example 2: GBIF as altmetric source

The little altmetric donuts that we see on papers require sources of data, such as Twitter, Wikipedia, blogs, etc. For example, the Plant List dataset I recently put into GBIF has a DOI (doi:10.15468/btkum2) and this has received some attention so it has an altmetric donut (wouldn't it be nice if GBIF showed these on dataset pages?):

What if GBIF itself became a source that altmetric scanned when measuring impact? What if having your papers mentioned in GBIF (for example, as a source of distributional data or a taxonomic name) contributed to the visible impact of that work? Wouldn't that encourage people to mobilise their data? Wouldn't that help people discover the wider conversation about the data and associated publications? Wouldn't that help generate more impact for papers that might otherwise gather less attention?

Summary

I realise that I've somewhat avoided the question of the impact of GBIF itself, which is something that also needs to be tackled (and this is one reason why GBIF assigns DOIs to datasets and downloads to support data citation), but I think that may be only a part of the bigger picture. If we assume GBIF is impactful to start with, then I think we can start to think how GBIF can help persuade researchers and funders that contributing to GBIF is a good thing.

The Zika virus, GBIF, and the missing mosquitoes

One of GBIF's goals is to provide up to date, comprehensive data on the distribution of species. Although GBIF's taxonomic and geographic scope is global, not all species are equal, in the sense that the need for information on some species is potentially much more pressing. An example is the mosquitoes of the genus Aedes, such as the species A. aegypti and A. albopictus that spread the Zika virus.

Over the last few days I discovered how poor GBIF's coverage of these two vectors is, and a way to fix that gap quickly. Like many things I work on, I stumbled across the problem by accident. GBIF has released a report on whether GBIF data are fit for modeling species distributions. The publicity material included a psychedelic image showing a map for Aedes aegypti from a recent eLife paper by Kraemer et al. (The global distribution of the arbovirus vectors Aedes aegypti and Ae. albopictus http://doi.org/10.7554/elife.08347 ).

Curious, I read the paper and the phrase "GBIF" occurs only once in the text:

we selected 10,000 occurrence records of Aedes species from the Global Biodiversity Information Facility (http://www.gbif.org), omitting all records of Ae. aegypti and Ae. albopictus. This dataset is intended to reflect biases in mosquito reporting in areas which are suitable for Aedes mosquitoes.

So, GBIF data on these two mosquitoes wasn't used. A quick look at what GBIF had for Aedes albopictus shows why GBIF data played such a small role:

[Image: map of GBIF occurrence records for Aedes albopictus before the upload]

Compare this with the data shown in the Scientific Data paper (http://doi.org/10.1038/sdata.2015.35) on the data that underpins the eLife paper.

[Image: Figure 3 of the Scientific Data paper]

Note the striking lack of any GBIF records from Brazil. Fortunately the data collected by Kraemer et al. are freely available in Dryad http://doi.org/10.5061/dryad.47v3c, so I grabbed the files, fussed about with them a bit (https://github.com/rdmpage/global-distribution-arbovirus-vectors) to get them into the format required by GBIF, and uploaded them. Below is the data for Aedes albopictus in GBIF:

[Image: updated GBIF map for Aedes albopictus (1651430)]

This is looking more like it! If you are more interested in Aedes aegypti then that data is also available.
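To give a flavour of what "getting them into the format required by GBIF" can involve, here is a minimal Python sketch that renames columns to Darwin Core terms and writes tab-delimited text. The source column names (OCCURRENCE_ID, VECTOR, X, Y, YEAR, COUNTRY) and the function names are assumptions made for illustration and may not match the Dryad files; the scripts actually used are in the GitHub repository linked above.

```python
import csv
import io

# Hypothetical mapping from source column names to Darwin Core terms.
# The source names are assumptions; check them against the real files.
COLUMN_TO_DWC = {
    "OCCURRENCE_ID": "occurrenceID",
    "VECTOR": "scientificName",
    "Y": "decimalLatitude",
    "X": "decimalLongitude",
    "YEAR": "year",
    "COUNTRY": "country",
}


def to_darwin_core(rows):
    """Yield dicts keyed by Darwin Core terms, skipping rows without coordinates."""
    for row in rows:
        if not row.get("X") or not row.get("Y"):
            continue
        yield {dwc: (row.get(src) or "").strip() for src, dwc in COLUMN_TO_DWC.items()}


def convert(source_text):
    """Convert comma-separated source text to tab-delimited Darwin Core text."""
    reader = csv.DictReader(io.StringIO(source_text))
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=list(COLUMN_TO_DWC.values()),
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(to_darwin_core(reader))
    return out.getvalue()
```

Rows lacking coordinates are dropped here purely to keep the sketch short; whether to keep them is a judgement call for whoever prepares the dataset.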

Questions

This example raises a number of questions:

  1. How come GBIF had such poor data to start with? If GBIF is going to be relevant to people who need biodiversity data, in some cases urgently, then there's an argument to be made that GBIF should be targeting species such as disease vectors that are likely to be in demand in the future.
  2. Why wasn't the latest data in GBIF? One reason GBIF's data was poor is that the relevant data was widely scattered in the literature (Kraemer et al. list over 1000 papers that they looked at, not including the unpublished sources). This clearly requires a lot of effort to assemble. But once assembled, why wasn't it deposited in GBIF? Is it a case of researchers not thinking this would be a useful thing to do, or not knowing how to do it?
  3. What about all the other data out there? This particular example was prompted by me wondering what that hideous image on the GBIF post was, reading the eLife article, wondering where the data was, and having sufficient access to GBIF to simply upload the data. This is clearly not a scalable approach. How can we improve this process? Can we automate harvesting relevant data from repositories such as Dryad so that this data gets fed into GBIF automatically? Should GBIF become a data repository itself so authors can store their data there? And how do we retrospectively harvest all the rest of the data languishing in the scientific literature?

Side note

One aspect of the Kraemer et al. data I've not focussed on is that it is derived from the literature, most of it unpublished, but some is in the primary literature (the list of papers is missing from the Dryad repository but I obtained a copy from Moritz Kraemer (@MOUGK) and it's now on github). This means we can link individual occurrence records back to the evidence for that occurrence (i.e., the paper that made the assertion that this species of mosquito is found at this locality). This means we can (a) provide provenance for the data, and (b) provide credit to the authors of that observation. I hope to explore this topic in a subsequent blog post.
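A minimal sketch of that linking, assuming each occurrence carries an identifier for its source and that the list of papers can be read into a lookup table. The `source_id` key and both function names are hypothetical; `associatedReferences` is the Darwin Core term for references associated with an occurrence.

```python
from collections import Counter


def attach_references(occurrences, references):
    """Return copies of the occurrences with associatedReferences filled in.

    occurrences: list of dicts with a hypothetical 'source_id' key.
    references: dict mapping source_id to a citation string or DOI.
    """
    linked = []
    for occurrence in occurrences:
        record = dict(occurrence)  # copy, so the input is left untouched
        citation = references.get(occurrence.get("source_id"))
        if citation:
            record["associatedReferences"] = citation
        linked.append(record)
    return linked


def credit_counts(linked):
    """Count how many occurrence records each reference supports."""
    return Counter(
        record["associatedReferences"]
        for record in linked
        if "associatedReferences" in record
    )
```

The first function covers provenance (each record points at its evidence), and the second is one crude way to express credit, as a count of records per paper.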

References

Kraemer, M. U. G., Sinka, M. E., Duda, K. A., Mylne, A., Shearer, F. M., Brady, O. J., … Hay, S. I. (2015, July 7). The global compendium of Aedes aegypti and Ae. albopictus occurrence. Scientific Data. Nature Publishing Group. http://doi.org/10.1038/sdata.2015.35

Kraemer, Moritz U. G., Sinka, Marianne E., Duda, Kirsten A., Mylne, Adrian, Shearer, Freya M., Brady, Oliver J., … Hay, Simon I. (2015). Data from: The global compendium of Aedes aegypti and Ae. albopictus occurrence. Dryad Digital Repository. http://doi.org/10.5061/dryad.47v3c

Kraemer, M. U., Sinka, M. E., Duda, K. A., Mylne, A. Q., Shearer, F. M., Barker, C. M., … Hay, S. I. (2015, June 30). The global distribution of the arbovirus vectors Aedes aegypti and Ae. albopictus . eLife. eLife Sciences Organisation, Ltd. http://doi.org/10.7554/elife.08347

The Biodiversity Heritage Library at 10: Let's talk impact interview by @UDCMRK

As part of BHL's "Celebrating 10 years of inspiring discovery through free access to biodiversity knowledge" at the NHM and Kew Gardens in London, I was interviewed by Martin Kalfatovic (@UDCMRK). We chatted about BHL, the work I've been doing on BioStor, and the future of BHL. I haven't had the courage to watch it myself, but if you want to watch an academic giving Roger Hyam a run for his money in the "flappy hands" stakes, and not knowing whether to look at the camera, Martin, or towards the distant horizon, then here is the video.

Friday, April 08, 2016

Guest post: 10 explanations for messy data, by Bob Mesibov

The following is a guest post by Bob Mesibov, who has contributed to iPhylo before.

Like many iPhylo readers, I deal with large, pre-existing compilations of biodiversity data. The compilations come from museums, herbaria, aggregation projects and government agencies. For simplicity in what follows and to avoid naming names, I'll lump all these sources into a single fictional entity, the PAI (for Projects, Agencies and Institutions).

The datasets I get from the PAI typically contain duplicate records, inconsistencies in content and format, unexplained data gaps, data in wrong fields, fields improperly used, no flagging of doubtful data, etc. Data cleaning consumes most of the time I spend on a data project. Cleaning can take weeks, analysing the cleaned data takes minutes, reporting the results of the analysis takes hours or days. (Example: doi:10.3897/BDJ.2.e1160)
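As one illustration of the kind of check involved, here is a small Python sketch for spotting likely duplicate records. It is not the method used in the example cited above; the function names and the choice of key fields are assumptions. Records are treated as duplicates when the chosen fields match after whitespace and case are normalised.

```python
from collections import defaultdict


def normalise(value):
    """Collapse runs of whitespace and ignore case."""
    return " ".join(value.split()).casefold()


def find_duplicates(records, key_fields):
    """Return lists of record indexes that share the same normalised key."""
    groups = defaultdict(list)
    for index, record in enumerate(records):
        key = tuple(normalise(record.get(field, "")) for field in key_fields)
        groups[key].append(index)
    return [indexes for indexes in groups.values() if len(indexes) > 1]
```

Which fields make up the key depends on the dataset; species name, locality and date is one plausible combination.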

I can understand how datasets get messy. Data entry errors account for a lot of the problems I see, and I make data entry errors myself. But the causes of messiness are one thing and its cure is another. The custodians of those data compilations don't seem to have done much (or any) data checking. Why not?

When I'm brave enough to ask that question, I usually get a polite response from the PAI. Here are 10 explanations I've heard for inadequate data checking and cleaning:

(1) The data are fit for use, as-is. No cleaning is needed, because the data are fit for some use, and the PAI is satisfied with that. One data manager wrote to me in an email: '...even records with lower certainty, in this case an uncertain identification, can be useful at a coarser resolution. Although we have no idea as to the reliability of the identification to the species or even genus they are likely correctly identify[ing] something as at least an animal, arthropod and possibly to class so the record is suitable for analysis at that level.'

(2) The PAI is exposing its data online. The crowd will spot any problems and tell the PAI about them.

I've previously pointed out (doi:10.3897/zookeys.293.5111) how lame this explanation is. As a strategy for data cleaning it's slow, piecemeal and wildly optimistic. At best, it accumulates data-cleaning 'tickets' with no guarantee that any will ever be closed. What I hear from the PAI is 'We're aware of problems of that kind and are hoping to find a general solution, rather than deal with a multitude of individual cases'. Years pass and the individual cases don't get fixed, so interested members of the crowd lose faith in the process and stop reporting problems.

(3) No one outside the PAI is allowed to look at the whole dataset, and no one inside the PAI has the time (or skills) to do data checking and cleaning.

This is a particularly nice Catch-22. I once offered to check a portion of the PAI's data holdings for free, and was told that PAI policy was that the dataset was not to be shared with anyone outside the PAI. The same data were freely available on the PAI's website in bits and pieces through a database search page.

(4) The PAI is migrating to new database software next year. Data cleaning will be part of the migration.

No, it won't. Note that this response isn't always simple procrastination, because it's sometimes the case that the PAI's database has only limited capabilities for data checking and editing. PAI staff are hopeful that checking and editing will be easier with the new software. They'll be disappointed.

(5) The person who manages data is on leave / was seconded to another project / resigned and hasn't been replaced yet / etc.

This is another way of saying that no one inside the PAI has the time to do data checking and cleaning. When the data manager returns to work or gets replaced, data checking and cleaning will have the same low priority it had before. That's why it didn't get done.

(6) Top management says any data cleaning would have to be done by outside specialists, but there's not enough money in the current budget to hire such people.

Not only a Catch-22, but a solid, long-term excuse, applicable in any financial year. It would cost less to train PAI staff to do the job in-house.

(7) The PAI would prefer to use a specialist data tool to clean data, like OpenRefine, but hasn't yet got up to speed on its use.

The PAI believes in magic. OpenRefine will magically clean the data without any thought required on the part of PAI staff. The magic will have to be applied repeatedly, because the sources of the duplications, gaps and errors haven't been found and squashed.

(8) The PAI staff best qualified to check and clean the data aren't allowed to do so.

IT policy strictly forbids anyone but IT staff from tinkering with the PAI database, whose integrity is sacrosanct. A very specific request from biodiversity staff may be ticketed by IT staff for action, but global checking and editing is out of the question. IT staff are not expected to understand biodiversity studies, and biodiversity staff are not expected to understand databases.

This explanation is interesting because it implies a workaround. If a biodiversity staffer can get a dump from the database as a simple text file, she can do global checking and editing of the data using the command line or a spreadsheet. The cleaned data can then be passed to IT staff for incorporation into the database as replacement data items. The day that happens, pigs will be seen flying outside the PAI windows.
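To make the workaround concrete, here is a minimal Python sketch of one global check on such a dump: tallying every distinct value in a field so that variant spellings and stray whitespace stand out. It assumes a tab-delimited text file with a header row; the function name and the field used in the usage note are hypothetical.

```python
import csv
import io
from collections import Counter


def tally_field(dump_text, field):
    """Count each distinct value of one field in a tab-delimited dump."""
    reader = csv.DictReader(io.StringIO(dump_text), delimiter="\t")
    return Counter(row[field] for row in reader)
```

For a dump read from a file, something like `tally_field(open("dump.txt", encoding="utf-8").read(), "country").most_common()` would list the values from most to least frequent; the rare ones at the bottom of the list are often the typos.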

(9) The PAI datasets have grown so big that global data checking and editing is no longer possible.

Harder, yes; impossible, no. And the datasets didn't suddenly appear, they grew by accretion. Why wasn't data checking and editing done as data was added?

(10) All datasets are messy and data users should do their own data cleaning.

The PAI shrugs its shoulders and says 'That's just the way it is, live with it. Our data are no messier than anyone else's'.

I've left this explanation for last because it begs the question. Yes, users can do their own data cleaning — because it's not that hard and there are many ways to do it. So why isn't it done by highly qualified, well-paid PAI data managers?

Towards a biodiversity knowledge graph now in RIO

After experimenting with a dynamic, online version of my notes "Towards a biodiversity knowledge graph" I've published a static version in RIO: doi:10.3897/rio.2.e8767.