Monday, October 06, 2008

Global biogeographical data bases on marine fishes: caveat emptor

D. Ross Robertson has published a paper entitled "Global biogeographical data bases on marine fishes: caveat emptor" (doi:10.1111/j.1472-4642.2008.00519.x; the DOI is broken, but you can get the article here). The paper concludes:
Any biogeographical analysis of fish distributions that uses GIS data on marine fishes provided by FishBase and OBIS 'as is' will be seriously compromised by the high incidence of species with large-scale geographical errors. A major revision of GIS data for (at least) marine fishes provided by FishBase, OBIS, GBIF and EoL is essential. While the primary sources naturally bear responsibility for data quality, global online providers of aggregated data are also responsible for the content they serve, and cannot side-step the issue by simply including generalized disclaimers about data quality. Those providers need to actively coordinate, organize and effect a revision of GIS data they serve, as revisions by individual users will inevitably lead to confused science (which version did you use?) and a tremendous expenditure of redundant effort. To begin with, it should be relatively easy for providers to segregate all data on pelagic larvae and adults of marine organisms that they serve online. Providers should also include the capacity for users to post readily accessible public comments about the accuracy of individual records and the overall quality of individual data bases. This would stimulate improvements in data quality, and generate 'selection pressures' favouring the usage of better quality data bases, and the revision or elimination of poor-quality data bases. The services provided to the global science community by the interlinked group of online providers of biodiversity data are invaluable and should not be allowed to be discredited by a high incidence of known serious errors in GIS data among marine fishes, and, likely, other marine organisms. (emphasis added)

As I've noted elsewhere on this blog, and as demonstrated by Yesson et al.'s paper on legume records in GBIF (doi:10.1371/journal.pone.0001124) (not cited by Robertson), there are major problems with geographical information in public databases. I suspect there will be more papers like this, which I hope will inspire database providers and aggregators to take the issue seriously. (Thanks to David Patterson for spotting this paper).

3 comments:

rpg said...

Perhaps instead of writing more papers about the problem, we could discuss some solutions. In fact, GBIF has funded some folks (me included, along with Reed Beaman at Florida, Paul Flemons at the Australian Museum and Andrew Hill, also at CU) to help with this problem. The solution we are working towards works as follows:

The process involves accessing the records currently served through the GBIF network and processing their locality descriptions within our proposed system. The system is very simple and will do the following:

1. A data harvester collects records from participating providers and resources.
2. Next, these records are sent to the BioGeomancer (http://www.biogeomancer.org) web service (http://bg.berkeley.edu), which interprets the locality description and returns a latitude and longitude with an associated estimate of geographic uncertainty.
3. The new geo-referenced information is linked to the original data record and stored with additional descriptive details concerning the geo-referencing process.
4. Results are made available to providers for review, comment and action. We will provide some simple tools to help with data record presentation and evaluation.

We think setting up a pipeline that sends records to BioGeomancer, which returns a latitude-longitude-uncertainty triplet (or triplets) for each record, is the right way to georeference new records and to fix some of the existing problems.
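To make the shape of this concrete, here is a minimal Python sketch of steps 2 to 4. The endpoint path, parameter names and response fields are placeholder assumptions for illustration, not the actual BioGeomancer interface:

    import json
    import urllib.parse
    import urllib.request

    # Assumed endpoint; the real BioGeomancer service will differ.
    GEOREF_ENDPOINT = "http://bg.berkeley.edu/ws/georef"

    def georeference(locality, country=""):
        # Query the (assumed) service with a free-text locality description.
        params = urllib.parse.urlencode({"locality": locality, "country": country})
        with urllib.request.urlopen(GEOREF_ENDPOINT + "?" + params) as resp:
            result = json.load(resp)
        # Assumed response fields: lat, lng, uncertainty_m.
        return result["lat"], result["lng"], result["uncertainty_m"]

    def process(records):
        # Georeference each harvested record and link the result to the
        # original record, ready for provider review (step 4).
        for rec in records:
            lat, lng, unc = georeference(rec["locality"], rec.get("country", ""))
            rec["georef"] = {
                "lat": lat,
                "lng": lng,
                "uncertainty_m": unc,
                "method": "BioGeomancer",
                "status": "pending review",
            }
        return records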

Roderic Page said...

Rob,

I'm all for action. Regarding looking up localities, MetaCarta have some cool web services that return estimates of certainty.

But I think a lot could be done simply by looking at mutual inconsistencies in existing databases. Many of the cases I've documented would be detected by looking for conflicts between the latitude and longitude pair and the reported country.
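A crude version of that check needs nothing more than country bounding boxes. In the Python sketch below the boxes are rough toy values for three countries; a real implementation would test points against full country polygons:

    # (min_lat, max_lat, min_lon, max_lon) per ISO country code; toy values.
    COUNTRY_BBOX = {
        "AU": (-44.0, -10.0, 112.0, 154.0),   # Australia
        "BR": (-34.0, 5.5, -74.0, -34.0),     # Brazil
        "ZA": (-35.0, -22.0, 16.0, 33.0),     # South Africa
    }

    def consistent(lat, lon, country_code):
        # True if the point plausibly lies in the reported country,
        # False if it clearly does not, None if we have no box for it.
        box = COUNTRY_BBOX.get(country_code)
        if box is None:
            return None
        min_lat, max_lat, min_lon, max_lon = box
        return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon

    # A 'Brazilian' record with a sign-flipped longitude (a common error)
    # fails the check and gets flagged for review.
    print(consistent(-15.8, 47.9, "BR"))   # False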

I also think that aggregators such as GBIF have a key role to play as it's often only in aggregate that errors become obvious ("how come my points are so far away from yours?").
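As a toy illustration of that aggregate effect, pool one species' points from several providers and flag any point far from the group's median centre. The 2,000 km threshold is arbitrary, and the component-wise median is a deliberately simple choice of centre:

    from math import radians, sin, cos, asin, sqrt

    def haversine_km(lat1, lon1, lat2, lon2):
        # Great-circle distance between two points, in kilometres.
        lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
        a = sin((lat2 - lat1) / 2) ** 2 \
            + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
        return 2 * 6371 * asin(sqrt(a))

    def flag_outliers(points, threshold_km=2000):
        # points: list of (lat, lon) pooled from all providers for one species.
        # Note this simple version ignores longitude wrap-around at the
        # date line.
        lats = sorted(p[0] for p in points)
        lons = sorted(p[1] for p in points)
        mid_lat = lats[len(lats) // 2]
        mid_lon = lons[len(lons) // 2]
        return [p for p in points
                if haversine_km(p[0], p[1], mid_lat, mid_lon) > threshold_km]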

Finally, I think a key obstacle is the lack of globally unique, shared identifiers for specimens. Until we get this sorted out, it is hard to share annotations about georeferencing. The failure to have these identifiers in place is crippling progress.

Anonymous said...

I need to add to the above comment by rpg that Robertson's article is also part of his "campaign" (and that of "phylogeographers" at large) against Panbiogeography, since the availability of large-scale databases of geo-referenced specimens naturally offers huge potential for panbiogeographic analyses. I have not yet seen the article, but I will not be surprised if he uses my recent paper on hagfish biogeography, which appeared in the Journal of Biogeography, as an "example" of "errors" in using online biogeographic databases, even though we provided an assessment of exactly that issue in the paper.