Friday, November 10, 2017

Exploring images in the Biodiversity Literature Repository

A post on the Plazi blog, Expanded access to images in the Biodiversity Literature Repository, has prompted me to write up a little toy I created earlier this week.

The Biodiversity Literature Repository (BLR) is a repository of taxonomic papers hosted by Zenodo. Where possible Plazi have extracted individual images and added those to the BLR, even if the article itself is not open access. The justification for being able to do this is presented here: DOI:10.1101/087015. I'm not entirely convinced by their argument (see Copyright and the Use of Images as Biodiversity Data), but rather than rehash that argument I decided it would be much more fun to get a sense of what is in the BLR.

I built a tool to scrape data from Zenodo and store it in CouchDB, put a simple search engine on top (using the search functionality in Cloudant) to search within the figure captions, and wrote some code to use a cloud-based image server to generate thumbnails for the images in Zenodo (some of which are quite big). The tool is hosted at Heroku, you can try it out here: https://zenodo-blr-interface.herokuapp.com/.
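As a rough sketch of the scraping step, something like this can query the Zenodo REST API for BLR records (the BLR lives in the "biosyslit" community on Zenodo; the function names here are mine, and the response shape is just what the API returned when I looked):

```python
import urllib.parse

ZENODO_API = "https://zenodo.org/api/records"

def blr_search_url(query, page=1, size=20):
    """Build a Zenodo REST API search URL restricted to the BLR community."""
    params = {
        "communities": "biosyslit",  # the BLR community on Zenodo
        "q": query,                  # e.g. a museum code such as "NHMUK"
        "page": page,
        "size": size,
    }
    return ZENODO_API + "?" + urllib.parse.urlencode(params)

def extract_records(records_json):
    """Pull the basics out of a Zenodo search response (a parsed JSON dict)."""
    hits = records_json.get("hits", {}).get("hits", [])
    return [
        {
            "id": hit.get("id"),
            "doi": hit.get("doi"),
            "title": hit.get("metadata", {}).get("title"),
        }
        for hit in hits
    ]
```

Each record dict is then easy to dump into CouchDB as-is, with the caption search layered on top.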

[Screenshot, 2017-11-10]

This is not going to win any design awards, I'm simply trying to get a feel for what imagery BLR has. My initial reaction was "wow!". There's a rich range of images, including phylogenies, type specimens, habitats, and more. Searching by museum codes, e.g. NHMUK, is a quick way to discover images of specimens from various collections.

[Screenshot, 2017-11-10]

Based on this experiment there are at least two things I think would be fun to do.

Adding more images

BLR already has a lot of images, but the biodiversity literature is huge, and there's a wealth of imagery elsewhere, including journals not in BLR, and of course the Biodiversity Heritage Library (BHL). Extracting images from articles in BHL would potentially add a vast number of additional images.
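BHL serves its page scans at predictable URLs, which makes the raw material easy to get at (locating and cropping the actual figures within a scanned page is the hard part, and I'm waving my hands at that here). A minimal sketch, assuming BHL's standard page-image endpoint and page identifiers you already have:

```python
# BHL serves a scan of each page at a URL keyed by its numeric page ID.
BHL_PAGE_IMAGE = "https://www.biodiversitylibrary.org/pageimage/{page_id}"

def page_image_urls(page_ids):
    """Map BHL page identifiers to the URLs of their full page scans."""
    return [BHL_PAGE_IMAGE.format(page_id=pid) for pid in page_ids]
```

From there you'd still need some image segmentation to find the figures on each page, plus the figure captions from the OCR text.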

Machine learning

Machine learning is hot right now, and anyone using iNaturalist is probably aware of their use of computer vision to suggest identifications for images you upload. It would be fascinating to apply machine learning to images in the BLR. Even basic things would be useful, such as determining whether an image is a photo or a drawing, how many specimens are included, what the specimen orientation is, what part of the organism is being displayed, or whether the image is a map (and of what country). There's huge scope here for doing something interesting with these images.
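For example, a crude first stab at the photo-versus-drawing question doesn't even need a trained model: line drawings are mostly ink on white paper, so mean colour saturation alone separates many cases. A toy heuristic (the threshold is a guess, and it will obviously misfire on greyscale photos):

```python
def mean_saturation(pixels):
    """Mean HSV-style saturation over an iterable of (r, g, b) tuples (0-255)."""
    total, n = 0.0, 0
    for r, g, b in pixels:
        mx, mn = max(r, g, b), min(r, g, b)
        total += 0.0 if mx == 0 else (mx - mn) / mx
        n += 1
    return total / n if n else 0.0

def looks_like_drawing(pixels, threshold=0.1):
    """Crude guess: near-monochrome images are probably line art."""
    return mean_saturation(pixels) < threshold
```

Anything more interesting (specimen counts, orientation, maps) would want an actual trained classifier, but even this kind of cheap triage could help organise a big pile of figures.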

The toy I created is very basic, and merely scratches the surface of what could be done (Plazi have also created their own tool, see http://github.com/punkish/zenodeo). But spending a few minutes browsing the images is well worthwhile, and if nothing else is a reminder of both how diverse life is, and how active taxonomists are in trying to discover and describe that diversity.