Friday, September 30, 2011

Taylor and Francis Online breaks DOIs - lots of DOIs

DOIs are meant to be the gold standard in bibliographic identifiers for articles. They are not supposed to break. Yet some publishers seem to struggle to get them to work. In the past I've grumbled about BioOne, Wiley, and others as culprits with broken, duplicate, or disappearing DOIs.

Today's source of frustration is Taylor and Francis Online. T&F Online is powered by (Atypon), which recently issued this glowing press release:

SANTA CLARA, Calif.—20 September 2011—Atypon®, a leading provider of software to the professional and scholarly publishing industry, today announced that its Literatum™ software is powering the new Taylor & Francis Online platform ( Taylor & Francis Online hosts 1.7 million articles.
"The performance of Taylor & Francis Online has been excellent," said Matthew Jay, Chief Technology Officer for the Taylor & Francis Group. "Atypon has proven that it can deliver on schedule and achieve tremendous scale. We're thrilled to expand the scope of our relationship to include new products and developments."

Great, except that lots of T&F DOIs are broken. I've come across two kinds of fail.

DOI resolves to server that doesn't exist
The first is where a DOI resolves to a phantom web address. For example, the DOI doi:10.1080/00288300809509849 resolves to But the domain doesn't exist, so the DOI is a dead end.

DOI doesn't resolve
Taylor and Francis have digitised the complete Annals and Magazine of Natural History, a massive journal comprising nearly 20,000 articles from 1841 to 1966, and which has published some seminal papers, including A. R. Wallace's "On the law which has regulated the introduction of new species" (doi:10.1080/037454809495509), which forced Darwin's hand (see the Wikipedia page for the successor journal, Journal of Natural History). Taylor and Francis are to be congratulated for putting such a great resource online.

Problem is, I've not found a single DOI for any article in Annals and Magazine of Natural History that actually works. If you try to resolve the DOI for Wallace's paper, doi:10.1080/037454809495509, you get the dreaded "Error - DOI not found" web page. So something like 20,000 DOIs simply don't work. The only way to make the DOI work is append it to "", e.g. This gets us to the article, but rather defeats the purpose of DOIs.

Something is seriously wrong with CrossRef's quality control. It can't be too hard to screen all domains to see if they actually exist (this would catch the first error). It can't be too hard to take a random sample of DOIs and check that they work, or automatically check DOIs that are reported as missing. In the case of the Annals and Magazine of Natural History, the web page for the Wallace article states that it has been available online since 16 December 2009. That's a long time for a DOI to be dead.
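For what it's worth, the first of those checks is only a few lines of code. Here is a minimal Python sketch, assuming you already have a mapping from DOIs to the URLs they resolve to. The function names are my own invention (this is not anything CrossRef provides), and the resolver is passed in as a parameter so the check can be exercised without touching the network.

```python
import socket
from urllib.parse import urlparse


def domain_exists(url, resolver=socket.gethostbyname):
    """Return True if the host part of `url` resolves in DNS."""
    host = urlparse(url).hostname
    if not host:
        return False
    try:
        resolver(host)
        return True
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


def dead_targets(doi_to_url, resolver=socket.gethostbyname):
    """Return the DOIs whose target URL points at a host that does not resolve.

    `doi_to_url` is a hypothetical dict of DOI -> landing page URL.
    """
    return sorted(doi for doi, url in doi_to_url.items()
                  if not domain_exists(url, resolver))
```

This only catches phantom domains; a DOI that is missing from the resolver altogether would need a separate check that actually tries to resolve a sample of DOIs.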

There is a wealth of great content that is being made hard to find by some pretty basic screw ups. So CrossRef, Atypon and Taylor and Francis, can we please sort this out?

Wednesday, September 21, 2011

Linked data that isn't: the failings of RDF

OK, a bit of hyperbole in the morning. One of the goals of RDF is to create the Semantic Web, an interwoven network of data seamlessly linked by shared identifiers and shared vocabularies. Everyone uses the same identifiers for the same things, and when they describe these things they use the same terms. Simples.

Of course, the reality is somewhat different. Typically people don't reuse identifiers, and there are usually several competing vocabularies we can choose from. To give a concrete example, consider two RDF documents describing the same article, one provided by CiNii, the other by CrossRef. The article is:

Astuti, D., Azuma, N., Suzuki, H., & Higashi, S. (2006). Phylogenetic Relationships Within Parrots (Psittacidae) Inferred from Mitochondrial Cytochrome-b Gene Sequences(Phylogeny). Zoological science, 23(2), 191-198. doi:10.2108/zsj.23.191

You can get RDF for a CiNii record by appending ".rdf" to the URL for the article, in this case For CrossRef you need a Linked Data compliant client, or you can do something like this:

curl -D - -L -H "Accept: application/rdf+xml" ""

You can view the RDF from these two sources here and here.

No shared identifiers
The two RDF documents have no shared identifiers, or at least, any identifiers they do share aren't described in a way that is easily discovered. The CrossRef record knows nothing about the CiNii record, but the CiNii document includes this statement:

<rdfs:seeAlso rdf:resource="
&amp;id=info:doi/10.2108/zsj.23.191" dc:title="CrossRef" />

So, CiNii knows about the DOI, but this doesn't help much as the CrossRef document has the URI "", so we don't have an explicit statement that the two documents refer to the same article.

The other shared identifier the documents could share is the ISSN for the journal (0289-0003), but CiNii writes this without the "-", and uses the PRISM term "prism:issn", so we have:


whereas CrossRef writes the ISSN like this:

<ns0:issn xmlns:ns0="">

Unless we have a linked data client that normalises ISSNs before it does a SPARQL query we will miss the fact that these two articles are in the same journal.
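To make the normalisation step concrete, here is a minimal Python sketch of what such a client might do before comparing ISSNs. The function name is hypothetical; the check digit rule (weights 8 down to 2, modulo 11, with "X" standing for 10) is the standard ISSN one.

```python
import re


def normalise_issn(raw):
    """Return an ISSN in the form 'NNNN-NNNC', or None if it is not valid."""
    compact = re.sub(r"[\s-]", "", raw).upper()
    if not re.fullmatch(r"\d{7}[\dX]", compact):
        return None
    # Weighted sum of the first seven digits, weights 8 down to 2.
    total = sum(int(d) * w for d, w in zip(compact[:7], range(8, 1, -1)))
    check = (11 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    if compact[7] != expected:
        return None
    return compact[:4] + "-" + compact[4:]
```

With this, the unhyphenated "02890003" and the hyphenated "0289-0003" both come out as "0289-0003", so the two records could at least be compared on the same string.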

Inconsistent vocabularies
Both CiNii and CrossRef use the PRISM vocabulary to describe the article, but they use different versions. CrossRef uses "" whereas CiNii uses "". Version 2.1 versus version 2.0 is a minor difference, but the URIs are different and hence they are different vocabularies (having version numbers in vocabulary URIs is asking for trouble). Hence, even if CiNii and CrossRef wrote ISSNs in the same way, we'd still not be able to assert that the articles come from the same journal.
Inconsistent use of vocabularies
Both CiNii and CrossRef use FOAF for author names, but they write the names differently:

<foaf:name xml:lang="en">Suzuki Hitoshi</foaf:name>

<ns0:name xmlns:ns0="">Hitoshi Suzuki</ns0:name>

So, another missed opportunity to link the documents. One could argue this would be solved if we had consistent identifiers for authors, but we don't. In this case CiNii have their own local identifiers (e.g., and CrossRef has a rather hideous looking Skolemisation:

In summary, it's a mess. Both CiNii and CrossRef are organisations whose core business is bibliographic metadata. It's great that both are serving RDF, but if we think this is anything more than providing metadata in a useful format I think we may be deceiving ourselves.

Tuesday, September 20, 2011

Orwellian metadata: making journals disappear

I've been spending a lot of time recently mapping bibliographic citations for taxonomic names to digital identifiers (such as DOIs). This is tedious work at the best of times (despite lots of automation), but it is not helped by the somewhat Orwellian practices of some publishers. Occasionally when an established journal gets renamed the publisher retrospectively applies the new name to the previous journal. For example, in 2000 the journal Entomologica Scandinavica (ISSN 0013-8711) became Insect Systematics & Evolution (ISSN 1399-560X):

(diagram based on WorldCat xISSN history tool, rendered using Google Charts.)

Content for both Entomologica Scandinavica and Insect Systematics & Evolution is available from Ingenta's web site, but every article is listed as being in Insect Systematics & Evolution, and this is reflected in the metadata CrossRef has for each DOI.

For example, the paper
Andersen, N.M. & P.-p. Chen, 1993. A taxonomic revision of pondskater genus Gerris Fabricius in China, with two new species (Hemiptera: Gerridae). – Entomologica Scandinavica 24: 147-166

has the DOI doi:10.1163/187631293X00262 which resolves to a page saying this article was published in Insect Systematics & Evolution. The XML for the DOI says the same thing:

<issn type="print">1399560X</issn>
<issn type="electronic">1876312X</issn>
<journal_title>Insect Systematics & Evolution</journal_title>

In one sense this is no big deal. If you know the DOI then that's all you need to use to refer to the article (and the sooner we abandon fussing with citation styles and just use DOIs the better).

But if you haven't yet found the DOI then this is a problem, because if I search CrossRef using the original journal name (Entomologica Scandinavica) I get nothing. As far as CrossRef is concerned the DOI doesn't exist. If, however, I happen to know that Entomologica Scandinavica is now Insect Systematics & Evolution, I rewrite the query and I retrieve the DOI.

It's bad enough dealing with taxonomic name changes without having to deal with journal name changes as well! It would be great if publishers didn't indulge in wholesale renaming of old journals, or if CrossRef had a mechanism (perhaps based on WorldCat's xISSN History Visualization Tool) to handle retrospectively renamed journals.
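In the meantime the workaround on my side is a lookup table. The Python sketch below is only an illustration: the table is hand-built and hypothetical (it holds just the one renaming discussed in this post), and the function name is made up. The idea is simply to expand a journal title into the list of titles worth trying in a search.

```python
# Hypothetical hand-built table of retrospective renamings, keyed by the
# original ISSN. The ISSNs and titles are the ones quoted in this post.
RENAMED = {
    "0013-8711": {
        "old_title": "Entomologica Scandinavica",
        "new_title": "Insect Systematics & Evolution",
        "new_issn": "1399-560X",
    },
}


def titles_to_try(title, renamed=RENAMED):
    """Return the title itself plus any later title it has been renamed to."""
    wanted = title.strip().lower()
    candidates = [title]
    for entry in renamed.values():
        if entry["old_title"].lower() == wanted:
            candidates.append(entry["new_title"])
    return candidates
```

A search script would then loop over `titles_to_try("Entomologica Scandinavica")` and stop at the first title that returns a DOI.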

Thursday, September 15, 2011

Anchoring Biodiversity Information: from Sherborn to the 21st century and beyond

Next month I'll be speaking in London at The Natural History Museum at a one day event Anchoring Biodiversity Information: From Sherborn to the 21st century and beyond. This meeting is being organised by the International Commission on Zoological Nomenclature and the Society for the History of Natural History, and is partly a celebration of Charles Davies Sherborn's major work Index Animalium and partly a chance to look at the future of zoological nomenclature.

Details are available from the ICZN web site. I'll be giving a talk entitled "Towards an open taxonomy" (no, I don't know what I mean by that either). But it should be a chance to rant about the failure of taxonomy to embrace the Interwebs.


Wednesday, September 14, 2011

I think I now "get" the Encyclopedia of Life

The Encyclopedia of Life (EOL) has been relaunched, with a new look and much social media funkiness. I've been something of an EOL sceptic, but looking at the new site I think I can see what EOL is for. Ironically, it's not really about E. O. Wilson's original vision (doi:10.1016/S0169-5347(02)00040-X):
Imagine an electronic page for each species of organism on Earth, available everywhere by single access on command. The page contains the scientific name of the species, a pictorial or genomic presentation of the primary type specimen on which its name is based, and a summary of its diagnostic traits. The page opens out directly or by linking to other data bases, such as ARKive, Ecoport, GenBank and MORPHOBANK. It comprises a summary of everything known about the species’ genome, proteome, geographical distribution, phylogenetic position, habitat, ecological relationships and, not least, its practical importance for humanity.
We still lack a decent database that does this. EOL tries, but in my opinion still falls short, partly because it isn't nearly aggressive enough in harvesting and linking data (links to the primary literature anyone?), and has absolutely no notion of phylogenetics.

In terms of doing science I don't see much that I'd want to do with EOL, as opposed, say, to Wikipedia or existing taxonomic databases. But thinking about other applications, EOL has a lot of potential. One nice feature is the ability to make "collections". For example, Cyndy Parr has created a collection called Fascinating textures, which is simply a collection of images in EOL (I've included some below):

What is nice about this is that it cuts across any existing classification and assembles a set of taxa that share nothing other than having "fascinating textures". This ability to tag taxa means we could create all sorts of interesting sets of taxa based on criteria that are meaningful in a particular context. For example, egotist that I am, I created a collection called Taxa described by Roderic Page, which includes the one crab and 6 bopyrid isopods that I described in the 80's.

Putting on my teaching hat, I'm involved in teaching a course on animal diversity and could imagine assembling collections of taxa relevant to a particular lecture (either taxonomically, or based on some other criteria, such as all parasites of a particular taxon, or all organisms found associated with deep sea vents). Other collections could be built by people or organisations with content. For example, lists of top ten new species, lists of species for which the BBC has content, etc.

In this sense, EOL becomes a tagging service for life, a bit like delicious. The social network side of things is still a little clunky —there doesn't seem to be a notion of "contacts" or "friends", and it needs integration with existing social networks — but I think I now "get" what EOL is for.

Tuesday, September 13, 2011

Phantom articles: why Mendeley needs to make duplication transparent

Browsing Mendeley I found the following record: This URL is for a paper
Costa, J. M., & Santos, T. C. (2008). Description of the larva of. Zootaxa, 99(2), 129-131
which apparently has the DOI doi:10.1645/GE-2580.1. This is strange because Zootaxa doesn't have DOIs. The DOI given resolves to a paper in the Journal of Parasitology:
Harriman, V. B., Galloway, T. D., Alisauskas, R. T., & Wobeser, G. A. (2011). Description of the larva of Ceratophyllus vagabundus vagabundus (Siphonaptera: Ceratophyllidae) from nests of Rossʼs and lesser snow geese in Nunavut, Canada. The Journal of parasitology, 93(2), 197-200
Now, this paper has its own record in Mendeley.

OK, so this is weird..., but it gets weirder. If you look at the Mendeley page for this chimeric article there is a PDF preview of yet another article:
LOPES, Maria José Nascimento; FROEHLICH, Claudio Gilberto and DOMINGUEZ, Eduardo (2003). Description of the larva of Thraulodes schlingeri (Ephemeroptera, Leptophlebiidae). Iheringia, Sér. Zool. 92(2), 197-200 2003 doi:10.1590/S0073-47212003000200011

But it gets even more interesting. The abstract for the phantom Zootaxa article belongs to yet another paper:
Marques, K. I. D. S., & Xerez, R. D. Description of the larva of Popanomyia kerteszi James & Woodley (Diptera: Stratiomyidae) and identification key to immature stages of Pachygastrinae. Neotropical Entomology, 38(5), 643-648.
which also exists in Mendeley.

To investigate further I used Mendeley's API to retrieve this record (I had to look at the source of the web page to find the internal identifier used by Mendeley, namely 010c48d0-edb5-11df-99a6-0024e8453de6 to do this, why does Mendeley hide these?). Here's the abbreviated JSON for this record.

"website": "http:\/\/\/pubmed\/21506868",
"identifiers": {
"pmid": "21506868",
"issn": "19372345",
"doi": "10.1645\/GE-2580.1"
"issue": "2",
"pages": "129-131",
"public_file_hash": "fe7eed3f6c43a3be1480a0937229b9ad33666df4",
"publication_outlet": "Zootaxa",
"type": "Journal Article",
"mendeley_url": "http:\/\/\/research\/description-larva\/",
"uuid": "010c48d0-edb5-11df-99a6-0024e8453de6",
"authors": [
"forename": "J M",
"surname": "Costa"
"forename": "T C",
"surname": "Santos"
"title": "Description of the larva of",
"volume": "99",
"year": 2008,
"categories": [
"oa_journal": false

Doesn't add much to the story, but does give us the sha1 for the PDF for the chimeric article (fe7eed3f6c43a3be1480a0937229b9ad33666df4). If I download the PDF for the article in Iheringia, Sér. Zool. it has the same sha1:

openssl sha1 a11v93n2.pdf
SHA1(a11v93n2.pdf)= fe7eed3f6c43a3be1480a0937229b9ad33666df4
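If you'd rather do the same comparison in a script, here is a minimal Python sketch using the standard library's hashlib. The function name is my own; the result is a hex digest that can be compared directly with the "public_file_hash" value in the JSON above.

```python
import hashlib


def sha1_of_file(path, chunk_size=65536):
    """Return the SHA-1 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For example, `sha1_of_file("a11v93n2.pdf") == "fe7eed3f6c43a3be1480a0937229b9ad33666df4"` would be the scripted version of the check I did by hand.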

This article doesn't exist
So, to summarise, this paper doesn't exist. It is credited to a journal that doesn't have DOIs, the DOI resolves to an article in a different journal, the abstract comes from another article in another journal, and the PDF is from a third article. OMG!

This is just weird
So, something about the way Mendeley merges references is broken. Merging references is a tough problem so there will always be cases where things go wrong. But it would be really, really helpful if Mendeley could display the set of articles that it has merged to create each canonical reference (say by listing the UUIDs for each article). Users could then see if badness had happened, and provide feedback, for example by highlighting references that are clearly the same, and those that are clearly different. Until this happens I'm a bit nervous about trusting Mendeley with my bibliographic data; I don't want it mangled into chimeric papers that don't exist.

Rethinking citation matching

Some quick half-baked thoughts on citation matching. One of the things I'd really like to add to BioStor is the ability to parse article text and extract the list of literature cited. Not only would this be another source of bibliographic data I can use to find more articles in BHL, but I could also build citation networks for articles in BioStor.

Citation matching is a tough problem (see the papers below for a starting point).

Citation::Multi::Parser is a group in Computer and Information Science on Mendeley.

To date my approach has been to write various regular expressions to extract citations (mainly from web pages and databases). The goal, in a sense, is to discover the rules used to write the citation, then extract the component parts (authors, date, title, journal, volume, pagination, etc.). It's error prone: the citation might not exactly follow the rules, and there might be errors (e.g., from OCR). There are more formal ways of doing this (e.g., using statistical methods to discover which set of rules is most likely to have generated the citation), but these can get complicated.

It occurs to me another way of doing this would be the following:
  1. Assume, for argument's sake, we have a database of most of the references we are likely to encounter.
  2. Using the most common citation styles, generate a set of possible citations for each reference.
  3. Use approximate string matching to find the closest citation string to the one you have. If the match is above a certain threshold, accept the match.

The idea is essentially to generate the universe of possible citation strings, and find the one that's closest to the string you are trying to match. Of course, this universe could be huge, but if you restrict it to a particular field (e.g., taxonomic literature) it might be manageable. This could be a useful way of handling "microcitations". Instead of developing regular expressions or other tools to discover the underlying model, generate a bunch of microcitations that you expect for a given reference, and string match against those.
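Here is a rough Python sketch of the three steps, just to show the shape of the idea. Everything in it is an assumption on my part: the three citation templates, the 0.8 threshold, and the use of difflib's SequenceMatcher as the approximate string matcher (a real implementation would want something that scales better than comparing against every candidate). The two references are ones mentioned elsewhere on this blog.

```python
from difflib import SequenceMatcher

# Step 1: a (tiny) database of known references.
REFERENCES = [
    {"authors": ["Andersen, N.M.", "Chen, P.-p."], "year": 1993,
     "title": "A taxonomic revision of pondskater genus Gerris Fabricius in "
              "China, with two new species (Hemiptera: Gerridae)",
     "journal": "Entomologica Scandinavica", "volume": 24, "pages": "147-166"},
    {"authors": ["Astuti, D.", "Azuma, N.", "Suzuki, H.", "Higashi, S."],
     "year": 2006,
     "title": "Phylogenetic Relationships Within Parrots (Psittacidae) "
              "Inferred from Mitochondrial Cytochrome-b Gene Sequences",
     "journal": "Zoological Science", "volume": 23, "pages": "191-198"},
]


def render_citations(ref):
    """Step 2: generate plausible citation strings (illustrative templates)."""
    authors = ", ".join(ref["authors"])
    return [
        f'{authors} ({ref["year"]}). {ref["title"]}. '
        f'{ref["journal"]}, {ref["volume"]}, {ref["pages"]}',
        f'{authors}, {ref["year"]}. {ref["title"]}. '
        f'{ref["journal"]} {ref["volume"]}: {ref["pages"]}',
        # A "microcitation": journal, volume and pages only.
        f'{ref["journal"]} {ref["volume"]}: {ref["pages"]}',
    ]


def best_match(query, references, threshold=0.8):
    """Step 3: return (reference, score) for the closest generated string.

    The reference is None when the best score is below the threshold.
    """
    best_ref, best_score = None, 0.0
    query = query.lower()
    for ref in references:
        for candidate in render_citations(ref):
            score = SequenceMatcher(None, query, candidate.lower()).ratio()
            if score > best_score:
                best_ref, best_score = ref, score
    if best_score < threshold:
        return None, best_score
    return best_ref, best_score
```

Calling `best_match("Entomologica Scandinavica 24:147-166", REFERENCES)` should pick out the Andersen and Chen paper even though the spacing doesn't match any template exactly.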

Might not be elegant, but I suspect it would be fast.

More BHL app ideas

Following on from my previous post on BHL apps and a Twitter discussion in which I appealed for a "sexier" interface for BHL (to which @elyw replied that is what BHL Australia were trying to do), here are some further thoughts on improving BHL's web interface.
Build a new interface
A fun project would be to create a BHL website clone using just the BHL API. This would give you the freedom to explore interface ideas without having to persuade BHL to change its site. In a sense, the app would provide the persuasion.

Third party annotations
It would be nice if the BHL web site made use of third party annotations. For example, BHL itself is extracting some of the best images and putting them on Flickr. How about if you go to the page for an item in BHL and you see a summary of the images from that item in Flickr? At a glance you can see whether the item has some interesting content. For example, if you go to you see this:

N2 w1150

which gives you no idea that it contains images like this:

n24_w1150

Tables of contents
Another source of annotations is my own BioStor project, which finds articles in scanned volumes in BHL. If you are looking at an item in BHL it would be nice to see a list of articles that have been found in that item, perhaps displayed in a drop down menu as a table of contents. This would help provide a way to navigate through the volume.

Who links to BHL?
When I suggested third party annotations on Twitter @stho002 chimed in asking about Wikispecies, Species-ID, ZooBank, etc. These resources are different, in that they aren't repurposing BHL content but are linking to it. It would be great if a BHL page for an item could display reverse links (i.e., the pages in those external databases that link to that BHL item).

Implementing reverse links (essentially citation linking) can be tricky, but two ways to do it might be:
  1. Use BHL web server logs to find and extract referrals from those projects
  2. Perhaps more elegantly, encourage external databases to link to BHL content using an OpenURL which includes the URL of the originating page. OpenURL can be messy, but especially in Mediawiki-based projects such as Wikispecies and Species-ID it would be straightforward to make a template that generated the correct syntax. In this way BHL could harvest the inbound links and display them on the item page.
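As an illustration of the first option, here is a minimal Python sketch that pulls referrers out of web server log lines. It assumes the logs are in the common Apache "combined" format, which I haven't confirmed for BHL; the host name and item path in the usage note are made up.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

# Matches the request, status, size and referrer fields of a "combined"
# format log line, e.g.: "GET /item/123 HTTP/1.1" 200 5120 "http://..."
LOG_LINE = re.compile(
    r'"[A-Z]+ (?P<path>\S+) [^"]*" \d{3} \S+ "(?P<referer>[^"]*)"')


def reverse_links(log_lines, hosts_of_interest):
    """Map each requested path to the set of referring pages on given hosts."""
    links = defaultdict(set)
    for line in log_lines:
        match = LOG_LINE.search(line)
        if not match:
            continue
        referer = match.group("referer")
        if urlparse(referer).hostname in hosts_of_interest:
            links[match.group("path")].add(referer)
    return links
```

Given a line whose request is `GET /item/123` and whose referrer is `http://wiki.example.org/wiki/Some_taxon`, calling `reverse_links(lines, {"wiki.example.org"})` would record that wiki page as a reverse link for `/item/123`.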

Monday, September 12, 2011

Duplicate DOIs for the same article: alias DOIs, who knew?

As part of a project to map taxonomic citations to bibliographic identifiers I'm tackling strings like this (from the ION record for Pseudomyrmex crudelis):

Systematics, biogeography and host plant associations of the Pseudomyrmex viduus group (Hymenoptera: Formicidae), Triplaris- and Tachigali-inhabiting ants. Zoological Journal of the Linnean Society, 126(4), August 1999: 451-540. 516 [Zoological Record Volume 136]

I parse the string into its components (e.g., journal, volume, issue, pagination) and use scripts to locate identifiers such as DOIs. I regard DOIs as the gold standard for bibliographic identifiers. They are (usually) unique, and CrossRef provides some really useful services to support them (DOIs now also support linked data if you are into that sort of thing). Occasionally there are problems, such as duplicate DOIs when material moves from a publisher's site to, say, JSTOR. And some publishers are really, really bad at releasing DOIs that don't resolve. For example, Taylor & Francis Online have at least 18,000 DOIs for the Annals and Magazine of Natural History that don't resolve (e.g., doi:10.1080/00222933809512318 for this paper).
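To give a flavour of the parsing step, here is a minimal Python sketch that handles the ION-style string quoted above. The regular expression is an illustration written for this one layout (title, journal, volume(issue), month year: pages), not the set of rules my scripts actually use, and the function name is made up.

```python
import re

# Title. Journal, volume(issue), Month year: start-end.
ION_CITATION = re.compile(
    r"^(?P<title>.+?)\.\s+"
    r"(?P<journal>[^,.]+),\s*"
    r"(?P<volume>\d+)\((?P<issue>\d+)\),\s*"
    r"(?P<month>[A-Za-z]+)\s+(?P<year>\d{4}):\s*"
    r"(?P<start>\d+)-(?P<end>\d+)\."
)


def parse_ion_citation(text):
    """Return a dict of citation components, or None if the layout differs."""
    match = ION_CITATION.match(text)
    return match.groupdict() if match else None
```

For the Pseudomyrmex string this should yield journal "Zoological Journal of the Linnean Society", volume "126", issue "4", year "1999" and pages 451 to 540, which is enough to query for a DOI.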

Sometimes my automated scripts for finding DOIs fail and I have to resort to Googling. To my surprise, I found two versions of the paper "Systematics, biogeography and host plant associations of the Pseudomyrmex viduus group (Hymenoptera: Formicidae), Triplaris- and Tachigali-inhabiting ants", each with a different DOI:

Now, this isn't supposed to happen. Interestingly, if you resolve doi:10.1006/zjls.1998.0158, either on the web or using CrossRef's OpenURL resolver, you get the page/metadata for doi:10.1111/j.1096-3642.1999.tb00157.x.

To see what was going on I fired up my local installation of Tony Hammond's OpenHandle tool (see and entered the Elsevier DOI (10.1006/zjls.1998.0158) and got this:

"comment" : "OpenHandle (JSON) - see" ,
"handle" : "hdl:10.1006/zjls.1998.0158" ,
"handleStatus" : {
"code" : "1" ,
"message" : "SUCCESS"
} ,
"handleValues" : [
"index" : "100" ,
"type" : "HS_ADMIN" ,
"data" : {
"adminRef" : "hdl:10.1006/zjls.1998.0158?index=100" ,
"adminPermission" : "111111110111"
} ,
"permission" : "1110" ,
"ttl" : "+86400" ,
"timestamp" : "Thu Apr 13 19:09:03 BST 2000" ,
"reference" : []
} ,
"index" : "1" ,
"type" : "URL" ,
"data" : "" ,
"permission" : "1110" ,
"ttl" : "+86400" ,
"timestamp" : "Tue Aug 12 16:43:12 BST 2003" ,
"reference" : []
} ,
"index" : "700050" ,
"type" : "700050" ,
"data" : "20030811104844000" ,
"permission" : "1110" ,
"ttl" : "+86400" ,
"timestamp" : "Tue Aug 12 16:43:16 BST 2003" ,
"reference" : []
} ,
"index" : "1970" ,
"type" : "HS_ALIAS" ,
"data" : "10.1111/j.1096-3642.1999.tb00157.x" ,
"permission" : "1110" ,
"ttl" : "+86400" ,
"timestamp" : "Mon Aug 25 21:06:50 BST 2008" ,
"reference" : []

The interesting bit is the "HS_ALIAS" at the bottom. I'd not come across this before, although it's in the spec (RFC 3651) for all to see (yeah, but who reads those?). The handle system that underlies DOIs has a mechanism to support aliases, so that a DOI that originally pointed to a web page (say, for an article) can be redirected to point to another DOI. In this case, the Elsevier DOI redirects to the Wiley DOI ("10.1111/j.1096-3642.1999.tb00157.x" in the HS_ALIAS section), so the user ends up at Wiley's page for this article, not Elsevier's. This provides a way to accommodate changes in article ownership, without requiring an existing publisher to reuse the previous publisher's DOI.
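Checking for an alias is easy to script once you have the handle record. The Python sketch below assumes the record has already been fetched and parsed into a dict with the same shape as the OpenHandle JSON above (a "handleValues" list of entries with "type" and "data"); the function names and the idea of holding records in a local dict are my own simplifications.

```python
def alias_target(handle_record):
    """Return the DOI named in a record's HS_ALIAS value, or None."""
    for value in handle_record.get("handleValues", []):
        if value.get("type") == "HS_ALIAS":
            return value.get("data")
    return None


def canonical_doi(doi, records, max_hops=10):
    """Follow HS_ALIAS values until a DOI with no alias is reached.

    `records` is a hypothetical dict of DOI -> parsed handle record.
    The `seen` set and `max_hops` guard against alias loops.
    """
    seen = set()
    current = doi
    while current not in seen and len(seen) < max_hops:
        seen.add(current)
        target = alias_target(records.get(current, {}))
        if target is None:
            return current
        current = target
    return current
```

Two DOIs that give the same `canonical_doi` can then be treated as the same article rather than as a duplicate.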

In one sense this seems to defeat the point of DOIs, namely that they are effectively opaque identifiers that any publisher should be able to host. Perhaps in this case the issue is that the DOI prefix ("10.1006" and "10.1111" for Elsevier and Wiley, respectively) corresponds to a publisher, and when something goes wrong with a DOI it's easier to identify who is responsible based on this prefix, rather than the individual DOI.

In any event, next time I come across a duplicate DOI I'll need to check whether it is an alias of another DOI before launching into another rant about the (occasional) failings of DOIs.

Wednesday, September 07, 2011

Suggested apps for BHL's Life and Literature Code Challenge

Since I won't be able to be at the Biodiversity Heritage Library's Life and Literature meeting I thought I'd share some ideas for their Life and Literature Code Challenge. The deadline is pretty close (October 17) so having ideas now isn't terribly helpful I admit. That aside, here are some thoughts inspired by the challenge. In part this post has been inspired by the Results of the PLoS and Mendeley "Call for Apps", where PLoS and Mendeley asked for people (not necessarily developers) to suggest the kind of apps they'd like to see. As an aside, one thing conspicuous by its absence is a prize for winning the challenge. PLoS and Mendeley have an "API Binary Battle" with a prize of $US 10,001, which seems more likely to inspire people to take part.

Visual search engine
I suspect that many BHL users are looking for illustrations (exemplified by the images being gathered in BHL's Flickr group). One way to search for images would be to search within the OCR text for figure and plate captions, such as "Fig. 1". Indexing these captions by taxonomic name would provide a simple image search tool. For modern publications most figures are on the same page as the caption, but for older publications with illustrations as plates, the caption and corresponding image may be separated (e.g., on facing pages), so the search results might need to show pages around the page containing the caption. As an aside, it's a pity the Flickr images only link to the BHL item and not the BHL page. If they did the latter, and the images were tagged with what they depict, you could create a visual search engine using the Flickr API (of course, this might be just the way to implement the visual search engine: harvest images, tag them with PageID and taxon names, upload to Flickr).
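A first pass at the caption search could be as simple as the Python sketch below. It assumes you have the OCR text for each page keyed by some page identifier; the regular expression only knows about lines starting with "Fig.", "Figure" or "Plate" followed by an arabic or roman number, and both function names are made up. Real OCR text will be messier than this.

```python
import re

# A line that starts with Fig./Figure/Plate and a number (arabic or roman).
CAPTION = re.compile(
    r"^[ \t]*(?:Fig|FIG|Figure|FIGURE|Plate|PLATE)\.?[ \t]+"
    r"(?P<label>\d+|[IVXLC]+)\b[.:,]?[ \t]*(?P<text>.*)$",
    re.MULTILINE,
)


def find_captions(pages):
    """Return (page_id, label, caption text) for each caption-like line.

    `pages` is a hypothetical dict of page identifier -> OCR text.
    """
    hits = []
    for page_id, text in pages.items():
        for match in CAPTION.finditer(text):
            hits.append((page_id, match.group("label"),
                         match.group("text").strip()))
    return hits


def captions_mentioning(name, pages):
    """Return the captions whose text contains the given taxon name."""
    return [hit for hit in find_captions(pages)
            if name.lower() in hit[2].lower()]
```

Because each hit carries its page identifier, the search results could show the neighbouring pages as well, which is what you'd want for plates on facing pages.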

Mobile interface
The BHL web site doesn't look great on an iPhone. It makes no concessions to the mobile device, and there are some weird things such as the way the list of pages is rendered. A number of mainstream science publishers are exploring mobile versions of their web sites, for example Taylor and Francis have a jQuery Mobile powered interface for mobile users. I've explored iPad interfaces to scientific articles in previous posts. BHL content poses some challenges, but is fundamentally the same as viewing PDFs: you have fixed pages that you may want to zoom.

OCR correction
There is a lot of scope for cleaning up the OCR text in BHL. Part of the trick would be to have a simple user interface for people to contribute to this task. In an earlier post I discussed a Firefox hOCR add-on that provides a nice way to do this. Take this as a starting point, add a way to save the cleaned up text, and you'd be well on the way to making a useful tool.

Taxon name timeline
Despite the shiny new interface, the Encyclopedia of Life still displays BHL literature in the same clunky way I described in an earlier blog post. It would be great to have a timeline of the usage of a name, especially if you could compare the usage of different names (such as synonyms). In many ways this is the BHL equivalent of the Google Books Ngram Viewer.
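The data behind such a timeline is just a count of name occurrences per year. Here is a minimal Python sketch, assuming you have already extracted (name, year) pairs from BHL, one pair per page on which the name was found; the function name and input shape are my own, and the names in the usage note are placeholders.

```python
from collections import Counter


def name_timeline(occurrences, names):
    """Count occurrences per year for each name of interest.

    `occurrences` is a hypothetical iterable of (name, year) pairs.
    Returns a dict of name -> Counter mapping year -> number of pages.
    """
    timeline = {name: Counter() for name in names}
    for name, year in occurrences:
        if name in timeline:
            timeline[name][year] += 1
    return timeline
```

Calling `name_timeline(pairs, ["Old name", "New name"])` gives two sets of yearly counts that can be plotted on the same axes, which is all a comparison of synonyms needs.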

These are just a few hastily put together thoughts. If you have any other ideas or suggestions, feel free to add them as comments below.

