Thursday, February 09, 2006

Globally Unique Identifiers

I attended the TDWG-GUID workshop on Globally Unique Identifiers (GUIDs) held at NESCent, which has issued a report. Essentially, the aim of this work is to deploy globally unique identifiers for digital objects in biodiversity informatics, such as taxon names, specimen records, images, etc. The workshop settled on LSIDs (Life Science Identifiers), which is a sensible choice.

LSIDs have been around for some time, and there is considerable software support from IBM (see their project on SourceForge). I've used them in my Taxonomic Search Engine. Not everybody is thrilled by LSIDs (see Anyone using LSID? on NodalPoint).

DOIs and Handles were also considered. I have flirted with Handles (see my comments on the iSpecies blog). DOIs have some useful properties, especially stable infrastructure, management tools, and immediate utility for the publishing industry, although they are not cheap. George Garrity uses them in his NamesforLife© project (doi:10.1601/tx.0). Long term, the biodiversity community might benefit from thinking seriously about this. The German Science Foundation has invested in providing free DOIs to the German scientific community (see Publication and Citation of Scientific Primary Data). There's also a certain irony in a blog posting talking about GUIDs and rejecting DOIs, when every reference to an external publication is made using, you guessed it, a DOI.

Regarding the workshop itself, at times I wanted to gnaw off parts of my body to retain sanity. As a result I was pretty obnoxious. My frustration stemmed partly from a feeling that the TDWG community seems determined to make life hard for themselves by placing obstacles in their path whenever possible. They also have a lot invested in XML schema, which I regard as misguided (that's being polite). Anybody who thinks XML schema are the answer to our problems should read "From XML to RDF: how semantic web technologies will change the design of 'omic' standards" (doi:10.1038/nbt1139). I nearly lost it when there was discussion of adopting LSIDs but serving the metadata in XML schema. This defeats the whole point of LSIDs. By serving RDF, we can do inference; in particular, we can easily aggregate RDF into triple stores. Populating a database becomes as easy as resolving the LSID and sucking down the metadata. Consequently, data integration suddenly looks a lot more tractable. Indeed, from the perspective of RDF, LSIDs are just another Uniform Resource Identifier (URI), albeit one which consistently resolves to RDF.
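To make the triple-store point concrete, here is a minimal sketch in Python. The LSIDs, predicate names, and values are all invented for illustration; a real system would use proper namespace URIs and an actual triple store, but the principle is the same: metadata from separately resolved LSIDs lands in one pool of triples, and cross-record queries become simple joins.

```python
# Metadata that might come back from resolving two hypothetical LSIDs.
# Each record is just a set of (subject, predicate, object) triples.
record_a = {
    ("urn:lsid:example.org:names:1", "dc:title", "Rana temporaria"),
    ("urn:lsid:example.org:names:1", "ex:hasSpecimen",
     "urn:lsid:example.org:specimens:42"),
}
record_b = {
    ("urn:lsid:example.org:specimens:42", "geo:lat", "55.86"),
    ("urn:lsid:example.org:specimens:42", "geo:long", "-4.25"),
}

# "Populating the database" is just pouring both records into one store.
store = set()
store.update(record_a)
store.update(record_b)

# A query that spans both records: find the latitude of any specimen
# linked to a name, even though name and specimen came from different LSIDs.
for subj, pred, obj in sorted(store):
    if pred == "ex:hasSpecimen":
        lats = [o for s, p, o in store if s == obj and p == "geo:lat"]
        print(subj, "->", obj, "lat:", lats)
```

The point is that no schema negotiation was needed to merge the two records; the triples simply accumulate, and the link between them falls out of shared identifiers.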

As the workshop drew to a close, I began to feel that one reason people just didn't "get" LSIDs and RDF was that there were no really cool examples of what can be done with the technology. If you just look at RDF serialised as XML, then it's not obvious what the big deal is. So we serve a different form of XML, what's the big deal? This is a little like my first impression of XML -- it just seemed like a more fussy version of HTML, so what was all the hype about? Once you see the power of the tools associated with XML (such as the parsers, XSLT and XPath), then you see the point. It can make exchanging and processing data a lot easier, and XSLT style sheets are just way kewl. The difference between XML and RDF is of this order. So, what we need are some cool applications combining LSIDs, metadata, and triple stores to show people just why this is so much more powerful than the XML schema that have obsessed the TDWG community for so long.
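A toy illustration of the tooling argument: the specimen records below are made up, but once data is XML, a generic XPath-style query pulls out exactly what you want without any hand-written parsing. This sketch uses Python's standard library ElementTree, which supports a useful subset of XPath.

```python
import xml.etree.ElementTree as ET

# A made-up specimen document; any conforming XML would do.
doc = """
<specimens>
  <specimen id="42">
    <name>Rana temporaria</name>
    <locality>Glasgow</locality>
  </specimen>
  <specimen id="43">
    <name>Bufo bufo</name>
    <locality>Kew</locality>
  </specimen>
</specimens>
"""

root = ET.fromstring(doc)
# One declarative path expression replaces a custom parser.
names = [el.text for el in root.findall("./specimen/name")]
print(names)  # ['Rana temporaria', 'Bufo bufo']
```

RDF tooling (triple stores, SPARQL-style queries, inference engines) offers the same kind of leverage over graphs of statements that XPath and XSLT offer over trees of elements.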

10 comments:

Anonymous said...

Very nice summary Rod. I enjoyed meeting you as well as everyone else at the workshop. We are working hard here in Cambridge to write some killer apps with the new LSIDs and RDF we have deployed this week.

Anonymous said...

DiGIR2 supports RDF using LSIDs. Not really a killer app, but one would hope it makes a nice foundation for a whole slew of applications that manipulate, integrate and render RDF based content in a manner similar to (but greatly expanding on) the existing DiGIR (and BioCase) network.

Anonymous said...

I did not sense, as you did, a tight marriage to XML Schema as a mechanism to distribute data in the context of LSIDs. There has been a lot of investment by this community in XML Schema, so there may be a reluctance to simply trash that effort. But I don't think the effort needs to be trashed. XML Schema do serve a function. However, most of the conversations I was involved with at the workshop seemed to point to a strong consensus that metadata would be served in RDF, should LSIDs be adopted. I think many of us who are new to RDF would like to get our heads around it a little bit -- if for no other reason than to be able to think and talk about it intelligently. But personally, I don't need to be shown a cool example of RDF in action to be persuaded. I trust your insights (and the similar insights of others) implicitly on this. Otherwise, I very much appreciated your contributions to the workshop, as well as the contributions of others.

Anonymous said...

Those enamored of RDF might wish to read Rob McCool's opinion in his column in the December and January issues of IEEE Internet Computing. Despite being one of its major contributors, he seems to believe the semantic web is failing under its own weight, and that tags have a reality edge. This may be irrelevant for some applications.

Anonymous said...

Thanks for this, Rod. I greatly appreciated your contributions. Like Rich, I am not sure that there is widespread fundamental resistance to RDF. It is really a matter of putting together examples that demonstrate its use. Most of our data are metadata in the sense of LSID, so this must affect everything that TDWG does. Darwin Core is obviously very RDF-ready and we need to tackle the other areas which currently have XML Schema models. In response to Bob's comments, I think that one of the main advantages of moving towards RDF is not the expectation of being able to derive inferences from billions of statements but rather the ability to manage our (meta-)data models in a more open way. Obviously we hope that it also becomes possible to perform reasoning over some subsets of our data, but there will be limits imposed if nothing else because of the unavoidable differences in taxonomic opinion underlying those data.

Roderic Page said...

Rob McCool's articles are online at doi:10.1109/MIC.2005.133 and doi:10.1109/MIC.2006.18. You need a subscription, although PDFs have appeared elsewhere (Part 1 and Part 2). There has been some commentary on NodalPoint. My own take on this is that tags are very cool, especially for low barrier, large scale projects where you have lots of people involved, such as social networking tools. Connotea is a good example that is relevant to scientists. However, for the kinds of projects TDWG and GBIF are interested in, I don't think tags are the answer, and I think many of the arguments Rob McCool makes lose their weight. He's really speaking to a different audience. That said, however, I think his articles are very relevant to any effort to get the many web sites created by enthusiasts to play ball. Somebody who cares about frogs, and has a gorgeous web site with a wealth of information, will likely balk at dealing with RDF, but may jump at tagging.

Anonymous said...

I think that most of the talk re serving XML from LSIDs was by way of an upgrade path rather than as a final goal. As you say (rightly or wrongly) the community has put a lot of effort into XML schemas and it worried me (and others) that tying LSIDs to RDF might mean that the LSID baby got thrown out with the RDF bathwater as the community rejected it wholesale. But I was persuaded this wouldn't happen and now I face some scepticism here at Kew about the benefits of RDF so a killer app would be good...

On the meeting itself, yes it was frustrating (and interesting and useful as well), and it struck me on my return that we might have got further had we had some professional (and neutral) facilitators. That's not to say that the chairs didn't do a good job getting us all to a decision in the end, but we are all (me included) so parti pris and bound up in the subject that herding cats didn't even come close... For the next meeting the decisions will be harder and more concrete, and there will be a lot to decide. It might help to have people who know how to facilitate useful debate and close off some of the blind alleys and circular pathways we have a tendency to wander into.

Roger Hyam said...

We mustn't confuse the effort put into the semantics in existing XML Schema based efforts with the XML Schema technology itself. Moving to an RDF way of doing things is a matter of capturing the semantics we have currently agreed on and expressing them in a more flexible way. The danger comes in thinking that in moving away from XML Schema technology we chuck out all the thinking we have done up to this point. We mustn't chuck the baby out with the bath water!

Roderic Page said...

To echo Roger's comments, my objections to XML schema do not extend to all of the work behind them. In serving RDF we need to construct some domain specific vocabularies, most obviously for describing the relationships between names and between concepts (e.g., kinds of synonyms). I've made a very crude start on this for the LSIDs I serve as part of the Taxonomic Search Engine, motivated by getting something done. My efforts should be junked as soon as possible. Those involved in drafting the TCS have thought about this a lot more than I have. Where a vocabulary exists for things that other communities are interested in (such as people, publications, geographic location, etc., e.g. FOAF, PRISM, Basic Geo (WGS84 lat/long)) then I think we should adopt those — especially as the goal is interoperability. I would urge people to avoid reinventing the wheel as much as possible; let's focus on just those elements that are truly specific to biodiversity informatics.
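As a sketch of what vocabulary reuse looks like in practice, here is a hypothetical specimen description built from borrowed terms. The specimen LSID, the blank-node label, and the literal values are invented; the Dublin Core, FOAF, and Basic Geo namespace URIs are the real, published ones.

```python
# Published namespace URIs for existing vocabularies.
DC   = "http://purl.org/dc/elements/1.1/"
FOAF = "http://xmlns.com/foaf/0.1/"
GEO  = "http://www.w3.org/2003/01/geo/wgs84_pos#"

specimen  = "urn:lsid:example.org:specimens:42"  # made-up LSID
collector = "_:collector1"                       # blank node for a person

# Every predicate here is borrowed from an existing vocabulary;
# nothing in this record needed a new, biodiversity-specific term.
triples = [
    (specimen,  DC + "title",    "Rana temporaria, Glasgow"),
    (specimen,  GEO + "lat",     "55.86"),
    (specimen,  GEO + "long",    "-4.25"),
    (specimen,  DC + "creator",  collector),
    (collector, FOAF + "name",   "A. N. Collector"),
]

for t in triples:
    print(t)
```

Only the genuinely domain-specific relationships (taxonomic synonymy, name-to-concept links, and so on) would need a new vocabulary; everything else can be stated in terms that tools elsewhere on the web already understand.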

Roderic Page said...

The correct link for the PDF of part 1 of Rob McCool's article is here (in my earlier post the link took you to part 2, doh!).