More OAI-PMH vs Atom vs Sitemaps Or “Why I’m A Bit Down On OAI-PMH”

August 29, 2007

There were a couple of comments on Andy Powell’s reply post to my post comparing OAI-PMH, Atom and sitemaps for repository harvesting that make it worth revisiting the issue (sorry I didn’t pick them up at the time; I failed to add the conversation to co.mments). Scott pointed out that having link-only feeds is useless for humans. I agree; I was thinking too narrowly about machine clients. Lars Kirchhoff asks:

Isn’t that [an efficient harvesting API] actually what OAI-PMH is already?

So I would think it would be easier to strip down OAI-PMH for the general purpose use of web resource representation.

The way I see it, because the archives and repositories community is small by comparison with the rest of the web community, if we have a choice between doing something the web way or using a specialised mechanism, it behoves us to do it the web way. This is the main reason OAI-PMH didn’t feature much in the original post: if it’s possible to use web standards to harvest (and it is!), we should. Scott again:

Pragmatically, any repository owner today is going to have to do both OAI-PMH and Atom. Hardly a hardship, though, is it?

Today, probably, but will they have to in the future? The search engines would prefer Sitemaps, and are perfectly content to crawl the repository if they can’t get one. Are there essential services that wouldn’t re-engineer to use Atom or Sitemaps if those were more widely used?

Is it a hardship to use OAI-PMH? Well, certainly not if you’re using repository software that already has an implementation! However, I’m increasingly convinced that repositories are a long tail problem in terms of the software needed: the modal cases are extremely well supported by the current crop of IR software, but there’s an awful lot of content that needs specialist handling and curation.

Take CrystalEye as an example – our repository of crystallography data. The software running CrystalEye is heavy on domain specific logic and visualisation, and needs very few of the features offered by IR platforms. I’d still like it to interoperate with other systems so that chemistry specific aggregators can harvest it, so that our IR service could keep a dark archive of the contents and provide preservation services, so that Google can pick up and index the chemical identifiers in the text.

This is only one of any number of systems that will make up the repository landscape in the future. This being the case, interoperability will only come from adopting the smallest possible number of the simplest, most widely adopted standards.

As usual, I’ve ended up a fair distance away from where I intended to go with this post, which was the (not very new) news that this proposal to extend Atom to formalise links between feed documents (next, prev, last, first) has been promoted to “Proposed Standard” by the IESG. I’m not sufficiently familiar with the IETF process to guess what this means in terms of getting the RFC updated. The extensions would allow “sliding window” access to a feed, which means that standards-compliant feeds can be used for reliable harvesting: if your client goes down, or if the polling rate is slow and it misses entries in the feed, it can obtain them by accessing “previous” feed documents.
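
To make the “sliding window” idea concrete, here is a rough sketch of how a harvester might catch up after missing some polls. It is only an illustration under several assumptions of mine: the link relation is spelled “previous” (with “prev” accepted as a fallback), entries are listed newest first within each feed document, and the `harvest` function, the `page` helper and the example.org URLs are all made up for the example. The `fetch` argument is any callable that returns the feed XML for a URL, so the sketch runs against an in-memory feed rather than a real repository.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"


def harvest(start_url, fetch, last_seen_id=None, rels=("previous", "prev")):
    """Return ids of entries newer than last_seen_id, newest first.

    Starts at the current feed document and follows "previous" links
    until it meets the last entry already harvested or runs out of pages.
    Assumes entries are ordered newest first (an assumption of this sketch).
    """
    new_ids = []
    visited = set()
    url = start_url
    while url and url not in visited:  # guard against link loops
        visited.add(url)
        feed = ET.fromstring(fetch(url))
        for entry in feed.findall(ATOM + "entry"):
            entry_id = entry.findtext(ATOM + "id")
            if entry_id == last_seen_id:
                return new_ids  # caught up with the previous harvest
            new_ids.append(entry_id)
        # Move to the preceding feed document, if the feed links to one.
        url = next(
            (link.get("href") for link in feed.findall(ATOM + "link")
             if link.get("rel") in rels),
            None,
        )
    return new_ids


def page(ids, previous=None):
    """Build a tiny Atom feed document (hypothetical test data)."""
    link = f'<link rel="previous" href="{previous}"/>' if previous else ""
    entries = "".join(f"<entry><id>{i}</id></entry>" for i in ids)
    return f'<feed xmlns="http://www.w3.org/2005/Atom">{link}{entries}</feed>'


# Three feed documents, newest first, chained by "previous" links.
PAGES = {
    "http://example.org/feed": page(["urn:e5", "urn:e4"], "http://example.org/feed/2"),
    "http://example.org/feed/2": page(["urn:e3", "urn:e2"], "http://example.org/feed/3"),
    "http://example.org/feed/3": page(["urn:e1"]),
}

if __name__ == "__main__":
    # The client last saw urn:e2, then missed a few polls.
    print(harvest("http://example.org/feed", PAGES.__getitem__, last_seen_id="urn:e2"))
```

A real client would swap `PAGES.__getitem__` for an HTTP fetch and store the newest id it has seen between runs; the point is just that nothing beyond ordinary feed parsing and link following is needed.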

One Response to “More OAI-PMH vs Atom vs Sitemaps Or “Why I’m A Bit Down On OAI-PMH””

  1. Chris Says:

    When OAI-PMH was created there were two important layers: the data providers (repositories) and the service providers. It turns out there are very many of the former but very few of the latter, and those little used. When did you last search on OAIster rather than Google? The protocol had other features such as partitions (I think) that were meant to allow specialist (e.g. subject) services to work, but critical issues were left un-standardised, so in fact this has not turned out to be useful.

    I’m rather left with the feeling that OAI-PMH was important but has in practice turned out to be less useful than we expected. Much of what we thought was needed can now be done in other (often better) ways. Not sure how much this supports your argument.

