There’s still time to get a bid together for JISC funding that’s available “to provide expert support to the Digital Repositories programme and, in particular, to the Common Repositories Interfaces Group.”

The DSpace@Cambridge service is looking for a developer.

There were a couple of comments on Andy Powell’s reply post to my post comparing OAI-PMH, Atom and sitemaps for repository harvesting that make it worth revisiting the issue (sorry I didn’t pick them up at the time, I failed to add the conversation to co.mments). Scott pointed out that having link-only feeds is useless for humans – I agree, I was thinking too narrowly about machine clients. Lars Kirchhoff asks: –

Isn’t that [an efficient harvesting API] actually what OAI-PMH is already?

So I would think it would be easier to strip down OAI-PMH for the general purpose use of web resource representation.

The way I see it, because the archives and repositories community is small by comparison with the rest of the web community, if we can have a choice between doing something the web way or using a specialised mechanism, it behoves us to do it the web way. This is the main reason OAI-PMH didn’t feature much in the original post – if it’s possible to use web standards to harvest (and it is!), we should. Scott again: –

Pragmatically, any repository owner today is going to have to do both OAI-PMH and Atom. Hardly a hardship, though, is it?

Today, probably, but will they have to in the future? The search engines would prefer Sitemaps, and are perfectly content to crawl the repository if they can’t get them. Are there essential services that wouldn’t re-engineer to use Atom/Sitemaps if those were more widely used?
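For comparison, a Sitemap is about as minimal as harvesting metadata gets – one `loc` and an optional `lastmod` per resource, per the sitemaps.org protocol (the repository URL below is invented for illustration):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <!-- Hypothetical repository item URL -->
    <loc>http://repository.example.org/item/123</loc>
    <lastmod>2007-08-24</lastmod>
  </url>
</urlset>
```

A harvester only needs to diff `lastmod` values against its last run to know what to re-fetch.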

Is it a hardship to use OAI-PMH? Well, certainly not if you’re using repository software that already has an implementation! However, I’m increasingly convinced that repositories are a long tail problem in terms of the software needed – the modal cases are extremely well supported by the current crop of IR software, but there’s an awful lot of content that needs specialist handling and curation.

Take CrystalEye as an example – our repository of crystallography data. The software running CrystalEye is heavy on domain specific logic and visualisation, and needs very few of the features offered by IR platforms. I’d still like it to interoperate with other systems so that chemistry specific aggregators can harvest it, so that our IR service could keep a dark archive of the contents and provide preservation services, so that Google can pick up and index the chemical identifiers in the text.

This is only one of any number of systems that will make up the repository landscape in the future. That being the case, interoperability will only come from adopting the smallest number of the simplest, most widely adopted standards possible.

As usual, I’ve ended up a fair distance away from where I intended to go with this post, which was the (not very new) news that this proposal to extend Atom to formalise links between feed documents (next, prev, last, first) has been promoted to “Proposed Standard” by the IESG. I’m not sufficiently familiar with the IETF process to guess what this means in terms of getting the RFC updated. The extensions would allow “sliding window” access to a feed, which means that standards compliant feeds can be used for reliable harvesting (if your client goes down, or if the polling rate is slow and it misses entries in the feed, it can obtain them by accessing “previous” feed documents).
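The sliding-window access pattern is simple enough to sketch with nothing but the standard library: parse a feed document, look for the link relation pointing at the previous page, and keep following it until you’ve caught up. The feed content and URLs below are invented for illustration, and the relation name follows the draft’s “previous”:

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# A hypothetical paged feed document – entries and URLs invented.
feed_doc = """<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Repository deposits</title>
  <link rel="self" href="http://example.org/feed"/>
  <link rel="previous" href="http://example.org/feed?page=2"/>
  <entry><id>urn:uuid:1</id><title>Newest deposit</title></entry>
</feed>"""

def previous_link(doc):
    """Return the href of the rel="previous" link, or None if this is the last page."""
    root = ET.fromstring(doc)
    for link in root.findall(ATOM + "link"):
        if link.get("rel") == "previous":
            return link.get("href")
    return None

print(previous_link(feed_doc))  # → http://example.org/feed?page=2
```

A harvester that missed a polling window just keeps fetching whatever `previous_link` returns until it reaches an entry it has already seen.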

I’ve (very belatedly) deployed a binary of Peter Corbett’s OSCAR3 release alpha 2 to the WWMM maven2 repository. Use groupId:wwmm, artifactId:oscar, version:3a2.

Caveat: the OSCAR jar bundles all its dependencies, so it might not play nicely if your project already uses any of them – including Lucene, CDK and Weka. I’m hoping to persuade Peter to let me mavenize OSCAR in the near future, which will sort this problem out.
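For reference, those coordinates translate into the following pom.xml stanza (you’d also need a `<repositories>` entry pointing at the WWMM repository itself):

```xml
<dependency>
  <groupId>wwmm</groupId>
  <artifactId>oscar</artifactId>
  <version>3a2</version>
</dependency>
```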

(sorry didn’t catch this chap’s surname)

Intro to HTML5. James went into the motivations for why we need something better than HTML4, and why it’s not XHTML2.

Was interested to see that the new elements (section, aside etc) were selected by examining CSS ids / classes in current use – unusually pragmatic for a standards process!

The video element allows for content format selection, essentially content negotiation without the negotiation. There are so many other elements where this would be handy.
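A minimal sketch of that selection mechanism using the draft’s `source` element – the browser picks the first format it can play (filenames and MIME types here are invented for illustration):

```html
<video controls>
  <!-- Listed in order of preference; the browser plays the first type it supports -->
  <source src="talk.ogg" type="video/ogg"/>
  <source src="talk.mp4" type="video/mp4"/>
  Your browser doesn’t support the video element.
</video>
```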

The amount that has already been implemented (in at least two out of three of Opera, Mozilla and Safari), or that can be made backwards compatible using JavaScript, seems impressive.

James unsurprisingly didn’t dwell on some of the aspects of HTML5 that excite me most – notably being able to use PUT and DELETE as a form method, and the use of URL patterns in form actions.
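As the form proposals stood in the drafts at the time, a delete could be expressed directly in markup rather than tunnelled through POST – something like the sketch below (the resource URL is invented, and I’ve left the URL-template part out):

```html
<!-- Hypothetical resource URL; method value as proposed in the draft -->
<form method="delete" action="http://example.org/items/42">
  <input type="submit" value="Delete item"/>
</form>
```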

Language design is sometimes expressed as a force triangle between Simplicity, Expressiveness and Power. HTML4 was simple, at the expense of everything else. HTML5 trades in some of that simplicity; mainly for expressiveness, but also for a little extra power.

It’s all good – I hope it happens!

XML Databases link

August 24, 2007

Elliotte Rusty Harold rounds up the state of the art on XML databases, concluding: –

The XML database space is not nearly as mature as the relational database space. The players are still marching onto the field. The game has not even begun. However it promises to be very exciting when it does.

Depressing, really – why has it taken this long? We’ve all been sitting in the stadium getting cold for 6 years.

Laura is chief techie for AlertMe – a startup taking another bash at home automation. The idea looks cool, and although it’s a little outside my immediate bailiwick there are a couple of overlaps: can the platform be opened up to partner service providers, and how can you manage the security and data protection issues around doing so?

Matt repeated the premise of microformats – that content authors won’t do “big” SW (by which he means RDF, SPARQL and their ilk) – extended it to scientists, and showed us the simple examples used in the Ensembl gene browser. Matt emphasised the benefits of de facto standardisation (rather than the W3C style approach taken by …).

There was a very positive discussion about GRDDL afterwards, with quite a bit of emphasis on how GRDDL allows you to disconnect the microformat markup from the semantics of the data. I’m a bit worried by this – it would mean that semantic web specialists, rather than the domain specialists, end up doing the job of standardising the data model. It would be better to keep on standardising in the microformat domain and just use GRDDL as a bridge to the RDF world. That way the data is still standard and still useful without having to cross over to RDF.

Barcamb live 1

August 24, 2007

I’m at Barcamb today, a one day (not actually under canvas, thank god) un-conference at Hinxton Hall. I came by train and bike – the journey here on the minor roads between Whittlesford station and Hinxton Hall looked fairly straightforward on gmaps, and indeed I only got lost twice. However, the google satellite obviously passed over at a time of year when the road wasn’t flooded. So I’m sitting at the back dripping gently into the carpet. Looks like a good programme – lots of variety and interest.