In an interesting post, Karen Coombs shares some of her issues in relating her library’s web site redevelopment to the need to provide web services to the rest of the university: –

If faculty could do their searches without coming to the library site would they? I think the answer is yes.

Long term I’d like a site which has a series of web services that can be exploited by my developers but also by the university web developers and who knows who else. Focusing on content rather than look and feel will allow us to provide these different types of services. It will also allow different types of users to potentially selectively access content.

I don’t think I’ve read anything like this outside a REST advocacy presentation!

Ultimately, I feel like it is these kinds of services that will make or break a library’s virtual presence, not the library website. And with a limited staff, this means I’d like to choose carefully how much time I have my small staff spend on the traditional site. Otherwise, we could spend all our time caught up in look and feel and not enough time working to make the library meet users where they are and be a seamless part of their work processes.

This doesn’t have to be a choice. Because Karen is concentrating on content, she is in a superb position to deliver the services she describes through the website, using good semantic markup, linked resources and well-tempered feeds or sitemaps, using the REST book as a manual. This is an advantage of REST I hadn’t fully grokked; it’s cheaper. If you already have a website and need to provide services to your users, it’s quicker and easier to develop the website further RESTfully than to start an entirely separate service-delivery development.

There were a couple of comments on Andy Powell’s reply post to my post comparing OAI-PMH, Atom and sitemaps for repository harvesting that make it worth revisiting the issue (sorry I didn’t pick them up at the time, I failed to add the conversation to co.mments). Scott pointed out that having link-only feeds is useless for humans – I agree, I was thinking too narrowly about machine clients. Lars Kirchhoff asks: –

Isn’t that [an efficient harvesting API] actually what OAI-PMH is already?

So I would think it would be easier to strip down OAI-PMH for the general purpose use of web resource representation.

The way I see it, because the archives and repositories community is small by comparison with the rest of the web community, if we can have a choice between doing something the web way or using a specialised mechanism, it behoves us to do it the web way. This is the main reason OAI-PMH didn’t feature much in the original post – if it’s possible to use web standards to harvest (and it is!), we should. Scott again: –

Pragmatically, any repository owner today is going to have to do both OAI-PMH and Atom. Hardly a hardship, though, is it?

Today, probably, but will they have to in the future? The search engines would prefer Sitemaps, and are perfectly content to crawl the repository if they can’t get one. Are there essential services that wouldn’t re-engineer to use Atom/Sitemaps if those were more widely used?

Is it a hardship to use OAI-PMH? Well, certainly not if you’re using repository software that already has an implementation! However, I’m increasingly convinced that repositories are a long-tail problem in terms of the software needed – the modal cases are extremely well supported by the current crop of IR platforms, but there’s an awful lot of content that needs specialist handling and curation.

Take CrystalEye as an example – our repository of crystallography data. The software running CrystalEye is heavy on domain-specific logic and visualisation, and needs very few of the features offered by IR platforms. I’d still like it to interoperate with other systems, so that chemistry-specific aggregators can harvest it, so that our IR service can keep a dark archive of the contents and provide preservation services, and so that Google can pick up and index the chemical identifiers in the text.

This is only one of any number of systems that will make up the repository landscape in the future. This being the case, interoperability will only come from adopting the smallest number of the simplest, most widely adopted standards possible.

As usual, I’ve ended up a fair distance away from where I intended to go with this post, which was the (not very new) news that this proposal to extend Atom to formalise links between feed documents (next, prev, last, first) has been promoted to “Proposed Standard” by the IESG. I’m not sufficiently familiar with the IETF process to guess what this means in terms of getting the RFC updated. The extensions would allow “sliding window” access to a feed, which means that standards-compliant feeds can be used for reliable harvesting (if your client goes down, or if the polling rate is slow and it misses entries in the feed, it can obtain them by accessing “previous” feed documents).
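
To make the “sliding window” idea concrete, here’s a minimal sketch of a client that walks back through a feed via rel="previous" links until it reaches entries it has already seen. It assumes the link relation from the proposal and lexically comparable atom:updated timestamps; the function and variable names are mine, not anything from the spec.

```python
import urllib.request
import xml.etree.ElementTree as ET

ATOM = {"atom": "http://www.w3.org/2005/Atom"}

def harvest_since(feed_url, last_seen):
    """Collect entries newer than `last_seen` by walking rel="previous" links.

    `last_seen` is the atom:updated value recorded at the end of the previous
    harvest; ISO 8601 timestamps in the same timezone compare correctly as
    plain strings, which keeps the sketch simple.
    """
    new_entries = []
    url = feed_url
    while url:
        with urllib.request.urlopen(url) as resp:
            feed = ET.fromstring(resp.read())
        caught_up = False
        for entry in feed.findall("atom:entry", ATOM):
            updated = entry.findtext("atom:updated", default="", namespaces=ATOM)
            if updated <= last_seen:
                caught_up = True      # everything beyond this point is already held
                break
            new_entries.append(entry)
        if caught_up:
            break
        # follow the "previous" feed document, if the server advertises one
        prev = feed.find("atom:link[@rel='previous']", ATOM)
        url = prev.get("href") if prev is not None else None
    return new_entries
```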

Intro to HTML5, from James (sorry, I didn’t catch his surname). He went into the motivations for why we need something better than HTML4, and why it’s not XHTML2.

I was interested to see that the new elements (section, aside, etc.) were selected by examining the CSS ids/classes in current use – unusually pragmatic for a standards process!

The video element allows for content format selection, essentially content negotiation without the negotiation. There are so many other elements where this would be handy.

The amount that has already been implemented (in at least two out of three of Opera, Mozilla and Safari), or that can be made backwards compatible using JavaScript, seems impressive.

James unsurprisingly didn’t dwell on some of the aspects of HTML5 that excite me most – notably being able to use PUT and DELETE as form methods, and the use of URL patterns in form actions.

Language design is sometimes expressed as a force triangle between Simplicity, Expressiveness and Power. HTML4 was simple, at the expense of everything else. HTML5 trades in some of that simplicity, mainly for expressiveness but also for a little extra power.

It’s all good – I hope it happens!

Laura is chief techie for AlertMe – a startup taking another bash at home automation. The idea looks cool, and although a little outside my immediate bailiwick there are a couple of overlaps:

  • can the platform be opened up to partner service providers?
  • how can you manage the security and data protection issues around doing so?

Matt repeated the premise of microformats, that content authors won’t do “big” SW (by which he means RDF, SPARQL and their ilk), extended this to scientists, and showed us the simple examples used in the Ensembl gene browser. Matt emphasised the benefits of de facto standardisation (rather than the W3C-style approach taken by microformats.org).

There was a very positive discussion about GRDDL afterwards, with quite a bit of emphasis on how GRDDL allows you to decouple the microformat markup from the semantics of the data. I’m a bit worried by this – it would mean that semantic web specialists rather than the domain specialists ended up doing the job of standardising the data model. It would be better to keep on standardising in the microformat domain and just use GRDDL as a bridge to the RDF world. That way the data is still standard and still useful without having to cross over to RDF.

Barcamb live 1

August 24, 2007

I’m at Barcamb today, a one day (not actually under canvas thank god) un-conference at Hinxton Hall. I came by train and bike – the journey here on the minor roads between Whittlesford station and Hinxton Hall looked fairly straightforward on gmaps and indeed, I only got lost twice. However, the google satellite obviously passed over at a time of year when the road wasn’t flooded. So I’m sitting at the back dripping gently into the carpet. Looks like a good program, lots of variety and interest.

No JSF here, thank-you

June 28, 2007

I occasionally post about technologies I like, less often about ones I don’t. The inimitable Koranteng has just come across JSF, and he doesn’t like it either.

A basic repository feature is providing a list of all the resources in a collection, and a way to incrementally discover changes. The usual way for repositories to enable this is OAI-PMH, using either the ListRecords or the ListIdentifiers verb, with the ‘from’ argument to perform efficient incremental updates and the resumptionToken mechanism to let the server control the load it generates.
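
For comparison, here’s a minimal sketch of that incremental harvest in Python, standard library only. The endpoint URL and dates are placeholders, and a production harvester would also need to handle protocol errors, deleted records and 503 retry-after flow control.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

OAI = {"oai": "http://www.openarchives.org/OAI/2.0/"}

def list_records(base_url, from_date=None, prefix="oai_dc"):
    """Yield records via ListRecords, following resumptionTokens until done."""
    params = {"verb": "ListRecords", "metadataPrefix": prefix}
    if from_date:
        params["from"] = from_date   # e.g. "2007-08-01" for an incremental update
    while True:
        url = base_url + "?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(url) as resp:
            doc = ET.fromstring(resp.read())
        for record in doc.findall(".//oai:record", OAI):
            yield record
        token = doc.findtext(".//oai:resumptionToken", default="", namespaces=OAI)
        if not token:
            break
        # a resumptionToken request carries only the verb and the token
        params = {"verb": "ListRecords", "resumptionToken": token}
```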

The way the rest of the world does it is with Atom or RSS. Unnecessary retrievals can be prevented using conditional GET. The server chooses the size of the feed documents, so it can control its own load. It’s even possible to avoid lost updates or list an entire collection using ‘first’, ‘last’, ‘next’ and ‘previous’ links (as in this tip). There’s no direct equivalent of PMH’s ‘from’, but as long as the feed has timestamps on each entry, the client knows when to stop retrieving more feed chunks.
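
A sketch of the conditional GET half of this, again standard library only: the client hands back the ETag and Last-Modified values it saw on the previous poll and treats a 304 as “nothing new”. The function name is mine.

```python
import urllib.error
import urllib.request

def poll_feed(url, etag=None, last_modified=None):
    """Conditional GET against a feed.

    Returns (body, etag, last_modified); body is None when the feed has not
    changed since the validators were issued.
    """
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    req = urllib.request.Request(url, headers=headers)
    try:
        with urllib.request.urlopen(req) as resp:
            return (resp.read(),
                    resp.headers.get("ETag"),
                    resp.headers.get("Last-Modified"))
    except urllib.error.HTTPError as err:
        if err.code == 304:
            return None, etag, last_modified   # unchanged since the last poll
        raise
```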

I’m currently reading the REST book, so I’m in a frenzy of resource-oriented fervour. OAI-PMH is, in the REST patois, a STREST interface (this theme was picked up in the discussion between Carl Lagoze and Andy Powell recently). The rich resource discovery possible with OAI-PMH is also overkill for what I’m after here.

I’m also unsure about syndication – I have a feeling that the resource representations in Atom / RSS feeds are unlikely to satisfy most repository clients’ needs. Isn’t a more resource-oriented approach to simply link to the resource and let the client negotiate with the resource for an appropriate representation? If so, Sitemaps fit the bill perfectly.
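
A rough sketch of what that looks like from the client side: enumerate the sitemap, then ask each resource for the representation you actually want via the Accept header. The namespace is the standard sitemap one; the function name and the choice of RDF as the wanted media type are just illustrative.

```python
import urllib.request
import xml.etree.ElementTree as ET

SM = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def harvest_sitemap(sitemap_url, accept="application/rdf+xml"):
    """List every resource in a sitemap, then content-negotiate for each one.

    A sketch: a real client would also honour <lastmod> to skip resources
    that have not changed since its last visit.
    """
    with urllib.request.urlopen(sitemap_url) as resp:
        urlset = ET.fromstring(resp.read())
    for url_el in urlset.findall("sm:url", SM):
        loc = url_el.findtext("sm:loc", namespaces=SM)
        lastmod = url_el.findtext("sm:lastmod", namespaces=SM)
        req = urllib.request.Request(loc, headers={"Accept": accept})
        with urllib.request.urlopen(req) as r:
            body = r.read()
        yield loc, lastmod, body
```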

Well, maybe, but on balance I still think that Atom / RSS is a better choice; the RESTful repository will almost certainly have a feed around for human clients, and it’s better to adapt this for machine clients than adopt an additional mechanism.

I had a brief look at DBpedia.org thanks to PMR’s excitement in the area. I was particularly interested in how they would deal with the problem of describing and linking representations.

In comments to one of PMR’s posts, Richard Cyganiak writes “Note that the DBpedia URIs also work in a web browser, so you can go to http://dbpedia.org/resource/Uppsala and the DBpedia server will generate a web page showing the information it has about the item.”. Well, they kind of work, but what actually happens is that you get redirected to http://dbpedia.org/page/Uppsala (*). DBpedia have chosen for their concept URIs not to resolve to a representation, but to redirect to other URIs that do.

I imagine the architecture options available to DBpedia for this were something like: –

1. Assign unique URLs to alternative representations
   Pro: simple for people to see the different representations.
   Con: needs a Follow-Your-Nose (FYN) linking system to reach the data, for which there is no formal standard.
2. Use a single URL for all the representations and use content negotiation to switch between them
   Pro: no confusion about which URI is the concept URI, and the URI resolves.
   Con: precious little support in browsers, limited to MIME type switching, and no way of finding out which representations are available before you make a request.
3. Use GRDDL
   Pro: formally defined profiles, a developing standard.
   Con: needs all of the resource description in the view representation.

So DBpedia went for the first option, and they provide a way for a programmatic client to link from the HTML view to the metadata; you could use something like /html/head/link[@rel='alternate' and @type='application/rdf+xml' and @title='RDF']/@href (apologies for any mistakes, my XPath-fu is not very hot). I have a vague recollection that this is a W3C-endorsed best practice, but I can’t remember the link.
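
As an illustration, a small standard-library sketch that does the same job as that XPath, pulling the rel="alternate" RDF links out of a page’s head; the class and function names are mine, and the example URL is the Uppsala page mentioned above.

```python
from html.parser import HTMLParser
import urllib.request

class AlternateLinkFinder(HTMLParser):
    """Collect <link rel="alternate" type="application/rdf+xml" href="..."> hrefs."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        a = dict(attrs)
        # simplification: a real client would also cope with multi-valued rel attributes
        if a.get("rel") == "alternate" and a.get("type") == "application/rdf+xml":
            self.hrefs.append(a.get("href"))

def rdf_links(page_url):
    with urllib.request.urlopen(page_url) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    finder = AlternateLinkFinder()
    finder.feed(html)
    return finder.hrefs

# e.g. rdf_links("http://dbpedia.org/page/Uppsala")
```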

Why not GRDDL? GRDDL is about extracting metadata from the original XML source, which means that you have to have all the metadata in the HTML. This might not be desirable if you have a lot of metadata or if it’s important to you to keep the HTML small and tight, or if you have a pressing desire not to use XHTML.

Note that you can also make the link solution work in a GRDDL world by transforming the link element to an rdfs:seeAlso statement that points to the bulk of your RDF representation, but that requires a little more sophistication on the part of the GRDDL client.
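
For what it’s worth, the statement such a transform would produce is tiny. Here’s a sketch with rdflib; the RDF data URL is hypothetical, standing in for wherever the bulk of the description actually lives.

```python
from rdflib import Graph, URIRef
from rdflib.namespace import RDFS

# the HTML page, and (hypothetically) the document holding its full RDF description
page = URIRef("http://dbpedia.org/page/Uppsala")
data = URIRef("http://dbpedia.org/data/Uppsala.rdf")   # illustrative URL only

g = Graph()
g.add((page, RDFS.seeAlso, data))
print(g.serialize(format="turtle"))
```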

Of course, both of these approaches assume that once you’re in RDF it’s ponies for everybody; you can describe all the alternative representations, list metadata and properties and describe relationships.

In conclusion: –

  • If we can assume that our description of different resource representations will be in RDF, then the link solution seems to do the trick of describing and linking representations.
  • We can make the link solution work with GRDDL, but GRDDL without rdfs:seeAlso won’t be universally applicable.
  • Is content negotiation a dead end?
  • Is there a good reason concept URIs should / should not be directly resolvable?

* In fact, web browsers get redirected. My command line client (curl) just got ditched with no content.

The best and worst thing about delving down into the lower reaches of my RSS list is that I occasionally stumble across pieces on things that were on my mind anyway. Probably important, definitely interesting, they are nonetheless a sunny path that leads into the woods as far as the morning’s productivity goes. Usually they are scanned, and then stay in the tab stack waiting for a more thorough reading until Firefox crashes a few days later.

So if you’re in need of diversion, and you were thinking about XForms, scalable web architectures and things of that nature too, don’t hesitate to ramble through this post by Koranteng, particularly taking in this piece by Mark Birbeck, and the excellent presentation by Adam Bosworth.