No JSF here, thank-you

June 28, 2007

I occasionally post about technologies I like, less often about ones I don’t. The inimitable Koranteng has just come across JSF, and he doesn’t like it either.

Peter Sefton has blogged about his progress integrating chemistry into ICE. I’m excited about this – chemical molecules are a good example of needing custom representations depending on the output context; CML for machines, CML + the JMol applet for hypermedia documents, PNG for static documents. This also has potential for us as part of SPECTRa-T, providing a rich view of chemical metadata annotations.

There’s a research associate position available working with Peter Murray-Rust and others here at the Unilever Centre. There is potential for some fantastic applied research in the role: using semantic web technologies, CML and text mining technologies to generate cutting-edge polymer informatics tools.

A basic repository feature is providing a list of all the resources in a collection, and a way to incrementally discover changes. The usual way for repos to enable this is OAI-PMH: the client uses either the ListRecords verb or the ListIdentifiers verb, the ‘from’ argument makes incremental updates efficient, and the resumptionToken system lets the server control the load generated.
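As a rough illustration of that harvesting loop, here is a minimal Python sketch. It assumes the repository supports the `oai_dc` metadata prefix, and the `fetch` callable, the function names and the example base URL are all hypothetical; the verb is spelled ListIdentifiers, as in the OAI-PMH specification.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def list_identifiers_url(base_url, from_date=None, resumption_token=None,
                         metadata_prefix="oai_dc"):
    """Build a ListIdentifiers request URL (helper name is hypothetical)."""
    if resumption_token:
        # resumptionToken is an exclusive argument: send it alone with the verb.
        params = {"verb": "ListIdentifiers", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
        if from_date:
            params["from"] = from_date
    return base_url + "?" + urlencode(params)


def parse_page(xml_text):
    """Return (identifiers, resumption token or None) for one response page."""
    root = ET.fromstring(xml_text)
    ids = [el.text.strip() for el in root.iter(OAI_NS + "identifier")]
    token_el = root.find(".//" + OAI_NS + "resumptionToken")
    token = None
    if token_el is not None and token_el.text and token_el.text.strip():
        token = token_el.text.strip()
    return ids, token


def harvest(base_url, fetch, from_date=None):
    """Collect identifiers changed since from_date, following resumption tokens.

    `fetch` is any callable that takes a URL and returns the response body as
    text, so the server decides how large each page is.
    """
    identifiers = []
    token = None
    while True:
        url = list_identifiers_url(base_url, from_date, token)
        ids, token = parse_page(fetch(url))
        identifiers.extend(ids)
        if token is None:
            return identifiers
```

Passing the fetcher in keeps the sketch independent of any particular HTTP library; a real client would also need to handle OAI-PMH error responses and retry-after delays, which are left out here.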

The way the rest of the world does it is with Atom or RSS. Unnecessary retrievals can be prevented using conditional GET. The server chooses the size of the feed documents so it can control its own load. It’s even possible to avoid lost updates or list an entire collection using ‘first’, ‘last’, ‘next’ and ‘previous’ links (as in this tip). There’s no direct equivalent of PMH’s ‘from’ but as long as the feed has timestamps on each entry, then the client knows when to stop retrieving more feed chunks.
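The feed-walking idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the feed is Atom, entries are ordered newest first, every entry carries an `updated` timestamp, and pages are chained with `rel="next"` links. The `fetch` callable and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

ATOM = "{http://www.w3.org/2005/Atom}"


def parse_time(text):
    """Parse an Atom timestamp such as 2007-06-28T12:00:00Z."""
    return datetime.fromisoformat(text.strip().replace("Z", "+00:00"))


def conditional_headers(etag=None, last_modified=None):
    """Request headers for a conditional GET, built from validators saved earlier."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def new_entries(first_url, fetch, since):
    """Return ids of entries updated after `since`, following 'next' links.

    Assumes newest-first ordering, so the walk stops at the first entry that
    is not newer than `since` and never requests older feed chunks.
    """
    found = []
    url = first_url
    while url:
        root = ET.fromstring(fetch(url))
        for entry in root.findall(ATOM + "entry"):
            if parse_time(entry.findtext(ATOM + "updated")) <= since:
                return found
            found.append(entry.findtext(ATOM + "id").strip())
        next_links = [link.get("href") for link in root.findall(ATOM + "link")
                      if link.get("rel") == "next"]
        url = next_links[0] if next_links else None
    return found
```

A client would remember the newest timestamp it has seen and pass it as `since` on the next run; the headers from `conditional_headers` let the server answer 304 Not Modified when nothing has changed at all.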

I’m currently reading the REST book, so I’m in a frenzy of resource-oriented fervour. OAI-PMH is, in the REST patois, a STREST interface (this theme was picked up in the discussion between Carl Lagoze and Andy Powell recently). The rich resource discovery possible with OAI-PMH is also overkill for what I’m after here.

I’m also unsure about syndication – I have a feeling that the resource representations in Atom / RSS feeds are unlikely to satisfy most repository clients’ needs. Isn’t a more resource-oriented approach to simply link to the resource and let the client negotiate with the resource for an appropriate representation? If so, Sitemaps fit the bill perfectly.

Well, maybe, but on balance I still think that Atom / RSS is a better choice; the RESTful repository will almost certainly have a feed around for human clients, and it’s better to adapt this for machine clients than adopt an additional mechanism.
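For comparison, the Sitemap alternative weighed up above might look something like this in Python. It is a sketch only: it compares just the date part of `lastmod` (ignoring time zones), treats entries without a `lastmod` as possibly changed, and the `chemical/x-cml` media type in the usage note is an assumption about what a chemistry repository would serve.

```python
import xml.etree.ElementTree as ET

SM = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def changed_since(sitemap_xml, since_date):
    """Return URLs whose lastmod date is on or after since_date (YYYY-MM-DD)."""
    root = ET.fromstring(sitemap_xml)
    changed = []
    for url in root.findall(SM + "url"):
        loc = url.findtext(SM + "loc")
        lastmod = url.findtext(SM + "lastmod")
        # ISO dates sort lexicographically, so a string comparison is enough here.
        if lastmod is None or lastmod.strip()[:10] >= since_date:
            changed.append(loc.strip())
    return changed


def accept_header(*media_types):
    """Build an Accept header listing the representations the client wants."""
    return {"Accept": ", ".join(media_types)}
```

The client would then request each changed URL with something like `accept_header("chemical/x-cml", "image/png")` and let the resource decide which representation to return.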

The project web sites for the SPECTRa tools are now online, and I’ll be moving the help and documentation over in due course. The main point of the SPECTRa tools is to make building repository-ready packages of X-Ray crystallography, NMR spectroscopy and Gaussian input files as easy as possible.

Once prepared, the packages can be saved to local storage for manual deposition or deposited into a DSpace repository (although this requires some customization of DSpace). Hopefully I’ll have time to write a SWORD client for it too, and I’ve been thinking about writing an S3 client for fun.

In other news: –

  • Peter Suber blogged the SPECTRa report
  • The WWMM server is finally back on decent hardware with all its data. Hopefully there won’t need to be any more outages for a while.

I’ve been moving more of SPECTRa over to Sourceforge and finding things out about the service. Something that made me sit back and have a think was the Sourceforge backup policy. In a nutshell this states that they take at least weekly backups, but won’t restore them unless there’s a catastrophe at their end. I’d say the #1 risk, with high hazard and probability, is me making a mistake. Sourceforge don’t protect you against yourself.

That’s fair enough; it’s just a bit more work to arrange backups. But I’m glad I noticed the policy (I got there from a reference in the login shell). Other colleagues I spoke to have used Sourceforge for years, were unaware of it, and don’t back up, expecting Sourceforge to do it. The only problem here was my own expectation. Sourceforge provide a generally great service, and I’ve never heard a tale of woe about data loss on Sourceforge, so I built an expectation that they take care of backup and restore.

Perhaps this effect works in favour of IRs? One of the values of an IR is in acting as a deposition, access and dissemination service (as especially espoused by OA evangelists). Another value is in the provision of good curation. The expectation is matched by the service. I think, though, that the expectation has been built in large part by the increasingly recognized brands of repo software projects such as DSpace, ePrints, Fedora et al.

I think this lies at the heart of why I felt initially uncomfortable about the idea of repositories sitting wholly within the web architecture (Andy Powell on the subject): if the IR is presented as ‘just a website’ then there’s no expectation, and you have to work to convince the user that they’re getting value. If you buy in to the web architecture vision Andy and others have been describing for IRs (as I have!), and if you agree that IRs are going to need a whole range of software to satisfy their users’ needs, then the software brand is going to matter less and less to users’ perception of the value of the IR, and that perceived value might diminish as a result.

Earlier this week I attended JISC’s Dealing with the Data Deluge conference, part of their digital repositories programme work. The presentations were good, and more importantly there were some very interesting thoughts flying around in coffee rooms, dinner halls and pubs.

One of the standout presentations for me was John MacColl’s, on the findings of the StORe project, which was investigating issues around data repositories and linking research publication repositories to data repositories. Two items in particular caught my notice.

Firstly, StORe found that whilst academia treats PhD students very differently to postdoctoral researchers, their data management, curation and reposition requirements are the same. This is interesting from my point of view on the SPECTRa-T project; it’s reassurance that SPECTRa-T will be relevant to the wider problem of chemistry publications even though our focus is on theses.

It’s also encouraging for anyone who wants the state of the art in data repositories to move forward, since this will almost inevitably require changes in the behaviour of researchers, and PhD candidates tend to be more open to change.

The second thing that particularly caught my notice was StORe’s conclusion that data curation is a difficult task which we cannot / should not burden researchers with. Additionally, it’s so specialised that the expertise probably can’t be provided at an institutional level, but could be successfully handled by a number of (perhaps peripatetic) specialist data librarians (e.g. funded by JISC).

This strikes a chord; from my early experiences with chemistry data on the DSpace@Cambridge project building the WWMM collection there, it was clear to me that a centralized institutional repository service could not hope to effectively preserve specialist scientific data. It seemed to me that preservation could only be achieved by a collaboration between people with curation expertise (librarians) and people with domain expertise on data formats and trends. Thinking on it more, I’ve decided that you can apply this not just to “specialist scientific data”, but to any data that isn’t in the usual run of office and web formats. John’s findings are a more wide-ranging statement of this, applying to all of curation, not just to preservation. It’ll be interesting to see whether and how the JISC or other funding bodies take this idea up.

As John pointed out (supported by Chris Rusbridge subsequently), this all makes the AHRC’s strange decision to cease funding for the AHDS particularly disappointing, especially since AHDS are providing a service that’s pretty close to John’s vision. Let’s hope this petition has some positive impact.