For me, Friday turned out to be the highlight of the week, and I wish I could have live-blogged it. Unfortunately I had positioned myself too far forward in the hall, where the wireless was weak.

Of the presentations that morning, two were especially relevant to the SPECTRa project, the (hopefully!) upcoming SPECTRa-Theses project and the work of the Murray-Rust group as a whole: Lee Giles talking about ChemXSeer, and the closing keynote from Tony Hey.

ChemXSeer is taking on some really important problems in chemoinformatics, most particularly the paucity of the data commons in chemistry (their specific area is environmental chemistry, but most of the issues and tools he presented looked as if they would transfer to other chemistry specialisms). ChemXSeer has tools for identifying chemical entities (compound and reaction names and so on) in journal papers, ontologies for chemistry (PMR group: is this sounding at all familiar?!), and even tools for extracting useful data from tables and figures in papers. Not simple problems to crack; it’s awesome that they’re taking them on.

When it comes to getting chemical data, the approach of the PMR group is two-pronged: extracting useful chemistry data from sources where it is badly encoded (e.g. in English), and improving the ways data is published in the first place (with the work on the uses of InChI and CML).

I asked Lee how they approached data quality (perhaps hoping they were setting up protocols for CML publishing) – he replied that, pragmatically, they found it best to extract data from the papers and then offer it back to the authors for correction and annotation, rather than set high requirements for deposition. The evening before I had been at dinner with Peter Sefton (amongst others), who shared a tip on improving quality in MS Word authored theses; his system periodically shows the author the product of converting their document into PDF / HTML. In his experience authors quickly learn not to override the structural markup with hacky font changes! This kind of feedback system would also work well with data, I think, allowing authors to work with familiar creation tools whilst encouraging them to improve the usefulness of their output.

Tony Hey’s presentation was less chemistry specific, but great. He painted an attractive vision of an Open Data future, pointing out the opportunities and challenges along the way.

It was a little strange having the virtues of openness sung in what was, in a way, a Microsoft keynote. HP are the only big tech company I’ve noticed with a visible involvement in IRs so far (obviously a skewed view, being a DSpacer by trade). Sun on the hardware side, I suppose. I still don’t really understand what MS intend to do in the IR area; where between “do some good and maybe sell some licenses on the way” and “nice sector, we’ll take it” they want to go. Tony, being a fairly recent MS acquisition, pointed out that MS wasn’t fundamentally anti-OS, just anti-copyleft. Personally I’m fine with that – I go for “‘Free’ as in ‘free’” as well, but I wonder how the ePrints guys were feeling at that point.

To close things finally the ever-charismatic Les Carr announced next year’s Open Repositories conference in Southampton. See you there!

Latest Handle jar

January 26, 2007

Handle 6.2.3 has been released. Mavenites can obtain it from the wwmm repository (group ‘handle’, artifact ‘handle’, version ‘6.2.3’).
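In a Maven 2 pom.xml those coordinates would look something like the following (this assumes you already have the wwmm repository declared in your repositories section):

```xml
<dependency>
  <groupId>handle</groupId>
  <artifactId>handle</artifactId>
  <version>6.2.3</version>
</dependency>
```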

Carl Lagoze presented the OAI-ORE project. Emphasised that ORE is not just about asset transfer – it’s about interoperability between systems that manage content, and systems that leverage (sic ;-)) managed content.

Their mantra: “Whatever we do it must be congruent with the web architecture”, as the web architecture is reasonably well developed engineering (!). Failing to do this was a failure in OAI-PMH.

Carl gave an overview of Web Architecture pointing out that representations don’t have locators / identifiers, and that there’s no standard way of defining a finite set of resources on the web. The ORE Model fixes this, formally expressing a bounded aggregation of resources and relationships – a connected sub-graph.

ORE resources are access points for service requests, and services are in three classes: Harvest, Obtain and Register. (There was an implication that ORE Harvest service would replace PMH in the future).

What now? Flesh out use cases as a tool for testing the model, review appropriate technologies, and move from model to implementation (after the May 2007 meeting).

Comment: I like the sound of ORE much better than it sounded in the Pathways days. The resource-centric bits sound good, but the devil will be in the detail of the verb set and how that works.

Julie Allinson presented the JISC Repository Deposit Service Description project. (Disclosure: I’m involved in this work and also the JISC Common Repository Interfaces Group).

The purpose of this work was to come up with a lightweight standard for deposit, across the different repository platforms and the projects on the JISC repositories program, that would provide a short-term fix until somebody (think “ORE”) came up with the right answer.

The scope of Deposit in this work was pre-ingest – it purely concerns the mechanism of getting content to a repository platform, not the ongoing process of ingest, which may involve format migration, human editorial mediation and so on.

The service itself consists of three parts – a deposit service description (which is the richest in terms of information provided), a very simple deposit protocol (HTTP POST and a few standard part names), and a receipt.
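To give a feel for how simple that deposit mechanism is, a client could be little more than a multipart HTTP POST. This is only a sketch – the part names here (“metadata”, “content”) are invented for illustration, not the names the project actually standardised:

```java
// Build a minimal multipart/form-data deposit body.
// The part names "metadata" and "content" are hypothetical placeholders.
public class DepositSketch {

    static String multipartBody(String boundary, String metadata, String content) {
        StringBuilder body = new StringBuilder();
        // First part: the descriptive metadata record
        body.append("--").append(boundary).append("\r\n")
            .append("Content-Disposition: form-data; name=\"metadata\"\r\n\r\n")
            .append(metadata).append("\r\n");
        // Second part: the content file itself
        body.append("--").append(boundary).append("\r\n")
            .append("Content-Disposition: form-data; name=\"content\"; filename=\"item.pdf\"\r\n")
            .append("Content-Type: application/pdf\r\n\r\n")
            .append(content).append("\r\n");
        // Closing boundary
        body.append("--").append(boundary).append("--\r\n");
        return body.toString();
    }

    public static void main(String[] args) {
        // POSTing something like this to the repository's deposit URL,
        // and getting a receipt back, is all the mechanism requires.
        System.out.println(multipartBody("XXXX", "<dc>...</dc>", "%PDF-..."));
    }
}
```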

Richard presented the various tools that have come from the SIMILE project. While he was demonstrating Timeline Mark Diggory leant over to me and sotto voce’d “we’re going to be looking at building this stuff into a Manakin aspect”, which will be very, very cool.

Also exciting was Richard’s demo of DWell (as in DSpace + Longwell), which I’ve blogged about before. It looks great, and the applications for curation Richard mentioned are going to be really important – faceted browsing makes it really easy to see outliers and typos in metadata.

Update: DWell demo.

First this morning MacKenzie Smith described the PLEDGE project, a collaboration between the MIT Libraries and the San Diego Supercomputer Center. PLEDGE involves encoding preservation policy in a machine-readable form (RDF against a couple of defined ontologies) to manage preservation by replicating content across a grid of computers (hence the SDSC tie-in). I find the idea really appealing, and I hope it could be the basis for preservation service description between IRs and preservation service providers.
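I haven’t seen PLEDGE’s actual ontologies, but the sort of machine-readable policy described might look something like this hypothetical Turtle sketch – every term and URI here is invented for illustration:

```turtle
@prefix pres: <http://example.org/preservation#> .
@prefix :     <http://example.org/policies/> .

# A hypothetical policy: keep three replicas of PDFs on the grid,
# audit fixity twice a year, migrate formats when they go obsolete.
:thesis-policy a pres:PreservationPolicy ;
    pres:appliesTo "application/pdf" ;
    pres:replicaCount 3 ;
    pres:auditInterval "P6M" ;
    pres:migrationStrategy pres:MigrateOnObsolescence .
```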

Next up, Joan Smith of Old Dominion University described mod_oai, an Apache module that generates an OAI feed of web content from an Apache server, with the aim of improving crawling (and hence archiving). I had two thoughts about this – firstly that much of this is overkill for the purposes of archiving; if the main problem is uncrawled content, then the Sitemaps approach is a more appropriate technology. The second was hoping that it would be straightforward to use the code / approach to write mod_sitemap, and how useful that would be!
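For comparison, the Sitemaps approach is just a static XML file listing the URLs you want crawled, e.g. (the URL below is a made-up example):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.org/papers/paper-1.pdf</loc>
    <lastmod>2007-01-20</lastmod>
    <changefreq>monthly</changefreq>
  </url>
</urlset>
```

A hypothetical mod_sitemap would just need to walk the document root and emit one `<url>` entry per file.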

Last presentation in the first session is by Miguel Ferreira from the University of Minho in Portugal, who is presenting CRiB; a distributed framework and architecture for preservation services. The framework involves analysis of repo contents by format, and a service to advise on migration strategies. Good stuff, as ever, from Miguel and the rest of the team at Minho. I’m going to have to collar him at coffee to ask him how this might fit in with the Migration on Ingest work in the DSpace@Cambridge project (dev notes).

Compare and Contrast

January 25, 2007

There were two stories on the blogosphere that really caught my eye yesterday. Both at first glance are about large corporate entities trying to FUD the public. The first is the e-fracas (etc etc etc) caused by Microsoft paying Rick Jelliffe to correct any inaccuracies on the Wikipedia pages concerning ODF and OOXML.

The second was the story that the Association of American Publishers has paid a hefty sum to a PR agency for what amounts to a slur campaign against the free information movement (via).

I’m gobsmacked by the reaction in the first case – if we leave aside the techno-religious mudslinging, the main criticism seems to be that MS were acting in an underhand fashion and their approach wasn’t transparent. This story didn’t make CNN because the wikipedians found out about it after the fact and reported it, it made CNN because Rick blogged it and several people at Microsoft confirmed it. How much more transparency is needed? The Wikipedia version of NPOV is evidently not intrinsic to the basic notion of building a trusted public commons.

I’m gobsmacked by the second story in itself. When I first read it on Peter Suber’s blog I assumed he was uncovering some misreporting and was going to conclude by commenting that making this kind of stuff up doesn’t help, but it seems to be the straight story. I felt defensive and a bit downcast at first, then I realised that this is great news. When your detractors resort to FUD you know you’re right, and you know you’re winning.

Open Repositories 2007

January 24, 2007

I’m at the Open Repositories conference in San Antonio this week, mainly to present the SPECTRa project to the DSpace Users Group. The presentation (URL to follow) went well, with some good feedback, especially from John Mark Ockerbloom suggesting a route for our embargo schema.

There are a few people blogging about the DSpace sessions, and it’s good to see positive feedback about the technical direction that came from the architecture review. MacKenzie Smith had some good news on progress towards establishing a not-for-profit organization to own and support DSpace. The ePrints and Fedora UG meetings are being held in parallel, so I’m taking as many opportunities as I can to defect and see what life’s like on the other sides.

I’m in the ePrints 3 launch session at the moment. The functionality looks good, although apparently the main point of the release is extensibility at a lower level (though I haven’t got a strong handle on what’s extensible). The interface is pretty too; looks like they’ve been busy with Dojo. Lots of collapsible sections, ajax auto-complete and web 2.0 gradients. No reflections or rounded corners though 😉

I don’t know why this didn’t hit my radar sooner, but check out DWell, which introduces RDF faceted browsing into DSpace. I haven’t had time to install it and check it out yet, but I’m pretty excited – having spent 3 years making metadata repositories for the public sector, I was blown away by Longwell when I first saw it at the 2004 user group meeting.

Maven1, JUnit4, TestNG

January 8, 2007

The long-awaited fix that enables maven2 to work with JUnit4 has finally been applied. Hopefully it will be in a snapshot near you soon.

I’m afraid I lost patience, and defected to TestNG (encouraged by Nate Sarr). The differences for simple use are pretty trivial (a different @Test annotation to mark test cases, expected exceptions handled pretty much the same, and so on). Even now, though, I’m pretty convinced by the use of Java 1.4 assert statements rather than Assert.assertFoo method calls.
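For anyone who hasn’t run into the distinction, here’s a minimal sketch of the two assertion styles side by side. The class and method names are mine, and neither JUnit nor TestNG is needed to run it – the xUnit-style assertEquals below is a hand-rolled stand-in for the real Assert.assertEquals:

```java
// Contrast xUnit-style assertion methods with Java 1.4 'assert' statements.
public class AssertionStyles {

    // Something trivial to test
    static int add(int a, int b) {
        return a + b;
    }

    // xUnit style: an explicit method call that always runs
    static void assertEquals(int expected, int actual) {
        if (expected != actual) {
            throw new AssertionError("expected " + expected + " but was " + actual);
        }
    }

    public static void main(String[] args) {
        // JUnit / TestNG style
        assertEquals(4, add(2, 2));

        // Java 1.4 style: concise, but only checked when the JVM runs with -ea
        assert add(2, 2) == 4 : "add(2, 2) should be 4";

        System.out.println("all assertions passed");
    }
}
```

The trade-off is readability versus reliability: the assert form reads more naturally, but silently does nothing unless assertions are enabled on the JVM.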

TestNG always seemed like the Betamax of unit testing, and since I haven’t used any of the whizzier features yet, perhaps that’s fair.

Rich Apodaca has written a very worthwhile 30,000′ intro to Open Source licenses. OS licensing has been a recurring theme in my life over the last few years, from attempting to persuade my former employer to open source their product, through working with DSpace and latterly here at the Unilever Centre. However much of a pain the whole subject is, it can’t be avoided – as Rich concludes “If you plan on creating or using Open Source software, learning the basic ideas behind Open Source licensing is a wise investment.”, and I’d like to pass on a few more hints to make it easier.

Remember that copyleft licenses don’t prevent commercial use of the software, but do limit your potential collaborations (you might not, for example, be able to write an integration with BSD licensed software).

If you’re the copyright holder for the software (and retain that copyright), you can alter the license at some future point. This implies that it’s worth favouring a more restrictive license initially, because a license can be more easily relaxed than tightened.

Finally, an important aspect of OS licensing is not to get too worked up about choosing a license to defend you against abuse by a commercial player. Your license is only as strong as your ability to defend it, so at the end of the day you probably have to accept that if some big evil corporate really wants to pinch your code, they will. The speed, talent and productivity of an open collaboration are the best defence.