The DSpace technical review meeting is over and I think we can say we won. But boy, it was hard work!

So what have we achieved? The full gory glory is available here and the distilled recommendations will be published more formally at some point. In the meantime, some highlights: –

  • A roadmap for migration from the JSP UI to Manakin
  • A definition of layers for a new architecture (storage API, properly abstracted interfaces to central services, all that good stuff)
  • An improved abstract data model that combines enough abstraction to support complex objects with making the entities important in repositories explicit.
  • A model for versioning (conceptually very similar to subversion’s)
  • New architecture and data model will not evolve incrementally from current architecture and data model. This isn’t starting from scratch – we have a good definition of most of the functionality required (the current implementation), and working code to borrow from.

There are, of course, areas we didn’t get round to covering and details still to be decided. The reason I’m upbeat about this week is that the recommendations we will be making have been discussed (Vigorously. At length.) by a group of great people all knowledgeable about repositories, so they’re pretty robust.

Things are starting to happen again in DSpaceland. Watch this space!

DSpace Technical Review

October 25, 2006

MIT CSAIL

We’re nearly halfway through the DSpace technical review meeting and it’s going well. I’d hoped that this process would help to resolve some of the issues that have been perennial sticking points for the DSpace architecture’s development. The meeting is being roughly minuted on the wiki, so please email the devel list or any of the participants if you don’t like anything there – soonest raised soonest fixed.

Some of the technical review group members
John Mark Ockerbloom, John Erickson, Richard Jones, Gabriela Mircea, Scott Phillips

This sounds like a really important and interesting project and is part of the European Driver project, but I’m afraid I zoned out early on because it’s all been implemented on top of a proprietary search technology. I’m more convinced than ever that ‘open’ services are no kind of replacement for OS software.

RepoMMan is a project to develop a repository for Hull University. Chris is a good, coherent speaker, but I’m still not entirely convinced by his assertion that it’s a good thing that Fedora has had no UI. There’s a lot to be said for having lots of users!

RepoMMan is fully buzzword compliant, up and down the stack with some interesting technologies; Fedora, BPEL, MVC, SOA. Will they open the source code, I wonder?

There’s also a strong emphasis on automatic extraction of metadata, which is a subject close to my heart.

Simeon works on arXiv, and is really presenting the Pathways project which is leading on to the OAI ORE activity. For anyone not in the repositories area, this is a Big Deal to us. Consequently the landscape for Simeon’s presentation is OA, and the focus is scholarly communication.

He defines interoperability as

  • Improved linking
  • Better discovery across repositories
  • Overlaid tools
  • Provenance. Citation, article creation.

The really interesting aspect of this list is the omission of content portability, which is usually one of the first aspects of interop to be mentioned.

The Pathways fabric consists of

  1. Shared Data Model
  2. Shared Serialization of Model
  3. Shared Services

I need to research further into Pathways before next week’s DSpace technical review meeting.

Pathways models persistent identifiers as: –
1. Provider (identity of repository)
2. preferredIdentifier
3. version

How will this enable a client to resolve this? Wouldn’t they need prior use of the identifier scheme the repository uses in order to resolve the identifier? Or perhaps “Resolve” should be another of the shared services?
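To make that question concrete, here is a rough sketch of what a shared “Resolve” service might look like. It is only my own illustration, not anything taken from Pathways: the class names, the registry idea and the example.org URL pattern are all assumptions. The point it shows is that a client holding only the (provider, preferredIdentifier, version) triple needs something that knows each provider’s identifier scheme.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class PathwaysIdentifier:
    """The three-part identifier described above (field names are assumed)."""
    provider: str                   # identity of the repository
    preferred_identifier: str       # identifier in the repository's own scheme
    version: Optional[str] = None   # optional version label


class ResolveService:
    """Hypothetical shared service: one resolver function per provider."""

    def __init__(self) -> None:
        self._resolvers: Dict[str, Callable[[PathwaysIdentifier], str]] = {}

    def register(self, provider: str,
                 resolver: Callable[[PathwaysIdentifier], str]) -> None:
        # Each repository contributes the knowledge of its own scheme.
        self._resolvers[provider] = resolver

    def resolve(self, identifier: PathwaysIdentifier) -> str:
        # Without a registered resolver the client has no way to proceed,
        # which is exactly the worry raised above.
        try:
            resolver = self._resolvers[identifier.provider]
        except KeyError:
            raise LookupError(
                f"no resolver registered for {identifier.provider!r}"
            ) from None
        return resolver(identifier)


def example_resolver(identifier: PathwaysIdentifier) -> str:
    # Made-up URL pattern for an imaginary repository.
    base = f"https://repository.example.org/{identifier.preferred_identifier}"
    if identifier.version is None:
        return base
    return f"{base}?version={identifier.version}"


service = ResolveService()
service.register("repository.example.org", example_resolver)
url = service.resolve(
    PathwaysIdentifier("repository.example.org", "1234/5678", "2")
)
```

With the resolver registered centrally, the client never has to understand the repository’s identifier scheme itself; it only has to know where the shared service lives.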

Simeon is emphasizing human mediated interop. I hope wholly automated interop isn’t out of scope. Correction: it isn’t – I just got the wrong end of the stick.

I’m in Glasgow today for the Open Scholarship 2006 conference, and decided to have a bash at live blogging some of the presentations this morning.

After a pretty stressful journey yesterday (curse you, Central trains!) I arrived with a whole 20 minutes to watch other people before I was on the podium giving an update on DSpace. I’ve never been happier for a pint of 80 shilling!

How to make a hamburger

October 6, 2006

PDF is a menace for a lot of informatics and preservation.

Here, Eliot Kimber explains why.

If you don’t know Maven, it’s kind of a build tool. At a very simple level you lay your projects out in a certain way, and it gives you a rich build solution with compilation, version tracking, unit testing, documentation suites and library / application deployment for free.

But Maven isn’t just a tool, it’s also a vision of best practice. Consequently it’s not in Maven’s interests to support you if you want to step out of line, or disagree with its vision of “the right thing”. So whilst it’s extensible and it’s possible to route around some of the Maven way, it’s usually more painful to bend Maven to your will than it is to give in and go with the flow.

It sounds terrible, but is this kind of “Nanny Software” really such a bad thing in the right place? At the moment in the Unilever centre we have very little standardization in project layouts and build methods. A shell script here, an ant file there. I’m keen to promote the use of Maven here partly because standardizing will make collaboration between projects much easier and should help projects retain their usefulness if they end up being mothballed for a while. It won’t hurt the standardization effort that Maven will fight back a little if people try to get creative with their build process.

So how does all this relate to DSpace?

To me, DSpace 1.0 (and 1.1 to an extent) held to the same philosophy of best practice embodied in software. The adoption of the software helped the adoption of standard practice, but you hit the sides pretty quickly if you wanted to do anything outside HP/MIT’s idea of best practice. This has all changed as the development community has grown. Flexibility has become the watch-word.

The upcoming technical review is a great opportunity to at least discuss some fundamentally important decisions: Can and should DSpace have the flexibility to be all things to all people (like the Eclipse RCP)? Alternatively, should DSpace be primarily a best practice vehicle? If so, can the DSpace Federation come up with a single expression of best practice?

Economics Open Data

October 5, 2006

It seems that whilst I was in Maastricht yesterday talking to economists about chemistry open data, PMR was in Washington listening to an economist.

It’s always rewarding to cross area boundaries, although as a mechanical engineer turned software bod and working in chemoinformatics, I seem to get more than my fair share. Nonetheless, I found yesterday’s workshop even more rewarding than I had expected.

Economics makes an interesting contrast to chemistry, in that it has a strong preprint culture and Open Access seems to be far better accepted than in chemistry. On the other hand, paying for access to primary research data is accepted as a part of life – although sharing datasets is on their radar I don’t think loss of data and replication of work is as much of a pain point for economics as it is for chemistry.

The Nereus consortium has used DSpace to federate metadata across their institutions. However, some of the institutions who are furthest along with their repository programmes are planning, or talking about planning, to drop DSpace in favour of Fedora. Obviously I don’t take these decisions personally but it behoves me to understand why.

Part of the reason is functionality DSpace 1 lacks: top shelf metadata support, complex objects and versioning were all mentioned. I’m hopeful that the technical review meeting at the end of October will go a long way towards starting to fix those issues.

There was another issue raised, though, by Peter van Huisstede of Erasmus Universiteit Rotterdam, who observed that DSpace is built to do things a certain way, and it’s often hard to do things another way. Which got me thinking about whether this was a good or bad thing.