Web feeds and repositories

December 10, 2008

I was invited to give a presentation on RSS and Atom as part of a SUETr workshop on interoperability yesterday. Of course I didn’t even scratch the surface of what can be achieved with feeds in terms of mash-ups, third-party sites and visualisations – but I did try to get across the breadth of ‘repository’ problems feeds can address, and the importance of feeds as easy wins that add value to your repository efforts (a theme courtesy of Les Carr on his blog).
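To make that concrete: here’s roughly what it takes to expose, say, a repository’s recent deposits as a minimal Atom feed using nothing but Python’s standard library. The titles, IDs and URLs below are invented for illustration, and a real feed would also want author and summary elements.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
ET.register_namespace("", ATOM_NS)  # serialise with Atom as the default namespace

def atom_feed(title, feed_id, entries):
    """Build a minimal Atom (RFC 4287) feed; entries are (id, title, href) tuples."""
    feed = ET.Element(f"{{{ATOM_NS}}}feed")
    ET.SubElement(feed, f"{{{ATOM_NS}}}title").text = title
    ET.SubElement(feed, f"{{{ATOM_NS}}}id").text = feed_id
    ET.SubElement(feed, f"{{{ATOM_NS}}}updated").text = "2008-12-10T00:00:00Z"
    for entry_id, entry_title, href in entries:
        entry = ET.SubElement(feed, f"{{{ATOM_NS}}}entry")
        ET.SubElement(entry, f"{{{ATOM_NS}}}id").text = entry_id
        ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = entry_title
        ET.SubElement(entry, f"{{{ATOM_NS}}}updated").text = "2008-12-10T00:00:00Z"
        ET.SubElement(entry, f"{{{ATOM_NS}}}link", rel="alternate", href=href)
    return ET.tostring(feed, encoding="unicode")

xml = atom_feed(
    "Recent deposits",
    "http://repo.example.org/feed",
    [("http://repo.example.org/eprints/567", "A new eprint",
      "http://repo.example.org/eprints/567.pdf")],
)
```

Twenty-odd lines, and anything from a feed reader to Yahoo Pipes can consume the result – which is rather the point about easy wins.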

The slides can be downloaded from either of these places: –

Last Thursday I attended a JISC workshop on repository architectures. It was a thought provoking day, and I learned a lot. Firstly I learnt that I need to pay more attention to context when quoting people on twitter (sorry, @paulwalk).

Paul Walk kicked off the day by presenting his work on a next-generation architecture for repositories. His presentation opened with a set of starting principles and moved on to some diagrams illustrating a specific architecture based on them. As Paul mentions in that blog post, his diagrams and the principles behind them were “robustly challenged”. As far as I remember, the diagrams were challenged more robustly than the principles.

To cut a longish story short, the discussion and the workshop exercises brought up some interesting ideas, but did relatively little either to validate Paul’s architecture diagrams or to provide a working alternative. Chatting about it over lunch and later over a pint, I was persuaded that we were looking for an abstraction that doesn’t exist, and that the desire for a single generic repository architecture might have led us down the garden path.

Software engineering, being a field that values pragmatic epistemology, has a couple of empirically derived laws that might help to explain why. Firstly, Larry Tesler’s Law of the Conservation of Complexity states (in a nutshell) that complexity can’t be removed, just moved around. A natural way to manage this complexity is to find abstractions that hide some of it. This, fundamentally, is what the repositories architecture is trying to do – reduce the multiplicity of interests, politics and data-borne activities of HE into a single abstract architecture.

A second empirical law, The Law of Leaky Abstractions, states that all non-trivial abstractions leak. Some of the complexity cannot be hidden behind the abstraction and leaks through. It feels to me that this is what’s happening with repositories at the moment. Our abstraction (centralization, services provided at point of storage etc) fails to cope with real, current complexities. The problem itself is extremely complex, and if anyone really has their head around it, they’ve still got the hard task of communicating it to the whole community so a good shared abstraction can be developed.

I found myself going back to Paul’s starting principles, and concluded that they were a much more constructive framework for thinking about repository issues than the concrete architectures in the diagrams. Paraphrasing the principles: –

  • Move necessary activity to the point of incentive
  • [Terms of reference for IRs]
  • Pass by reference, not by copy
  • Move complexity towards the point of specialisation
  • Expect and accept increasing complexity on the local side of the repository with more sophisticated workflow integration

With the exception of the point on IRs, they are all forms of guidance on complexity, either where to move it (“Move [metadata] complexity towards the point of specialisation”), or which trade-offs to make (“Pass by reference, not by copy” => “Prefer to deal with the complexities of references than the complexities of duplication”). The reason I like this approach is that different disciplines, institutions and activities (e.g. REF, publication, administration) all have different complexities and different drivers. Perhaps we need a number of different architecture abstractions based on constraints and drivers. Perhaps the idea of an architecture abstraction is premature in this community and we should focus on local solutions (in the sense of ‘minima’ rather than geography). This needn’t end in technical balkanization; the repositories architecture is driven by business models, and focusing on interoperability and the web architecture allows more of the technical discussion to happen in parallel.

To get the ball rolling, I’d like to add a caveat to Paul’s “Move [metadata] complexity towards the point of specialisation”: “… unless it’s there already and it’s harder to recreate than maintain”. Any more?


Andrew McGregor has posted extensive minutes and notes from the meeting.

A successful codebash (via) as part of the ICE RS project got a load of useful vertical integration done. Hopefully we’ll be seeing something similar in the CRIG space soon!

Roundup 14th Dec

December 14, 2007

It’s been a tough week for repositories (especially institutional ones), with some of the criticism specific (David Flanders, Dorothea Salo) and some a little more general (“centralization is a bug”). But it’s not all doom and gloom; Paul Walk clears up being taken out of context, and Peter Murray-Rust announces the Microsoft e-Chemistry project, which I’m optimistic will make big advances in practical repository interop.

On a complete tangent, I discovered Mark Nottingham’s blog and read with delight about his HTTP caching extensions. My relationship with HTTP has been like one of those infernal teen movies where the in-crowd-but-with-a-conscience kid finally dispels the fog of peer pressure and preconception, and invariably finds the class dork to be the member-of-opposite-sex of dreams. Not that an in-crowd kid could ever utter the words “my relationship with HTTP”.
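If you’re similarly late to the HTTP party, the gateway drug is probably conditional GET: a client sends the validators from its cached copy, and the server can answer 304 Not Modified instead of resending the whole body. A standard-library sketch – the URL and ETag here are made up:

```python
import urllib.request

def build_conditional_get(url, etag=None, last_modified=None):
    """Build a GET request carrying cache validators, so the origin server
    can reply 304 Not Modified rather than resending the full response."""
    req = urllib.request.Request(url)
    if etag:
        req.add_header("If-None-Match", etag)
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)
    return req

req = build_conditional_get("http://example.org/feed.atom", etag='"abc123"')
# urllib.request.urlopen(req) raises an HTTPError with code 304
# if the cached copy is still current
```

Given how often repository clients poll feeds and interfaces, it’s surprising how rarely this machinery gets used.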

One thing and another meant I was unable to blog final thoughts and summaries about the CRIG unconference that I attended last week, so this is rather long, being a combination of the post I would have written on Friday afternoon and a couple of subsequent thoughts.

Firstly, on unconferencing, or at least the way it was implemented for CRIG; I like it a lot. Since we were partly a talking shop and partly a workshop to refine interoperability scenarios and challenges, the main session worked essentially like a lightweight breakout session system – topics were assigned to whiteboards, and people chose topics, migrated, discussed and withdrew as they wished. It was leagues more interesting and more productive than being assigned a breakout group with a topic. Successive rounds of dotmocracy helped to sort out the zeitgeist from the soapboxes. I could see this format working extremely well for e.g. the Cambridge BarCamp, or the e-Science All Hands meeting.

This was the first face to face meeting of the CRIG members as CRIG members, and really helped to frame the agenda for CRIG. I realized that there are some big issues underlying repositories that only become really important when discussing interoperability. For example, I can see OAI-ORE creating the same kind of fun and games around pass-by-ref, pass-by-val that the community currently enjoys when discussing identifiers, and just like identifiers, it touches just about every scenario.

One message came out pretty strongly: the emphasis on repositories isn’t useful in itself. One of the topics for discussion that passed dotmocracy (i.e. was voted as something people wanted to talk about) was “Are repositories an evolutionary dead end?”, a theme picked up by David Flanders. Well, I personally don’t think so, but then I’ve probably got a more malleable definition of “the R word” (as Andy Powell puts it) than most. If I’ve read the mood correctly, people are beginning to regard centralized, data-storing, single-software, build-it-and-they-will-come IRs as a solution looking for a problem. Some regard repositories as a complete diversion; others think we should act on our improved understanding of the problems in academic content management and dissemination by acknowledging failed experiments and moving on quickly. Nobody gave me the impression that they thought the current approaches would work given a couple more years, more effort or more funding.

This has all been said before; when the conference was over, I reminded myself of Cliff Lynch’s 2003 definition of institutional repositories, which describes them in terms of services, collaboration, distributed operation and commitment. If you haven’t read it, or haven’t read it in a while, go back and take a look; it’s in the fifth paragraph.

Whilst it’s only a view of how things should be, I think it’s a good view, and it neatly sums up what’s important about repository interoperability – it’s about the interaction between systems needed to achieve a repository.

Discussions this morning: –

What are we GETting? How to answer questions like “where’s the license for this resource”, “where’s the thumbnail of this large image”. There was talk of content negotiation services (e.g. http://openurl.myrepo.com/open?resourceid=567&action=license). The alternative (which I strongly favour) is to use descriptive links (e.g. link rel).
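As a rough illustration of the descriptive-links approach: a resource’s splash page can advertise its license and alternative formats with ordinary link elements, which any client can pick out with Python’s standard-library HTML parser. The URLs below are invented for the example.

```python
from html.parser import HTMLParser

class RelLinkExtractor(HTMLParser):
    """Collect the targets of <link rel="..."> elements from an HTML page."""
    def __init__(self):
        super().__init__()
        self.links = {}

    def handle_starttag(self, tag, attrs):
        if tag == "link":
            a = dict(attrs)
            if "rel" in a and "href" in a:
                self.links.setdefault(a["rel"], []).append(a["href"])

page = """<html><head>
<link rel="license" href="http://creativecommons.org/licenses/by/2.0/uk/">
<link rel="alternate" type="application/pdf" href="/eprints/567.pdf">
</head><body>splash page</body></html>"""

parser = RelLinkExtractor()
parser.feed(page)
```

The attraction over a negotiation service is that the links travel with the resource itself – no extra endpoint to discover, document or keep running.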

Problems and opportunities in utility computing (using EC2 / S3 etc.). The problems are most often extremely prosaic – persuading the institution to provide a credit card with an unknown spend, for instance. Probably the best idea that came out was to use utility computing for a personal repository – your institution covers the costs and adds its branding while you work with them, and you can take your personal repo with you easily when you move between institutions.

Multiple submission (e.g. to IR + subject repo + RAE tracker etc.). As users, we’d like a single submission system for all these systems, e.g. put a presentation in ‘the system’ and have it propagated to Slideshare + IR. As an observation, there are huge issues in pass by val (cf. packaging) / pass by ref (cf. ORE) that are not going to be resolved soon (perhaps ever).
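To make the pass-by-ref / pass-by-val tension concrete, here’s a toy sketch of the dispatch step a single-submission system would need. The target names and their modes are entirely hypothetical – the point is only that every target forces a choice between handing over a link and handing over the bits.

```python
# Hypothetical target registry: each system either accepts a pointer to the
# canonical copy (pass by reference) or needs its own copy (pass by value).
TARGETS = {
    "subject_repo": "by_ref",
    "slideshare":   "by_val",
    "rae_tracker":  "by_ref",
}

def deposit_plan(title, file_url, targets=TARGETS):
    """Expand one submission into per-target deposit tasks."""
    plan = []
    for name, mode in targets.items():
        if mode == "by_ref":
            # Hand over a reference; the target trusts the canonical copy
            plan.append((name, {"title": title, "link": file_url}))
        else:
            # Pass by value: the dispatcher must fetch and re-upload the bits
            plan.append((name, {"title": title, "upload": file_url}))
    return plan

tasks = deposit_plan("My talk", "http://ir.example.ac.uk/123/slides.pdf")
```

Even in a sketch this small, the by-val branch drags in all the packaging questions, and the by-ref branch drags in all the identifier and persistence questions – which is rather the problem.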

P.S. can you guess the theme of the post titles?

[In previous episodes]: I’m at a JISC Common Repository Interfaces Group (CRIG) unconference event. This is novel to most of the participants, so we’re learning about the format as well as discussing the issues. The first day consisted of introductions and some brainstorming-type exercises designed to bring out issues to take forward into the unconference.

Last night’s networking event was essentially a continuation of the discussion during the day. The change of location was useful, though; a lot of the early conversation was of the form “I’m amazed we didn’t talk about …”. The great thing about the unconference format is that we can fix those problems easily, rather than going home from the conference thinking that it was all very interesting, but didn’t really tackle burning issues.

One of the random thoughts that came up last night: communication between people involves semantic loss. To put it another way, you have a set of meanings you associate with the word “communication”, and so do I. They’re unlikely to be the same, but here we are, you reading (probably wondering why at the moment) and me writing. This isn’t a problem, because we naturally know that this is happening, and have ways of avoiding (rather than preventing) problems – like redundancy (e.g. “To put it another way…”). Perhaps the starting point for any interoperability should be about “good enough” and redundancy should be encouraged?

The first session of the unconference turned out to be a kind of brainstorm to extract pertinent issues from the mindmaps generated through the preparatory chats.

The next step is a round of ‘dotmocracy’, which is a way of getting a bit of consensus on which of these issues people are interested in.

The last chat I was part of brought up the question of why we should bother with digital preservation. The argument against it usually goes that if people find resources useful they will preserve them anyway. I personally think a kind of public-interest argument applies: the current value of a resource is often lower than its future value, so intervention is needed to protect that future value.

On reflection, though, it’s not the issue we should be discussing at an interoperability meeting. What we should be thinking about is “If someone wanted to preserve the resources in a repository, what interfaces / services would they need to be provided with?”

(I’m blogging this now because I don’t expect preservation to make the cut after dotmocracy).

CRIG Live un-blogging

December 6, 2007

I’m at the JISC CRIG (Common Repository Interfaces Group) two-day Unconference today and tomorrow. We’re about to start the unconference proper.

This will all make sense very soon. Just follow me blindly for now.

David Flanders.

Well, here goes…

CRIG Podcasts

November 30, 2007

Near the start of November, I was involved in a series of chats organized by the JISC CRIG support project, aiming to serve as an introduction to various aspects of repository interoperability and to look at possible areas for standardisation, and areas that might benefit from further research. The chats were in the form of conference calls, which were recorded and made into podcasts. They’re now available.

In the GET and PUT chat, Richard and I resurrect a long-running discussion we have IRL about granularity and various aspects of resource description; amongst other things, the potential impact of OAI-ORE and SWORD is discussed. The search chat, led by Martin Morrey of Intrallect, was very informative: it has a bit of background on Z39.50 and the birth of SRW/U, which happened before I was involved with repositories. The last chat I was involved in was the Identity chat, the main part of which was postponed, but as it stands it’s a helpful introduction to the FAR project. The full chat went ahead yesterday, and was a good discussion of lots of good stuff around federated access management, identity management and so on. The audio from that chat will be available in due course.