Web feeds and repositories

December 10, 2008

I was invited to give a presentation on RSS and Atom as part of a SUETr workshop on interoperability yesterday. Of course I didn’t even scratch the surface of what can be achieved with feeds in terms of mash-ups, third-party sites and visualisations – but I did try to get across the breadth of ‘repository’ problems feeds can address, and the importance of feeds as easy wins that add value to your repository efforts (a theme courtesy of Les Carr on his blog).
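To give a flavour of the sort of easy win I mean, here’s a minimal sketch of exposing a repository’s recent deposits as an Atom feed, using nothing but the Python standard library. The item data and URLs are made up for illustration; a real feed would be generated from the repository’s database.

    # Minimal Atom feed of recent deposits (illustrative data throughout).
    import xml.etree.ElementTree as ET

    ATOM = 'http://www.w3.org/2005/Atom'
    ET.register_namespace('', ATOM)

    def element(parent, tag, text=None, **attrs):
        el = ET.SubElement(parent, '{%s}%s' % (ATOM, tag), attrs)
        el.text = text
        return el

    feed = ET.Element('{%s}feed' % ATOM)
    element(feed, 'title', 'Recent deposits')
    element(feed, 'id', 'http://repo.example.org/feed')         # hypothetical URL
    element(feed, 'updated', '2008-12-10T00:00:00Z')

    entry = element(feed, 'entry')                              # one per deposit
    element(entry, 'title', 'Crystal structure 12345')          # made-up item
    element(entry, 'id', 'http://repo.example.org/item/12345')
    element(entry, 'updated', '2008-12-09T12:00:00Z')
    element(entry, 'link', href='http://repo.example.org/item/12345')

    print(ET.tostring(feed, encoding='unicode'))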

The slides can be downloaded from either of these places: –

[I’ve only just found a good mechanism and time to listen to podcasts, so this is a little after the event, still worthwhile I hope.]

Earlier this month Richard Wallis of Talis interviewed JP Rangaswami at BT, and posted a podcast of the conversation. Sterling stuff – I thoroughly recommend listening to it in full. I’ve pulled out some of the bits as quotes here.

If you work in a very vendor dominated world you can abdicate responsibility for
a lot of what you do by transferring not just the risk, but the reward to the
vendor. That doesn’t scale any more.

If a problem is generic, look to the open source community to solve it. If it’s
a narrow market for the problem … then look to the commercial environment to
solve it. If it is unique to your enterprise, you’d better solve it yourself,
because no-one else is going to solve it for you.

We’ve lived through a whole generation of mistakes when we had proprietary
architectures for the way we had information in enterprises. First you paid
money to completely drown the information in concrete, then you paid money to
dig it out to move it somewhere else. That’s what enterprise application
integration looked like, spending money sticking it into
somebody’s silo then spending even more money taking it out of silos. Instead of
exposing data you were excavating data, and paying for the privilege of your own
data. That is the danger we face if we don’t get issues to do with identity,
with authentication and permissioning, with intellectual property rights correct
in this generation. Because we will end up repeatedly wasting money digging out
stuff that should have been made available much more cheaply because the costs
of reproduction and transmission are going down.

Wishing I could self-replicate and get to Online Information (as well as going to the DCC Conference) to hear JP speak there!

Excited about OpenID

November 14, 2008

This week, I’m excited about OpenID, and this blog post contains some disconnected ideas that illustrate why. If you’re not into techie stuff but still want to know why I think OpenID is very cool for repositories, feel free to skip down to “Embargo Management”.

Same User ID On Distributed Services

I want to host a code project on googlecode, with a trac (simple project management for software-focused projects) installation on one of our servers here. I’d use an Apache mod or a trac plugin as the relying party implementation, and the Google OpenID server to keep the user ids for issue tickets in trac consistent with the svn commits.
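A minimal sketch of the relying party side, using the JanRain python-openid library (the `openid` package). The trac URLs here are hypothetical, and session handling is reduced to a plain dict; a real deployment would use the web framework’s session and a persistent store.

    # Relying party flow with python-openid: redirect out, verify on return.
    from openid.consumer import consumer
    from openid.store.memstore import MemoryStore

    store = MemoryStore()   # use a persistent store (file/db) in production
    session = {}            # normally the framework's per-user session

    def begin_login(user_supplied_id):
        """Start authentication; returns the provider URL to redirect to."""
        c = consumer.Consumer(session, store)
        auth_request = c.begin(user_supplied_id)  # discovery on the claimed id
        return auth_request.redirectURL(
            'http://trac.example.org/',               # hypothetical trust root
            'http://trac.example.org/openid/return')  # hypothetical return URL

    def complete_login(query_args, current_url):
        """Handle the redirect back; returns the verified identity URL."""
        c = consumer.Consumer(session, store)
        response = c.complete(query_args, current_url)
        if response.status == consumer.SUCCESS:
            return response.identity_url  # stable id to line up with svn commits
        return None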

I’d like to do the same with the svn provision at Sourceforge, so I hope they’re thinking of implementing an OpenID server as well as a relying party.

Unifying User Access

At the moment, for many of our services, I have to manage user accounts and passwords for our external collaborators (which is a drag). Internal users aren’t a problem, since the University runs a Kerberos-style single sign-on service of its own devising (Raven), but I have to configure dual-mode authentication for each new service we offer (which is an even bigger drag).

Ben Harris of the University Computing Service has implemented an unofficial service (CorvID) that offers OpenID server functionality, using the University’s single sign-on service for authentication.

Putting this together, OpenID could potentially unify access to our services for all our users.

Embargo Management

Most exciting from a repository point of view is the potential OpenID has when applied to embargo management, which we’re thinking about in the ICE-TheOREM project. The scenario goes something like this: – a PhD candidate has several chapters in their thesis they think would make really great manuscripts, and they wish to embargo them until the manuscripts are written. And so they apply a hard embargo (i.e. “Don’t release until I say so”) to those chapters in their repository, intending to write the papers the next month. Then they get a job in the city (or perhaps, in the current climate, become a plumber), their good intentions are picked up by the infernal road builders, and their university ids and e-mail addresses are meticulously removed.

Some months later the manager of the repository is doing a periodic embargo review and wants to release the embargo on this deserted content. Problem 1: how does s/he get in touch with the author? Problem 2: once s/he has, how can the system be sure that it really is the author? I think we’ve got a potential solution for this using OpenID, and we’re hopefully going to implement a demonstrator in ICE-TheOREM. In a nutshell: the author sets up the embargo management with an OpenID they control (e.g. http://joe.bloggs.name/), delegating to the Uni server. When the author leaves the Uni they modify their OpenID to delegate to a different server (Google, myopenid, whatever) and also update their e-mail details (maybe using FOAF in RDFa). If they do, then the repo always has a way to get in touch with the author, and can also authenticate them.
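The delegation trick is worth spelling out, because it’s what makes the scheme survive the author leaving: the page at the author’s own URL simply points at whichever provider they currently use, via two <link> elements (openid.server and openid.delegate in OpenID 1.1; openid2.provider and openid2.local_id in 2.0). A consumer discovers them with nothing fancier than an HTML parser. A sketch, with a made-up profile page:

    # How a consumer discovers OpenID delegation links in a profile page.
    from html.parser import HTMLParser

    PROFILE_PAGE = """
    <html><head>
      <link rel="openid.server" href="https://www.myopenid.com/server">
      <link rel="openid.delegate" href="https://joebloggs.myopenid.com/">
    </head><body>Joe Bloggs</body></html>
    """   # swapping providers means editing only these two links

    class OpenIDLinkFinder(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = {}

        def handle_starttag(self, tag, attrs):
            if tag == 'link':
                a = dict(attrs)
                if a.get('rel', '').startswith('openid'):
                    self.links[a['rel']] = a.get('href')

    finder = OpenIDLinkFinder()
    finder.feed(PROFILE_PAGE)
    print(finder.links)
    # {'openid.server': 'https://...', 'openid.delegate': 'https://...'}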

When you strip this bare, all that’s going on is the consistent use of URL references to identify and authenticate people across systems, plus a layer of indirection through the OpenID delegation system. References. Indirection. Simple tools, but they solve a real problem simply.

Happy Idiot Talk

Of course, there are many reasons this won’t happen. There’s many an interop- slip twixt -ability and -ation. As far as I know, none of the major repo platforms has an OpenID relying party implementation in stable release yet (although I’m sure they’ve all talked about it, and before you’ve finished this post, Ben O’Steen will have it implemented in Fedora). HE institutions committed to Shibboleth might be resistant to the idea of supporting OpenID. And market research shows that user adoption of OpenID is largely restricted to geeks, seemingly because of the user experience.

Still, it’s exciting to find such a neat theoretical solution to a real problem!

Another SWORD draft

September 19, 2008

Another draft of the SWORD 1.3 spec is out. Mainly small revisions, with added explanation where needed. I think this is likely to be the working revision for the SWORD2 project.
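For anyone who hasn’t seen SWORD on the wire: a deposit is just an HTTP POST of a package to a collection URL, with the SWORD semantics carried in headers. A hedged sketch follows; the collection URL is hypothetical, authentication (usually HTTP Basic) is omitted, and the X-Packaging value shown is one of the packaging URIs used with the spec.

    # Sketch of a SWORD 1.3 deposit: POST a zip package to a collection URL.
    import urllib.request

    COLLECTION = 'http://repo.example.org/sword/collection-1'   # hypothetical

    with open('thesis-package.zip', 'rb') as f:
        body = f.read()

    req = urllib.request.Request(COLLECTION, data=body, method='POST')
    req.add_header('Content-Type', 'application/zip')
    req.add_header('Content-Disposition', 'filename=thesis-package.zip')
    req.add_header('X-Packaging', 'http://purl.org/net/sword-types/METSDSpaceSIP')
    req.add_header('X-No-Op', 'true')   # dry run: server validates, doesn't ingest

    with urllib.request.urlopen(req) as resp:
        print(resp.status, resp.read()[:200])   # Atom entry describing the deposit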

Last Thursday I attended a JISC workshop on repository architectures. It was a thought-provoking day, and I learned a lot. Firstly, I learned that I need to pay more attention to context when quoting people on twitter (sorry, @paulwalk).

Paul Walk kicked off the day by presenting his work on a next-generation architecture for repositories. His presentation opened with a number of starting principles and moved on to some diagrams illustrating a specific architecture based on them. As Paul mentions in that blog post, his diagrams and the principles behind them were “robustly challenged”. As far as I remember, the diagrams were challenged more robustly than the principles.

To cut a longish story short, the discussion and the workshop exercises brought up some interesting ideas, but did relatively little to either validate Paul’s architecture diagrams, or to provide a working alternative. Chatting about it over lunch and later over a pint, I was persuaded that we were looking for an abstraction that doesn’t exist, and that the desire for a single generic repository architecture might have led us down the garden path.

Software engineering, being a field that values pragmatic epistemology, has a couple of empirically derived laws that might help to explain why. Firstly, Larry Tesler’s law of the Conservation of Complexity states (in a nutshell) that complexity can’t be removed, just moved around. A natural way to manage this complexity is to find abstractions that hide some of it. This, fundamentally, is what the repositories architecture is trying to do – reduce the multiplicity of interests, politics and data-borne activities of HE into a single abstract architecture.

A second empirical law, The Law of Leaky Abstractions, states that all non-trivial abstractions leak. Some of the complexity cannot be hidden behind the abstraction and leaks through. It feels to me that this is what’s happening with repositories at the moment. Our abstraction (centralization, services provided at point of storage etc) fails to cope with real, current complexities. The problem itself is extremely complex, and if anyone really has their head around it, they’ve still got the hard task of communicating it to the whole community so a good shared abstraction can be developed.

I found myself going back to Paul’s starting principles, and concluded that they were a much more constructive framework for thinking about repository issues than the concrete architectures in the diagrams. Paraphrasing the principles: –

  • Move necessary activity to the point of incentive
  • [Terms of reference for IRs]
  • Pass by reference, not by copy
  • Move complexity towards the point of specialisation
  • Expect and accept increasing complexity on the local side of the repository with more sophisticated workflow integration.

With the exception of the point on IRs, they are all forms of guidance on complexity, either where to move it (“Move [metadata] complexity towards the point of specialisation”), or which trade-offs to make (“Pass by reference, not by copy” => “Prefer to deal with the complexities of references than the complexities of duplication”). The reason I like this approach is that different disciplines, institutions and activities (e.g. REF, publication, administration) all have different complexities and different drivers. Perhaps we need a number of different architecture abstractions based on constraints and drivers. Perhaps the idea of an architecture abstraction is premature in this community and we should focus on local solutions (in the sense of ‘minima’ rather than geography). This needn’t end in technical balkanization; the repositories architecture is driven by business models, and focusing on interoperability and the web architecture allows more of the technical discussion to happen in parallel.

To get the ball rolling, I’d like to add a caveat to Paul’s “Move [metadata] complexity towards the point of specialisation”: “… unless it’s there already and it’s harder to recreate than maintain”. Any more?

Update

Andrew McGregor has posted extensive minutes and notes from the meeting.

Stuart Lewis on using DSpace’s stackable authentication to protect a SWORD interface with Shibboleth.

Good news from Michele Kimpton (DSpace) and Sandy Payette (Fedora). Excerpt: –

Over the last few weeks, we (Michele Kimpton and Sandy Payette) have been discussing the possibilities of our organizations collaborating. …

Thus far, all of the stakeholders we have had the opportunity to talk with have been extremely supportive and excited about the possibility of the Fedora and DSpace communities working together in some capacity.

(full e-mail at end of this post)

As a general principle, it’s great to see a bit more harmony in the OS space rather than increasing balkanization. Exciting as it is, the idea of Fedora + DSpace is not new; it’s been a perennial topic for repository pub chat for a couple of years. At Open Repositories 2008 I revived the idea with a couple of the other DSpace committers, looking for a bit of lively debate, and found none; they had evidently been thinking along the same lines. Mark Diggory, Richard Rodgers and Andrius Blažinskas are looking at Fedora integration through Google’s Summer of Code program.

I’m going to set the inter-community aspects of the collaboration aside for a moment, and think about how a combined Fedora / DSpace software stack might look. I’ve got four top priorities I’d like to see DSpace address: –

  • Plugin-based architecture à la Eclipse to ease customisation and maintenance of the core application
  • Data model improvement to support important features such as revisioning / versioning, per-file metadata etc
  • UI improvements
  • Interoperability through provision / consumption of RESTful web services

Starting with the last point: the Fedora and DSpace communities are already heavily involved in the development and adoption of interop standards, and I’ve no reason to believe that a tie-up would change that. There may be some efficiencies to be had, but they’re not obvious to me right now.

As far as I know, Fedora doesn’t have a plugin architecture that could be used, but using Fedora doesn’t make implementing a plugin architecture any harder.

I’m guessing that Fedora would most likely be used as a back end to DSpace, possibly accessed through the Fedora REST service interface. Since Fedora handles metadata well (using RDF), a Fedora back end would provide more functionality than the current DSpace storage abstractions (e.g. SRB). Hopefully this will allow the DSpace development community to implement functionality enhancements quickly, and focus on Manakin and other UI improvements.
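To make that concrete, here’s what reading from such a back end might look like over HTTP, assuming Fedora 3-style REST paths (/objects/{pid}/datastreams/{dsid}/content); the host, pid and datastream id are all placeholders.

    # Fetch a datastream from Fedora over its REST interface (paths assumed
    # from the Fedora 3 REST API; host/pid/dsid are placeholders).
    import urllib.request

    FEDORA = 'http://localhost:8080/fedora'
    pid, dsid = 'demo:1', 'DC'

    url = '%s/objects/%s/datastreams/%s/content' % (FEDORA, pid, dsid)
    with urllib.request.urlopen(url) as resp:
        print(resp.read().decode('utf-8'))   # e.g. the Dublin Core datastream XML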

Fedora is extremely flexible regarding data models. The DSpace 1.x data model was a good start, and is evolving towards a model useful for most IR usage. This data model work was started through the architectural review and is being continued in the current JISC-funded DSpace 2.0 work. The existence of such a data model is extremely important for the adoption of descriptive and structural metadata standards such as FRBR and SWAP.

All in all, I’m looking forward to seeing how this collaboration takes off – I count myself in the “supportive and excited” camp. There will be plenty of challenges; for example, release co-ordination has the capacity to cause disproportionate heartache. As does naming: what would a Fedora / DSpace combination be called? How about “Hat Full of Sky”?

Full e-mail:

From: Sandy Payette and Michele Kimpton

Date: May 30, 2008 11:17:18 AM PDT

Subject: Joint discussions on Fedora/DSpace collaboration

Dear members of the DSpace and Fedora communities,

Over the last few weeks, we (Michele Kimpton and Sandy Payette) have been discussing the possibilities of our organizations collaborating. The reasons for exploring the possibilities of collaboration are based on the following:

  1. The missions of our non-profit organizations are very similar and we are motivated to provide the best technology and services to many of the same communities
  2. Over the next 12-18 months, our existing technology roadmaps suggest convergence of thought in several key areas of our architectural visions
  3. We are both motivated to show how our open source repositories offer a unique value proposition compared to proprietary solutions

Over the past couple of weeks, we have had informal discussions with members of our communities, leaders in libraries and higher education, and Board members to get initial feedback as to whether they would support collaboration and the outcomes they would like to see as a result.

This past week, we convened members of both communities during the PASIG conference to get input and ideas regarding a collaboration.

Thus far, all of the stakeholders we have had the opportunity to talk with have been extremely supportive and excited about the possibility of the Fedora and DSpace communities working together in some capacity.

As a result of these discussions, we have agreed to move forward in our exploration of collaborative possibilities. Over the next several weeks our organizations will meet to plan the next steps in the process. Our intent is to bring together the ideas and expertise within both communities to come up with the most compelling issues to work on to best serve our communities.

As we move through this process it is our commitment to ensure that all discussions, meetings and decisions made are transparent and open in the hopes to engage and inform the community.

We look forward to your ideas and inputs!

Best Regards,

Michele and Sandy

ODF fillip

May 22, 2008

I’ve resisted getting drawn into the OOXML scrap over on PM-R’s blog; partly because I’ve had plenty to do, and mostly because I don’t think another partly informed opinion would add much to the debate.

Our approach to text mining is necessarily pragmatic, which changes your outlook significantly (for detailed reasons why, read Peter Sefton’s blog). OOXML may be a flawed spec born of a standardisation process that left its participants disenchanted and angry. It may be that OOXML can only ever be implemented meaningfully by Word. The fact remains that most chemists, most people, use Word to create documents.

Which is why the news that Office 2007 SP2 introduces native support for ODF is brightening my day.

ODF has a potential upside in expanding interoperability, but as always, business continuity requirements will have a significant effect on our approach to these file format changes.

Gray Knowlton

When ODF became a standard I hoped Microsoft would see the business advantage of open data specs and interoperability, and start playing along. Looks like we’re getting there.

The CRIG developer challenge at Open Repositories 2008 was a real success (props due to David Flanders for making it happen). I’m sure the cash prize helped to motivate people, but it can’t account for all that effort, so what was the magic ingredient X? More on that in a moment, but first…

I’d like to add my congratulations to the chorus for Dave, Tim and Ben! I’d also like to make some honourable mentions: –

Oh, and from my point of view, the magic ingredient was “the opportunity to have your work toasted by a room full of your peers, and then around the blogosphere”. Money can’t buy it.

Slides from the presentations I gave at OR08 are now available from DSpace@Cambridge: –

CrystalEye – from desktop to data repository.

Preview of the TheOREM Project.

They should also be appearing (possibly with video, who knows?) at the official conference repo at http://pubs.or08.ecs.soton.ac.uk/.