Excited about OpenID

November 14, 2008

This week, I’m excited about OpenID, and this blog post contains some disconnected ideas that illustrate why. If you’re not into techie stuff but still want to know why I think OpenID is very cool for repositories, feel free to skip down to “Embargo Management”.

Same User ID On Distributed Services

I want to host a code project on googlecode, with a trac (simple project management for software-focused projects) installation on one of our servers here. An Apache mod or a trac plugin could act as the relying party implementation, with the Google OpenID server keeping the user ids for issue tickets in trac consistent with the svn commits.

I’d like to do the same with the svn provision at Sourceforge, so I hope they’re thinking of implementing an OpenID server as well as a relying party.

Unifying User Access

At the moment, for many of our services, I have to manage user accounts and passwords for our external collaborators (which is a drag). Internal users aren’t a problem since the University runs a Kerberos-style single sign-on service of its own devising (Raven), but I have to configure dual-mode authentication for each new service we offer (which is an even bigger drag).

Ben Harris of the University Computing Service has implemented an unofficial service (CorvID) that offers OpenID server functionality, using the University’s single sign-on service for authentication.

Putting this together, OpenID could potentially unify access to our services for all our users.

Embargo Management

Most exciting from a repository point of view is the potential OpenID has when applied to embargo management, which we’re thinking about in the ICE-TheOREM project. The scenario goes something like this: a PhD candidate has several chapters in their thesis that they think would make really great manuscripts, and they wish to embargo them until they’ve written the manuscripts. So they apply a hard embargo (i.e. “Don’t release until I say so”) on those chapters in their repository, intending to write the papers the next month. Then they get a job in the city (or perhaps, in the current climate, become a plumber), their good intentions are picked up by the infernal road builders, and their university ids and e-mail addresses are meticulously removed.

Some months later the manager of the repository is doing a periodic embargo review and wants to release the embargo on this deserted content. Problem 1: How does s/he get in touch with the author? Problem 2: Once s/he has, how can the system be sure that it really is the author?

I think we’ve got a potential solution for this using OpenID, and we’re hopefully going to implement a demonstrator in ICE-TheOREM. In a nutshell: the author sets up the embargo management with an OpenID they control (e.g. http://joe.bloggs.name/), delegating to the Uni server. When the author leaves the Uni, they modify their OpenID to delegate to a different server (Google, myopenid, whatever) and also update their e-mail details (maybe using FOAF in RDFa). If they do, then the repo always has a way to get in touch with the author, and can also authenticate them.

When you strip this bare, all that’s going on is the consistent use of URL references to identify and authenticate people across systems, and a layer of indirection through the OpenID delegate system. References. Indirection. Simple tools, but they solve a real problem simply.
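To make the indirection concrete, here is a minimal Python sketch of how a relying party might read the delegation links from the author's identity page. The `openid.server` / `openid.delegate` and `openid2.provider` / `openid2.local_id` link relations are the ones OpenID uses for delegation; the provider URLs, the page contents and the function names are made up for illustration, and this is not the ICE-TheOREM implementation.

```python
from html.parser import HTMLParser


class DelegationParser(HTMLParser):
    """Collect OpenID delegation <link> elements from an identity page."""

    WANTED = {"openid.server", "openid.delegate",
              "openid2.provider", "openid2.local_id"}

    def __init__(self):
        super().__init__()
        self.links = {}

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attrs = dict(attrs)
        href = attrs.get("href")
        # rel may hold several space-separated values
        for rel in (attrs.get("rel") or "").split():
            if rel in self.WANTED and href:
                self.links[rel] = href


def find_delegation(html):
    """Return a dict mapping link relation -> URL for one identity page."""
    parser = DelegationParser()
    parser.feed(html)
    return parser.links


# The identifier stored by the repository never changes (taken from the post).
CLAIMED_ID = "http://joe.bloggs.name/"

# Hypothetical page served at CLAIMED_ID while the author is at the Uni.
WHILE_AT_UNI = """
<html><head>
<link rel="openid.server" href="https://openid.uni.example/server">
<link rel="openid.delegate" href="https://openid.uni.example/jbloggs">
</head></html>
"""

# Hypothetical page at the same URL after the author has moved on.
AFTER_LEAVING = """
<html><head>
<link rel="openid.server" href="https://openid.provider.example/server">
<link rel="openid.delegate" href="https://openid.provider.example/joe">
</head></html>
"""

for label, page in (("before", WHILE_AT_UNI), ("after", AFTER_LEAVING)):
    print(label, CLAIMED_ID, "->", find_delegation(page))
```

The repository only ever records the claimed identifier; which server does the actual authentication is looked up afresh each time, which is why the author can switch providers without the repository noticing.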

Happy Idiot Talk

Of course, there are many reasons this won’t happen. There’s many an interop- slip twixt -ability and -ation. As far as I know none of the major repo platforms have OpenID relying party implementations in stable release yet (although I’m sure they’ve all talked about it, and before you’ve finished this post, Ben O’Steen will have it implemented in Fedora). HE institutions committed to Shibboleth might be resistant to the idea of supporting OpenID. Market research shows that user adoption of OpenID is largely restricted to geeks, seemingly because of the user experience.

Still, it’s exciting to find such a neat theoretical solution to a real problem!

I’m going to blog some more substantial notes on last Friday’s RepoCamp as and when time permits. In the meantime, a cool idea and a plea for collaborators.

The RepoCamp involved the announcement of not one, but two developer challenges in the style of the one at Open Repositories 2008. The first is a general challenge (for which I can’t easily find a reference: help please, WoCRIG!) to do something cool involving interoperating systems. The second challenge is specific to the OAI-ORE specification, and involves creating a prototype that makes the usefulness of ORE visible to end-users.

I’ve got a cool idea for this, but I’m going to need to collaborate to get it done in time, so I’m blogging it in the hope that someone with a bit of time on their hands will get in touch.

The idea: a javascript library (or userscript) that follows all the links on a page and if the link is an ORE Resource Map, or if a Resource Map can be auto-discovered from it, the link is decorated with an ORE icon. Clicking the ORE icon pops up a display of the contents of the ORE aggregation, a la Stacks in OS X 10.5.
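The real thing would be JavaScript running in the browser, but the discovery step can be sketched in a few lines of Python. This assumes the `rel="resourcemap"` link relation described in the ORE discovery document; the sample page, the URLs and the names `LinkCollector` and `discover` are hypothetical, and no network access is attempted.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect advertised Resource Maps and ordinary links from one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resource_maps = []  # from <link rel="resourcemap" href="...">
        self.anchors = []        # from <a href="...">, candidates to follow

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        href = attrs.get("href")
        if not href:
            return
        url = urljoin(self.base_url, href)  # resolve relative links
        rels = (attrs.get("rel") or "").lower().split()
        if tag == "link" and "resourcemap" in rels:
            self.resource_maps.append(url)
        elif tag == "a":
            self.anchors.append(url)


def discover(html, base_url):
    """Parse one page and return the populated collector."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector


# Hypothetical repository item page that advertises a Resource Map.
SAMPLE = """
<html><head>
<link rel="resourcemap" type="application/atom+xml" href="/rem/1.atom">
</head><body>
<a href="/item/2">Another item</a>
</body></html>
"""

found = discover(SAMPLE, "http://repo.example/item/1")
print("decorate with ORE icon:", found.resource_maps)
print("links still to check:", found.anchors)
```

A userscript would run the same check against each linked page and decorate only those links for which `resource_maps` comes back non-empty.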

There are some fun bells and whistles in there, including making the interface super shiny and minimizing bandwidth.

Anyone want to help out? I was planning to use John Resig’s jQuery and HTML parsing libraries and possibly processing.js.

Last Thursday I attended a JISC workshop on repository architectures. It was a thought-provoking day, and I learned a lot. Firstly, I learnt that I need to pay more attention to context when quoting people on twitter (sorry, @paulwalk).

Paul Walk kicked off the day by presenting his work on a next generation architecture for repositories. His presentation opened with a number of starting principles and moved on to some diagrams illustrating a specific architecture based on them. As Paul mentions in that blog post, his diagrams and the principles behind them were “robustly challenged”. As far as I remember, the diagrams were challenged more robustly than the principles.

To cut a longish story short, the discussion and the workshop exercises brought up some interesting ideas, but did relatively little to either validate Paul’s architecture diagrams, or to provide a working alternative. Chatting about it over lunch and later over a pint, I was persuaded that we were looking for an abstraction that doesn’t exist, and that the desire for a single generic repository architecture might have led us down the garden path.

Software engineering, being a field that values pragmatic epistemology, has a couple of empirically derived laws that might help to explain why. Firstly, Larry Tesler’s law of the Conservation of Complexity states (in a nutshell) that complexity can’t be removed, just moved around. A natural way to manage this complexity is to find abstractions that hide some of it. This, fundamentally, is what the repositories architecture is trying to do – reduce the multiplicity of interests, politics and data-borne activities of HE into a single abstract architecture.

A second empirical law, The Law of Leaky Abstractions, states that all non-trivial abstractions leak. Some of the complexity cannot be hidden behind the abstraction and leaks through. It feels to me that this is what’s happening with repositories at the moment. Our abstraction (centralization, services provided at point of storage etc) fails to cope with real, current complexities. The problem itself is extremely complex, and if anyone really has their head around it, they’ve still got the hard task of communicating it to the whole community so a good shared abstraction can be developed.

I found myself going back to Paul’s starting principles, and concluded that they were a much more constructive framework for thinking about repository issues than the concrete architectures in the diagrams. Paraphrasing the principles: –

  • Move necessary activity to the point of incentive
  • [Terms of reference for IRs]
  • Pass by reference, not by copy
  • Move complexity towards the point of specialisation
  • Expect and accept increasing complexity on the local side of the repository with more sophisticated workflow integration.

With the exception of the point on IRs, they are all forms of guidance on complexity, either where to move it (“Move [metadata] complexity towards the point of specialisation”), or which trade-offs to make (“Pass by reference, not by copy” => “Prefer to deal with the complexities of references than the complexities of duplication”). The reason I like this approach is that different disciplines, institutions and activities (e.g. REF, publication, administration) all have different complexities and different drivers. Perhaps we need a number of different architecture abstractions based on constraints and drivers. Perhaps the idea of an architecture abstraction is premature in this community and we should focus on local solutions (in the sense of ‘minima’ rather than geography). This needn’t end in technical balkanization; the repositories architecture is driven by business models, and focusing on interoperability and the web architecture allows more of the technical discussion to happen in parallel.

To get the ball rolling, I’d like to add a caveat to Paul’s “Move [metadata] complexity towards the point of specialisation”: “… unless it’s there already and it’s harder to recreate than maintain”. Any more?


Andrew McGregor has posted extensive minutes and notes from the meeting.

Good news from Michele Kimpton (DSpace) and Sandy Payette (Fedora). Excerpt: –

Over the last few weeks, we (Michele Kimpton and Sandy Payette) have been discussing the possibilities of our organizations collaborating. …

Thus far, all of the stakeholders we have had the opportunity to talk with have been extremely supportive and excited about the possibility of the Fedora and DSpace communities working together in some capacity.

(full e-mail at end of this post)

As a general principle, it’s great to see a bit more harmony in the OS space rather than increasing balkanization. Exciting as it is, the idea of Fedora + DSpace is not new; it’s been a perennial topic for repository pub chat for a couple of years. At Open Repositories 2008 I revived the idea with a couple of the other DSpace committers, looking for a bit of lively debate, and found none; they had evidently been thinking along the same lines. Mark Diggory, Richard Rodgers and Andrius Blažinskas are looking at Fedora integration through Google’s Summer of Code program.

I’m going to set the inter-community aspects of the collaboration aside for a moment, and think about how a Fedora / DSpace software might look. I’ve got four top priorities I’d like to see DSpace address: –

  • Plugin based architecture a la eclipse to ease customisation and maintenance of core application
  • Data model improvement to support important features such as revisioning / versioning, per-file metadata etc
  • UI improvements
  • Interoperability through provision / consumption of RESTful web services

Starting with the last point: the Fedora and DSpace communities are already heavily involved in the development and adoption of interop standards, and I’ve no reason to believe that a tie-up would change that. There may be some efficiencies to be had, but they’re not obvious to me right now.

As far as I know, Fedora doesn’t have a plugin architecture that could be used, but using Fedora doesn’t make implementing a plugin architecture any harder.

I’m guessing that Fedora would most likely be used as a back end to DSpace, possibly accessed through the Fedora REST service interface. Since Fedora handles metadata well (using RDF), a Fedora back end would provide more functionality than the current DSpace storage abstractions (e.g. SRB). Hopefully this will allow the DSpace development community to implement functionality enhancements quickly, and focus on Manakin and other UI improvements.

Fedora is extremely flexible regarding data model. The DSpace 1.x data model was a good start, and is moving towards a model useful to most IR usage. This data model work was started through the architectural review and is being continued in the current JISC funded DSpace 2.0 work. The existence of this data model is extremely important in the adoption of descriptive and structural metadata standards such as FRBR and SWAP.

All in all, I’m looking forward to seeing how this collaboration takes off – I count myself in the “supportive and excited” camp. There will be plenty of challenges; for example, release co-ordination has the capacity to cause disproportionate heartache. As does naming: What would a Fedora / DSpace combination be called? How about “Hat Full of Sky“?

Full e-mail:

From: Sandy Payette and Michele Kimpton

Date: May 30, 2008 11:17:18 AM PDT

Subject: Joint discussions on Fedora/DSpace collaboration

Dear members of the DSpace and Fedora communities,

Over the last few weeks, we (Michele Kimpton and Sandy Payette) have been discussing the possibilities of our organizations collaborating. The reasons for exploring the possibilities of collaboration are based on the following:

  1. The missions of our non-profit organizations are very similar and we are motivated to provide the best technology and services to many of the same communities
  2. Over the next 12-18 months, our existing technology roadmaps suggest convergence of thought in several key areas of our architectural visions
  3. We are both motivated to show how our open source repositories offer a unique value proposition compared to proprietary solutions

Over the past couple of weeks, we have had informal discussions with members of our communities, leaders in libraries and higher education, and Board members to get initial feedback as to whether they would support collaboration and the outcomes they would like to see as a result.

This past week, we convened members of both communities during the PASIG conference to get input and ideas regarding a collaboration.

Thus far, all of the stakeholders we have had the opportunity to talk with have been extremely supportive and excited about the possibility of the Fedora and DSpace communities working together in some capacity.

As a result of these discussions, we have agreed to move forward in our exploration of collaborative possibilities. Over the next several weeks our organizations will meet to plan the next steps in the process. Our intent is to bring together the ideas and expertise within both communities to come up with the most compelling issues to work on to best serve our communities.

As we move through this process it is our commitment to ensure that all discussions, meetings and decisions made are transparent and open in the hopes to engage and inform the community.

We look forward to your ideas and inputs!

Best Regards,

Michele and Sandy

The CRIG developer challenge at Open Repositories 2008 was a real success (props due to David Flanders for making it happen). I’m sure the cash prize helped to motivate people, but it can’t account for all that effort, so what was the magic ingredient X? More on that in a moment, but first…

I’d like to add my congratulations to the chorus for Dave, Tim and Ben! I’d also like to make some honourable mentions: –

Oh, and from my point of view, the magic ingredient was “the opportunity to have your work toasted by a room full of your peers, and then around the blogosphere”. Money can’t buy it.

Slides from the presentations I gave at OR08 are now available from DSpace@Cambridge: –

CrystalEye – from desktop to data repository.

Preview of the TheOREM Project.

They should also be appearing (possibly with video, who knows?) at the official conference repo at http://pubs.or08.ecs.soton.ac.uk/.

Savas Parastatidis has announced research output repository software being developed by the MS technical computing group. I got a sneaky preview from Savas a couple of weeks ago, so I’ve been looking forward to being able to blog this! The UI is lots of fun:

Notes and comments: –

  • It’s based on an RDBMS, but walks and quacks a lot like an RDF triplestore. The design aims to retain the scalability of a well designed RDBMS schema, but gain the flexibility of a triplestore.
  • It’s going to be free (as in beer). Of course, there’s a stack of licensed software (windows server, SQL server etc) you need before you can install it.
  • There’s a suggestion that it may be released under an Open Source license. Whatever license they choose, I think the strongest development community will be built on a good API and plugin management system (sound familiar?). This could work with a closed license, or on a MySQL-type OS model equally well.
  • The team have a strong and (IMO) genuine desire to play nicely with existing interoperability standards, and to participate in the development of interop standards.
  • Will this be a competitor to DSpace, Fedora, ePrints, BePress, Intrallect et al? Of course it will, but what IRs need now is people trying new and different approaches, so a new entrant could give the whole area a fillip.
  • More and more people are going to want to bring their repositories to their data, rather than vice versa. Lots of people store their data on windows servers, in active directories and shared network drives etc. There could be a lot of very quick wins if the team choose to go in that direction.
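As a toy illustration of the first note (a relational store that walks and quacks like a triplestore), here is a sketch using Python’s built-in sqlite3 module. It says nothing about how the MS software is actually designed; the table layout, identifiers and sample data are all made up.

```python
import sqlite3

# One narrow table of (subject, predicate, object) rows: new kinds of
# statement need no schema change, which is the triplestore-style flexibility.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)"
)
# An ordinary relational index keeps the common lookup cheap.
conn.execute("CREATE INDEX idx_subject_predicate ON triples (subject, predicate)")

# Hypothetical sample statements about two research outputs.
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("paper:1", "dc:title", "A Paper"),
        ("paper:1", "dc:creator", "Alice"),
        ("paper:1", "dc:creator", "Bob"),
        ("dataset:7", "dc:creator", "Alice"),
        ("paper:1", "cites", "dataset:7"),
    ],
)


def objects(conn, subject, predicate):
    """Return every object asserted for (subject, predicate), sorted."""
    cursor = conn.execute(
        "SELECT object FROM triples "
        "WHERE subject = ? AND predicate = ? ORDER BY object",
        (subject, predicate),
    )
    return [row[0] for row in cursor]


print(objects(conn, "paper:1", "dc:creator"))
```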

Andrew Walkingshaw came back from semantic camp brimming with enthusiasm and bearing gifts: stickers bearing a likeness of Roy Fielding and the slogans “Fielding has a posse” and “RFC 2616” (the HTTP 1.1 spec of course!). I could stick one on my trusty powerbook; apparently all the cool semantic/web/2.0 kids have stickers all over their macs these days. My instinct to preserve the pure clean lines is evidently old hat, and as we know, old hat don’t dance.

This is timely since Roy Fielding now has a blog, and there’s been a flurry of RESTful repository discussion in the wake of Andy Powell’s keynote at VALA (responses a, b, c, d).

From Andy’s original post: –

Finally, that the ‘service oriented’ approaches that we have tended to adopt in standards like the OAI-PMH, SRW/SRU and OpenURL sit uncomfortably with the ‘resource oriented’ approach of the Web architecture and the Semantic Web. We need to recognise the importance of REST as an architectural style and adopt a ‘resource oriented’ approach at the technical level when building services.

In the comments there’s the fashionable spat over whether the word “repository” should be pejorative, but I’m surprised nobody’s trodden on the “service-oriented” banana skin. Andy does clarify with “at the technical level” at the end of the point, but care is needed since SOA is a historically infamous weasel phrase: –

“… that’s a service oriented approach for you.”

“I don’t know what you mean by service oriented approach” said Alice

Humpty Dumpty smiled contemptuously “Of course you don’t, until I tell you.”

Repositories Thru the Looking Glass, missing chapter, with apologies to A. Powell and L. Carroll

There are (at least) three distinct meanings of “service oriented” in the repositories context.

The Good
Services as in “a set of services that a university offers to the members of its community for the management and dissemination of digital materials” (Cliff Lynch).

The Bad
Protocols such as those Andy mentions (OAI-PMH, SRW/SRU, OpenURL). These are also sometimes referred to as STREST interfaces (Service Trampled REST) as they work using the same URL and HTTP mechanisms as REST, but do so in a way that doesn’t take advantage of the web architecture (or rather, that doesn’t observe the constraints of the web architecture).

The Ugly
Snake Oil Architecture. SOAP, WSDL, WS-*, standards documentation as thick as your arm. Bleuch.
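To see what “the same URL and HTTP mechanisms, but not the web architecture” looks like, compare how the same record might be addressed in each style. The OAI-PMH request uses the protocol’s real `verb`, `identifier` and `metadataPrefix` parameters; the base URLs, the record identifier and the resource-oriented `/items/...` layout are hypothetical.

```python
from urllib.parse import urlencode


def oai_pmh_get_record(base_url, identifier, metadata_prefix="oai_dc"):
    """Service-oriented style: one endpoint, the operation named in the query."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


def resource_url(base_url, item_id):
    """Resource-oriented style: each item has its own URL, and the HTTP
    method (GET, PUT, ...) says what to do with it. Hypothetical layout."""
    return f"{base_url}/items/{item_id}"


print(oai_pmh_get_record("http://repo.example/oai", "oai:repo.example:123"))
print(resource_url("http://repo.example", "123"))
```

In the first form every request goes to the same URL and the interesting part is tunnelled through the query string; in the second the item itself is something you can link to, cache and bookmark.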

At a certain level, thinking about services makes sense. The mistake is to be too literal and carry it through to implementation. The JISC e-Framework animation describing SOA looks like they were thinking about resource oriented services – it’s all about common formats, GET and PUT. From a techie’s point of view, your manager can take a Service Oriented Approach and you can implement it RESTfully.

A quote from Sam Ruby

February 13, 2008

“…the ease with which a Ruby client (or a Python one) can be wired up to a Java middle tier talking to a Erlang back end using only HTTP, Atom, and JSON is a testament to the power of these simple protocols and formats.”

Sam Ruby

Substitute some repository software names for the programming languages. That’s why we had a barcamp on RESTful repository interfaces last week.

CrystalEye is a repository of crystallographic data. It’s built by a software system written by Nick Day that uses sections of Jumbo and CDK for functionality. It isn’t feasible for Nick to curate all this data (>100,000 structures) manually, and software bugs are a fact of life, so errors creep in.

Egon Willighagen and Antony Williams (ChemSpiderMan) have been looking at the CrystalEye data, and have used their blogs (as well as commenting on PM-R’s) to feed issues back. This is a great example of community data checking. Antony suggested that we implement a “Post a comment” feature on each page to make feedback easier. This is a great idea, so we had a quick think about it and propose a web2.0 alternative mechanism: Connotea.

To report a problem in CrystalEye, simply bookmark an example of the problem with the tag “crystaleyeproblem”, using the Description field to describe the problem. All the problems will appear on the tag feed.

When we fix the problem we’ll add the tag “crystaleyefixed” to the same bookmark. If you subscribe to this feed, you’ll know to remove the crystaleyeproblem tag.
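The bookkeeping behind the two tags is simple enough to sketch in Python. This does not call Connotea; the bookmark structure, the URLs and the `open_problems` helper are hypothetical stand-ins for whatever the tag feed delivers.

```python
PROBLEM_TAG = "crystaleyeproblem"
FIXED_TAG = "crystaleyefixed"


def open_problems(bookmarks):
    """Return bookmarks reported as problems that have not yet been marked fixed."""
    return [
        bookmark for bookmark in bookmarks
        if PROBLEM_TAG in bookmark["tags"] and FIXED_TAG not in bookmark["tags"]
    ]


# Hypothetical bookmarks, shaped roughly as they might arrive from a tag feed.
bookmarks = [
    {"url": "http://crystaleye.example/structure/1",
     "description": "Bond orders look wrong",
     "tags": {"crystaleyeproblem"}},
    {"url": "http://crystaleye.example/structure/2",
     "description": "Missing charge on the anion",
     "tags": {"crystaleyeproblem", "crystaleyefixed"}},
    {"url": "http://crystaleye.example/structure/3",
     "description": "Just an interesting structure",
     "tags": {"chemistry"}},
]

for bookmark in open_problems(bookmarks):
    print(bookmark["url"], "-", bookmark["description"])
```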

In the fullness of time, we’re planning to use connotea tags to annotate structures where full processing hasn’t been possible (uncalculatable bond orders, charges etc).