Neat ORE shorthand

February 5, 2008

A useless piece of trivia, perhaps, but since ORE requires resource maps to dereference to the resource map document, the resource map URI is implicitly the base URI for the document. This means you can use relative URIs in the ORE document for the aggregation and the resource map itself. In Turtle, the following would be a valid complete resource map document.

@prefix ore
ore:describes .
ore:aggregates ,
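
A minimal sketch of why the relative references work, using Python's standard `urljoin`. The resource map URI `http://example.org/rem/1` and the fragment `#aggregation` are made up for illustration; they are not taken from the ORE spec or from any real resource map.

```python
from urllib.parse import urljoin

# Hypothetical resource map URI. ORE requires it to dereference to the
# resource map document, so it doubles as that document's base URI.
REM_URI = "http://example.org/rem/1"


def resolve(reference, base=REM_URI):
    """Resolve a relative URI reference against the resource map's base URI."""
    return urljoin(base, reference)


# The empty reference resolves to the document itself, i.e. the resource map.
resource_map = resolve("")
# A fragment-only reference (hypothetical name) can stand for the aggregation.
aggregation = resolve("#aggregation")
```

Under these assumptions the empty reference names the resource map and the fragment names the aggregation, so neither needs to be written out in full inside the document.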


Background: An alpha version of the OAI-ORE specifications was released in December, and has prompted less public discussion than I’d hoped for, so I’m going to post some of the issues as I perceive them in an attempt to promote awareness. I’ll inevitably fail to be comprehensive, so I won’t try – I’ll stick to the ones that interest me.

ORE is a way of describing aggregations of web resources; complex objects in digital library / repository parlance. It’s based on semantic web principles and technology and is RESTful (unlike PMH, but that’s a story for another day), which is a Good Thing.

So what is it good for: –

i) Provides an alternative to content packaging. Content packaging standards and security are two of the biggest hurdles to repository interop. ORE could provide a route around one of them, and bring the repository world closer to the web in doing so.

ii) Takes forward named graphs for defining boundaries on the semantic web. The semantic web can be visualized as a big network of statements about things that lacks a way of defining a chunk of the network (in order to make statements about it…). You perhaps have to be a bit of a semantic web geek to appreciate the importance of this at first blush.

The alpha of the standard itself stood out for a couple of things too. It seemed to have been written with a mindset of “what is the least we can specify whilst being useful?”. It’s also a well rounded spec; there are constraints to make it simple, but they’re not out of balance with the amount of specification and support provided.

ORE is likely to be important to the repository community; there is a lot of momentum behind it (on which more later), and it provides a piece of previously-missing infrastructure. So it might well be worth your while to read the spec, join the discussion group and maybe even read some of the following posts…

Roundup 14th Dec P.S.

December 14, 2007

I omitted some important news: OAI-ORE released an alpha spec. I’d urge anyone with an interest in interoperability to read and comment – the definition of compound object boundaries on the semantic web isn’t done fantastically well at the moment and the current idiom of pass-by-val between repositories (with content packages) means a bunch of headaches that pass-by-ref (a la ORE) avoids – so it’s important to get this right early.

Roundup 14th Dec

December 14, 2007

It’s been a tough week for (especially institutional) repositories, with some of the criticism specific (David Flanders, Dorothea Salo) and some a little more general (“centralization is a bug”). But it’s not all doom and gloom; Paul Walk clears up being taken out of context and Peter Murray-Rust announces the Microsoft e-Chemistry project, which I’m optimistic will make big advances in practical repository interop.

On a complete tangent, I discovered Mark Nottingham’s blog and read with delight about his HTTP caching extensions. My relationship with HTTP has been like one of those infernal teen movies where the in-crowd-but-with-a-conscience kid finally dispels the fog of peer pressure and preconception, and invariably finds the class dork to be the member-of-opposite-sex of dreams. Not that an in-crowd kid could ever utter the words “my relationship with HTTP”.

About a year ago, Peter Murray-Rust showed his research group a web interface that allowed you to type SPARQL into a textarea input and have it evaluated. I had a flashback to people being shown the same thing with SQL years ago. So if SPARQL follows the same pattern, the textareas will disappear as the developers take the complexity of the query language and data model away from the users; then the developers will write enormous libraries (cf. Object Relational Mapping tools) so they don’t have to deal with the query language either.

Ben O’Steen recently posted on Linking resources [in Fedora] using RDF, and one part particularly jumped out at me: –

The garden variety query is of the following form:

“Give me the nodes that have some property linking it to a particular node” – i.e. return all the objects in a given collection, find me all the objects that are part of this other object, etc.

I think the common-or-garden query is “I’m interested in uri:foo, show me what you’ve got”, which is the same, but doesn’t require you to know the data model before you make the query. Wouldn’t it be cool to have a tech that gave you the “interesting” sub-graph for any uri? Maybe the developer would have to describe “interestingness” in a class based way, or it could be as specific as templates (I suspect Fresnel could be useful here, but I looked twice and still didn’t really get it). Whatever the solution looks like, I doubt that a query language as general and flexible as SPARQL will be the best basis for it, for the reasons Andy Newman gives – what’s needed is a query language where the result is another graph.
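
To make the contrast concrete, here is a rough sketch over a toy in-memory graph of (subject, property, object) triples. The URIs, property names and function names are all invented for illustration, and “interesting” is crudely taken to mean “any statement that mentions the uri”; a real solution would need something smarter.

```python
# A toy graph: a set of (subject, property, object) triples.
# All identifiers here are hypothetical.
GRAPH = {
    ("uri:foo", "ex:isMemberOf", "uri:collection1"),
    ("uri:bar", "ex:isMemberOf", "uri:collection1"),
    ("uri:foo", "ex:title", "A thesis"),
    ("uri:baz", "ex:isPartOf", "uri:foo"),
}


def linked_to(triples, target):
    """Model-aware query: the nodes with some property pointing at target."""
    return {s for (s, p, o) in triples if o == target}


def describe(triples, uri):
    """Model-agnostic query: every statement mentioning uri.

    The result is itself a set of triples, i.e. another (smaller) graph.
    """
    return {(s, p, o) for (s, p, o) in triples if uri in (s, o)}


members = linked_to(GRAPH, "uri:collection1")
about_foo = describe(GRAPH, "uri:foo")
```

The first function returns a bag of nodes and only helps if you already know which node to ask about and what the links mean; the second takes nothing but a uri and hands back a sub-graph you can go on to explore.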

One thing and another meant I was unable to blog final thoughts and summaries about the CRIG unconference that I attended last week, so this is rather long, being a combination of the post I would have written Friday afternoon and a couple of consequent thoughts.

Firstly, on unconferencing, or at least the way it was implemented for CRIG; I like it a lot. Since we were partly a talking shop and partly a workshop to refine interoperability scenarios and challenges, the main session worked essentially like a lightweight breakout session system – topics were assigned to whiteboards, and people chose topics, migrated, discussed and withdrew as they wished. It was leagues more interesting and more productive than being assigned a breakout group with a topic. Successive rounds of dotmocracy helped to sort out the zeitgeist from the soapboxes. I could see this format working extremely well for e.g. the Cambridge BarCamp, or the e-Science All Hands meeting.

This was the first face to face meeting of the CRIG members as CRIG members, and really helped to frame the agenda for CRIG. I realized that there are some big issues underlying repositories that only become really important when discussing interoperability. For example, I can see OAI-ORE creating the same kind of fun and games around pass-by-ref, pass-by-val that the community currently enjoys when discussing identifiers, and just like identifiers, it touches just about every scenario.

One message came out pretty strongly: the emphasis on repositories isn’t useful in itself. One of the topics for discussion that passed dotmocracy (i.e. was voted as something people wanted to talk about) was “Are repositories an evolutionary dead end?”, a theme picked up by David Flanders. Well, I personally don’t think so, but then I’ve probably got a more malleable definition of “the R word” (as Andy Powell puts it) than most. If I’ve read the mood correctly, people are beginning to regard centralized, data storing, single-software, build-it-and-they-will-come IRs as a solution looking for a problem. Some regard repositories as a complete diversion; others think that we should act on our improved understanding of the problems in academic content management and dissemination by acknowledging failed experiments and moving on quickly. Nobody gave me the impression that they thought the current approaches would work given a couple more years, more effort or more funding.

This has all been said before; when the conference was over, I reminded myself of Cliff Lynch’s 2003 definition of institutional repositories, which describes them in terms of services, collaboration, distributed operation and commitment. If you haven’t read it, or haven’t read it in a while, go back and take a look; it’s in the 5th para.

Whilst it’s only a view of how things should be, I think it’s a good view, and it neatly sums up what’s important about repository interoperability – it’s about the interaction between systems needed to achieve a repository.

Round up 2007-11-16

November 16, 2007

More notables banging the REST drum.

A post by Jon Udell on tiny URLs for web citations, with a good comment from Peter Murray. A persistent redirecting service that automatically caches and preserves content? Throw in some access management and that sounds like a good part of an institutional repository.

I’ve been in a reflective mood about CrystalEye over the last few days. In repository-land where I spend part of my time, OAI-PMH is regarded as a really simple way of getting data from repositories, and approaches like Atom are often regarded as insufficiently featured. So I’ll admit I was a bit surprised about the negative reaction provoked by the idea of CrystalEye only providing incremental data feeds.

The “give me a big bundle of your raw data” request was one I’d heard before, from Rufus Pollock at OKFN, when I was working on the DSpace@Cambridge project, a topic he returned to yesterday, arguing that data projects should put making raw data available as a higher priority than developing “Shiny Front Ends” (SFE).

I agree on the whole. In a previous life working on public sector information systems I often had extremely frustrating conversations with data providers who didn’t see anything wrong in placing access restrictions on data they claimed was publicly available (usually the restriction was that any other gov / NGO could see the data but the public they served couldn’t).

When it comes to the issue with CrystalEye we’re not talking about access restriction, we’re talking about the form in which the data is made available, and the effort needed to obtain it. This is a familiar motif: –

  • The government has data that’s available if you ask in person, but that’s more effort than we’d like to expend, we’d like it to be downloadable
  • The publishers make (some) publications available as PDF, but analyzing the science requires manual effort, we’d like them to publish the science in a form that’s easier to process and analyze
  • The publishers make (some) data available from their websites, but it’s not easy to crawl the websites to get hold of it – it would be great if they gave us feeds of their latest data
  • CrystalEye makes CML data available, but potential users would prefer us to bundle it up onto DVDs and mail it to them.

Hold on, bit of a role reversal at the end there! Boot’s on the other foot. We have a reasonable reply; we’re a publicly funded research group who happen to believe in Open Data, not a publicly funded data provider. We have to prioritise our resources accordingly, but I still think the principle of providing open access to the raw data applies.

You’ll have to excuse a non-chemist stretching a metaphor: There’s an activation energy between licensing data as open, and making it easy to access and use. CrystalEye has made me wonder how much of this energy has to come from the provider, and how much from the consumer.

While I was working in the real world with Nick on the Atom feeds and harvester for CrystalEye, it seems they became an issue of some contention in the blogosphere. So I’m using this post to lay out why we implemented harvesting this way. These are in strict order of when they occur to me, and I may well be wrong about one or all of them: I haven’t run benchmarks, since getting things working is more important than being right.

This was the quickest way of offering a complete harvest.

Big files would be a pain for the server. Our version of Apache uses a thread pool approach, so for the server’s sake I’m more concerned about clients occupying connections for a long time than I am about the bandwidth. The Atom docs can be compressed on the fly to reduce the bandwidth, and after the first rush as people fill their CrystalEye caches, we’ll hopefully be serving 304s most of the time.

Incremental harvest is a requirement for data repositories, and the “web-way” is to do it through the uniform interface (HTTP), and connected resources.

We don’t have the resource to provide DVDs of content for everyone who wants the data. Or turning that around – we hope more people will want the data than we have resource to provide for. This isn’t about the cost of a DVD, or the cost of postage, it’s about manpower, which costs orders of magnitude more than bits of plastic and stamps.
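
For anyone wondering what harvesting through “connected resources” looks like from the client side, here is a rough sketch, not the actual CrystalEye harvester. It assumes feed pages are chained with a `rel="next"` link and that all `href` values are absolute; the `fetch` argument is a stand-in for whatever does the HTTP GET, so the sketch runs without a network.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom 1.0 XML namespace


def parse_feed(xml_text):
    """Return (entry_links, next_page_url) for one Atom feed document."""
    root = ET.fromstring(xml_text)
    entry_links = []
    for entry in root.findall(ATOM + "entry"):
        for link in entry.findall(ATOM + "link"):
            # A link with no rel attribute is treated as rel="alternate".
            if link.get("rel", "alternate") == "alternate":
                entry_links.append(link.get("href"))
    next_url = None
    for link in root.findall(ATOM + "link"):
        if link.get("rel") == "next":  # assumed paging relation
            next_url = link.get("href")
    return entry_links, next_url


def harvest(start_url, fetch):
    """Follow "next" links from start_url, collecting entry links.

    fetch is any callable mapping a URL to the feed's XML text.
    """
    seen, found, url = set(), [], start_url
    while url and url not in seen:  # the seen set guards against link loops
        seen.add(url)
        links, url = parse_feed(fetch(url))
        found.extend(links)
    return found
```

A real `fetch` would also send `If-None-Match` or `If-Modified-Since` with each request, so that feed pages which haven’t changed since the last harvest come back as a 304 with no body.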

I’ve particularly valued Andrew Dalke’s input on this subject (and I’d love to kick off a discussion on the idea of versioning in CrystalEye, but I don’t have time right now): –

However, I would suggest that the experience with GenBank and other bioinformatics data sets, as well as PubChem, has been that some sort of bulk download is useful. As a consumer of such data I prefer fetching the bulk data for my own use. It makes more efficient bandwidth use (vs. larger numbers of GET requests, even with HTTP 1.1 pipelining), it compresses better, I’m more certain about internal integrity, and I can more quickly get up and working because I can just point an ftp or similar client at it. When I see a data provider which requires scraping or record-by-record retrieval I feel they don’t care as much about letting others play in their garden.

(Andrew Dalke)

… and earlier …

… using a system like Amazon’s S3 makes it easy to distribute the data, and cost about US $20 for the bandwidth costs of a 100GB download. (You would need to use multiple files because Amazon has a 5GB cap on file size.) Using S3 would not affect your systems at all, except for the one-shot upload time and the time it would take to put such a system into place.

(Andrew Dalke)

Completely fair points. I’ll certainly look at implementing a system to offer access through S3, although everyone might have to be even more patient than they have been for these Atom feeds. We do care about making this data available – compare the slight technical difficulties in implementing an Atom harvester with the time and effort it’s taken Nick to implement and maintain spiders to get this data from the publishers in order to make it better available!

“… to hell with the hierarchies, to hell with forms, to hell with communities and collections. I want a bucket collection that any person signing up with an appropriate email address automatically gets deposit rights to.” — Dorothea Salo, in a post even more stuffed with good ideas of how to shift repository bottlenecks than usual.