Web feeds and repositories

December 10, 2008

I was invited to give a presentation on RSS and Atom as part of a SUETr workshop on interoperability yesterday. Of course I didn’t even scratch the surface of what can be achieved with feeds in terms of mash-ups, third-party sites and visualisations – but I did try to get across the breadth of ‘repository’ problems feeds can address, and the importance of feeds as easy wins that add value to your repository efforts (a theme courtesy of Les Carr on his blog).

The slides can be downloaded from either of these places: –


Slides from the presentations I gave at OR08 are now available from DSpace@Cambridge: –

CrystalEye – from desktop to data repository.

Preview of the TheOREM Project.

They should also be appearing (possibly with video, who knows?) at the official conference repo at http://pubs.or08.ecs.soton.ac.uk/.

CrystalEye is a repository of crystallographic data. It’s built by a software system, written by Nick Day, that uses parts of Jumbo and the CDK. It isn’t feasible for Nick to curate all this data (>100,000 structures) manually, and software bugs are a fact of life, so errors creep in.

Egon Willighagen and Antony Williams (ChemSpiderMan) have been looking at the CrystalEye data, and have used their blogs (as well as commenting on PM-R’s) to feed issues back. This is a great example of community data checking. Antony suggested that we implement a “Post a comment” feature on each page to make feedback easier. It’s a great idea, but after a quick think about it we propose a web 2.0 alternative mechanism: Connotea.

To report a problem in CrystalEye, simply bookmark an example of the problem with the tag “crystaleyeproblem”, using the Description field to describe the problem. All the problems will appear on the tag feed.

When we fix the problem we’ll add the tag “crystaleyefixed” to the same bookmark. If you subscribe to this feed, you’ll know to remove the crystaleyeproblem tag.
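
For anyone who wants to automate this, a Connotea tag feed is just RSS, so it can be polled with a few lines of code. Here’s a rough sketch in Java – the tag-feed URL is my assumption about Connotea’s URL scheme, and the class name is made up for illustration: –

import java.io.InputStream;
import java.net.URL;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Sketch only: poll the problem-report tag feed and list what's been bookmarked.
public class ProblemFeedCheck {
    public static void main(String[] args) throws Exception {
        // Assumed Connotea tag-feed URL; adjust to whatever the site actually serves
        URL feed = new URL("http://www.connotea.org/rss/tag/crystaleyeproblem");
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        InputStream in = feed.openStream();
        Document doc = dbf.newDocumentBuilder().parse(in);
        in.close();
        XPath xpath = XPathFactory.newInstance().newXPath();
        // local-name() keeps the query namespace-agnostic (Connotea feeds are RSS)
        NodeList items = (NodeList) xpath.evaluate("//*[local-name()='item']",
                doc, XPathConstants.NODESET);
        for (int i = 0; i < items.getLength(); i++) {
            String title = xpath.evaluate("*[local-name()='title']", items.item(i));
            String desc = xpath.evaluate("*[local-name()='description']", items.item(i));
            System.out.println(title + " -- " + desc);
        }
    }
}

The same loop pointed at the crystaleyefixed feed tells you which of your bookmarks can have their problem tag removed.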

In the fullness of time, we’re planning to use Connotea tags to annotate structures where full processing hasn’t been possible (bond orders that can’t be calculated, charges etc.).

A couple of weeks ago, after using BlogBridge for around 18 months, I packed up my OPML and went over to Google Reader. This post summarizes my experiences.

I chose BlogBridge in the first place because I wanted a reader capable of offline reading (which meant a desktop app at the time), and needed something that a) ran on both Linux and OS X and b) synched between installations. I switched to Google Reader for one reason: BlogBridge doesn’t synch which items I’ve read.

Google Reader Pros

Subscription suggestions. Google’s suggestions are OK – but mainly for finding the subscriptions you should have aggregated anyway rather than for finding interesting but offbeat blogs.

It remembers what I’ve read. It was painful in BlogBridge to switch to my laptop for the first time in a week and see thousands of “unread” items I’d already spent the week reading and clearing down.

There’s something about the rendering that means I end up reading more articles in the feed reader, which is kind of the point. This is probably because it works in the browser window, so I give lots of screen real estate to the browser. With BlogBridge, the app and the browser had to share the real estate, especially on OS X where I had to click on the app again to regain focus (Windows and OS X users probably don’t understand how annoying this is to some Linux users).

Reader + Gears is as good as a desktop app, which is the point, of course!

The subscription bookmarklet means I’m more likely to subscribe to things I find interesting. Which should be a good thing.

I use the article star to indicate “come back and read in more depth”, which works well.

Google Reader Cons

Authenticated feeds. Reader doesn’t have them, but frankly, if it did I wouldn’t use them (Google knows enough about me without me giving them my passwords). I’ve realized how important the few authenticated feeds were to me, so I’m going to be running BlogBridge again, just for them.

Prioritisation. I used guides in BlogBridge to tier my feeds – I’d work my way down the list of guides until I’d run out of blog reading time. I could have used the feed starring mechanism to do the same thing. Reader simply doesn’t give me the tools to prioritize 162 subscriptions.

Trends. When it comes to attention data, blog reading stats are solid gold. Reader’s Trends console is cute, but isn’t giving me a lot back for my attention data. Where’s the tool that automatically prioritizes my feeds in order of which I’m most likely to find interesting? Where’s the management tool that points out I haven’t read a certain feed in months so I could think about de-subscribing? Where’s the XML download that allows me to get my attention data back from Google?

About a year ago, Peter Murray-Rust showed his research group a web interface that allowed you to type SPARQL into a textarea and have it evaluated. I had a flashback to people being shown the same thing with SQL years ago. If SPARQL follows the same pattern, the textareas will disappear as developers take the complexity of the query language and data model away from the users, and then the developers will write enormous libraries (c.f. Object Relational Mapping tools) so they don’t have to deal with the query language either.

Ben O’Steen recently posted on Linking resources [in Fedora] using RDF, and one part particularly jumped out at me: –

The garden variety query is of the following form:

“Give me the nodes that have some property linking it to a particular node” – i.e. return all the objects in a given collection, find me all the objects that are part of this other object, etc.

I think the common-or-garden query is “I’m interested in uri:foo, show me what you’ve got”, which is the same, but doesn’t require you to know the data model before you make the query. Wouldn’t it be cool to have a tech that gave you the “interesting” sub-graph for any URI? Maybe the developer would have to describe “interestingness” in a class-based way, or it could be as specific as templates (I suspect Fresnel could be useful here, but I looked twice and still didn’t really get it). Whatever the solution looks like, I doubt that a query language as general and flexible as SPARQL will be the best basis for it, for the reasons Andy Newman gives – what’s needed is a query language where the result is another graph.
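
For what it’s worth, SPARQL’s DESCRIBE form already hints at that shape of interaction: you hand it a URI and get a graph back, with the server deciding what to include. Here’s a minimal sketch using Jena – the endpoint URL and URI are placeholders, not a real service: –

import com.hp.hpl.jena.query.QueryExecution;
import com.hp.hpl.jena.query.QueryExecutionFactory;
import com.hp.hpl.jena.rdf.model.Model;

// Sketch only: DESCRIBE hands back a graph about a resource rather than a
// table of variable bindings, so the caller needs no prior knowledge of the
// data model. The endpoint and URI below are made up for illustration.
public class DescribeExample {
    public static void main(String[] args) {
        String query = "DESCRIBE <http://example.org/uri/foo>";
        QueryExecution qe = QueryExecutionFactory.sparqlService(
                "http://example.org/sparql", query);
        Model graph = qe.execDescribe();   // the result is itself an RDF graph
        graph.write(System.out, "TURTLE");
        qe.close();
    }
}

It still leaves “interestingness” entirely up to the endpoint implementation, which is exactly the bit that needs a class-based or template-driven description.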

WS-WTF

November 19, 2007

Thinking about interoperability last week I realised that in developing anti-RESTful works-despite-the-web applications, I quite possibly did human progress more harm than good. I wasn’t the worst, by a long chalk, but it made me feel baneful.

Toby provided a tonic: How do you know when you’re solving the wrong problem? When your solution involves a 133 page standard with a section entitled “Human Task Behavior and State Transitions”, just to allow a system to give tasks to people.

I had planned to co-author a number of posts on CrystalEye with Nick Day, starting with the basic functionality in the web interface and moving on to the features in the new Atom feed. As things turned out Nick is rather busy with other things, the data archiving stuff caught everyone’s attention and my well-laid plans ganged (gung?), as aft they do, agly (as Burns might have put it). Consequently I’m going to shove out CrystalEye posts as and when.

The point of this post is simply to demonstrate that Atom’s extensibility provides a way to combine several functionalities in the same feed, with the subtext that this makes it a promising alternative to CMLRSS. I’ve already written how the Atom feed can be used for data harvesting. This is something of a niche feature for a minority, though. The big news about the CrystalEye Atom feed is that it looks good in normal feed readers.

As a demonstration, here’s a CrystalEye CMLRSS feed in my aggregator: –

Text. Nice. Of course, I need a chemistry aggregator (like the one in Bioclipse) to make sense of a chemistry feed, right? Nope. Atom allows HTML content, so as well as including CML enclosures for chemistry-aware aggregators, you can include diagrams: –

To quote PM-R: “Look – chemistry!”
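
If you’re curious what such an entry carries under the hood, here’s a rough sketch that builds one with the standard DOM APIs – an HTML content element for the diagram alongside a CML enclosure link. The URLs, media type and class name are made up for illustration; this isn’t the exact markup CrystalEye emits: –

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Sketch only: an Atom entry carrying both human-readable HTML content and a
// machine-readable CML enclosure.
public class AtomEntryExample {
    private static final String ATOM = "http://www.w3.org/2005/Atom";

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element entry = doc.createElementNS(ATOM, "entry");
        doc.appendChild(entry);

        Element title = doc.createElementNS(ATOM, "title");
        title.setTextContent("Example structure");
        entry.appendChild(title);

        // HTML content: any feed reader can render the 2D diagram as an <img>
        Element content = doc.createElementNS(ATOM, "content");
        content.setAttribute("type", "html");
        content.setTextContent("<img src=\"http://example.org/structure.png\"/>");
        entry.appendChild(content);

        // CML enclosure: chemistry-aware aggregators can fetch the raw data
        Element link = doc.createElementNS(ATOM, "link");
        link.setAttribute("rel", "enclosure");
        link.setAttribute("type", "chemical/x-cml");
        link.setAttribute("href", "http://example.org/structure.cml");
        entry.appendChild(link);

        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.INDENT, "yes");
        t.transform(new DOMSource(doc), new StreamResult(System.out));
    }
}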

Conditional GET in Restlet

November 5, 2007

This is an extension to an old-but-useful post on implementing conditional GET in Java. I’ve been using the Restlet library more and more, and had some problems working out how to implement conditional GET, so here’s a brief recipe.


import org.restlet.Client;
import org.restlet.data.Method;
import org.restlet.data.Protocol;
import org.restlet.data.Request;
import org.restlet.data.Response;
import org.restlet.data.Status;

...
// lastModified is the java.util.Date taken from the Last-Modified header of a
// previous response (or stored alongside your cached copy)
Request get = new Request(Method.GET, "http://www.example.com/");
// sets the If-Modified-Since header on the outgoing request
get.getConditions().setModifiedSince(lastModified);
Client client = new Client(Protocol.HTTP);
Response res = client.handle(get);
// a 304 (Not Modified) comes back as Status.REDIRECTION_NOT_MODIFIED,
// in which case the cached copy is still good
boolean notModified = Status.REDIRECTION_NOT_MODIFIED.equals(res.getStatus());

While I was working in the real world with Nick on the Atom feeds and harvester for CrystalEye, it seems they became an issue of some contention in the blogosphere. So I’m using this post to lay out why we implemented harvesting this way. These are in strict order of when they occurred to me, and I may well be wrong about one or all of them – I haven’t run benchmarks, because getting things working is more important than being right.

This was the quickest way of offering a complete harvest

Big files would be a pain for the server. Our version of Apache uses a thread pool approach, so for the server’s sake I’m more concerned about clients occupying connections for a long time than I am about the bandwidth. The Atom docs can be compressed on the fly to reduce the bandwidth, and after the first rush as people fill their CrystalEye caches, we’ll hopefully be serving 304s most of the time.

Incremental harvest is a requirement for data repositories, and the “web-way” is to do it through the uniform interface (HTTP), and connected resources.

We don’t have the resource to provide DVDs of content for everyone who wants the data. Or turning that around – we hope more people will want the data than we have resource to provide for. This isn’t about the cost of a DVD, or the cost of postage, it’s about manpower, which costs orders of magnitude more than bits of plastic and stamps.

I’ve particularly valued Andrew Dalke’s input on this subject (and I’d love to kick off a discussion on the idea of versioning in CrystalEye, but I don’t have time right now): –

However, I would suggest that the experience with GenBank and other bioinformatics data sets, as well as PubChem, has been that some sort of bulk download is useful. As a consumer of such data I prefer fetching the bulk data for my own use. It makes more efficient bandwidth use (vs. larger numbers of GET requests, even with HTTP 1.1 pipelining), it compresses better, I’m more certain about internal integrity, and I can more quickly get up and working because I can just point an ftp or similar client at it. When I see a data provider which requires scraping or record-by-record retrieval I feel they don’t care as much about letting others play in their garden.

(Andrew Dalke)

… and earlier …

… using a system like Amazon’s S3 makes it easy to distribute the data, and cost about US $20 for the bandwidth costs of a 100GB download. (You would need to use multiple files because Amazon has a 5GB cap on file size.) Using S3 would not affect your systems at all, except for the one-shot upload time and the time it would take to put such a system into place.

(Andrew Dalke)

Completely fair points. I’ll certainly look at implementing a system to offer access through S3, although everyone might have to be even more patient than they have been for these Atom feeds. We do care about making this data available – compare the slight technical difficulty of implementing an Atom harvester with the time and effort it’s taken Nick to implement and maintain spiders that get this data from the publishers in order to make it more widely available!

One of the features of the CrystalEye Atom feeds is that they can be used for harvesting data from the system. This is not a feature of Atom syndication itself, but of a proposed standard extension (RFC 5005). So what does it look like?

RFC 5005 specifies three different types of historical feed; at the moment we’re only interested in “Archived feeds”. An archived feed document must include an element like this: –

<fh:archive xmlns:fh="http://purl.org/syndication/history/1.0"/>

Basic harvesting is extremely simple: get hold of the latest feed document from http://wwmm.ch.cam.ac.uk/crystaleye/feed/atom/feed.xml and iterate through its entries. Each entry contains (amongst other things) a unique identifier (a URN UUID) and a link to the CML file: –

<entry>
  ...
  <id>urn:uuid:bedc0edd-fab1-4e12-9d45-7ab23aaa02d5</id>
  <updated>2007-10-15T17:25:53Z</updated>
  <link rel="enclosure" href="..."/>
  ...
</entry>

So getting the data is just a matter of doing a little XPath or DOM descent and using the link href to GET the data. When you’ve got all the entries, you need to follow a link to the previous (next oldest) feed document in the archive, encoded like this: –

<link rel="prev-archive" href="..."/>

(This ‘prev-archive’ rel is the special sauce added by RFC 5005.) Incremental harvesting is done by the same mechanism, but with a couple of extra bells and whistles to minimize bandwidth and redundant downloads. There are three ways you might do this (a sketch of the second and third follows the list): –

  • The first way is to keep track of all the entry IDs you’ve seen, and to stop when you see an entry you’ve already seen.
  • The easiest way is to keep track of the time you last harvested, and add an If-Modified-Since header to the HTTP requests when you harvest – when you receive a 304 (Not Modified) in return, you’ve finished the increment.
  • The most thorough way is to keep track of the ETag header returned with each file, and use it in the If-None-Match header in your incremental harvest. Again, this will return 304 (Not Modified) whenever your copy is good.
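
As a rough illustration of the second and third approaches, here’s a sketch in plain Java (HttpURLConnection plus a little XPath). The date and ETag bookkeeping is simplified and the entry-downloading step is elided – this isn’t the code the real harvester uses: –

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

// Sketch of an incremental harvest: walk the archive backwards via the
// prev-archive links, stopping as soon as the server answers 304.
public class IncrementalHarvestSketch {
    public static void main(String[] args) throws Exception {
        String feedUrl = "http://wwmm.ch.cam.ac.uk/crystaleye/feed/atom/feed.xml";
        String lastHarvest = "Mon, 15 Oct 2007 17:25:53 GMT"; // saved from the previous run
        XPath xpath = XPathFactory.newInstance().newXPath();

        while (feedUrl != null) {
            HttpURLConnection conn = (HttpURLConnection) new URL(feedUrl).openConnection();
            conn.setRequestProperty("If-Modified-Since", lastHarvest);
            // If an ETag was stored for this document last time, send it too:
            // conn.setRequestProperty("If-None-Match", storedEtag);

            if (conn.getResponseCode() == HttpURLConnection.HTTP_NOT_MODIFIED) {
                break;            // 304: nothing new beyond this point, increment finished
            }

            InputStream in = conn.getInputStream();
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse(in);
            in.close();

            // Pull out each entry's CML link and GET it (details elided), e.g.
            // xpath.evaluate("//*[local-name()='entry']/*[local-name()='link']/@href", doc)

            // Follow the RFC 5005 prev-archive link to the next-oldest document
            feedUrl = xpath.evaluate(
                    "//*[local-name()='link'][@rel='prev-archive']/@href", doc);
            if (feedUrl.length() == 0) {
                feedUrl = null;   // reached the start of the archive
            }

            Thread.sleep(500);    // be a friendly client: pause between requests
        }
    }
}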

Implementing a harvester

Atom archiving is easy to code against in any language with decent HTTP and XML support. As an example, I’ve written a Java harvester (binary, source). The source builds with Maven2. The binary can be run using:


java -jar crystaleye-harvester.jar [directory to stick data in]

Letting this rip for a full harvest will take a while, and will take up ~10G of space (although less bandwidth since the content is compressed).

Being a friendly client

First and foremost, please do not multi-thread your requests.

Please put a little delay in between requests. A few hundred milliseconds should be enough; the sample harvester uses 500 ms, which should be plenty.

If you send an HTTP header “Accept-Encoding: gzip,deflate”, CrystalEye will send the content compressed (and you’ll need to gunzip it at the client end). This can save a lot of bandwidth, which helps.
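
Handling this in a client is only a couple of lines – a minimal sketch, assuming plain HttpURLConnection (a careful client would also handle the deflate case): –

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.zip.GZIPInputStream;

// Sketch only: ask for compressed content and gunzip it at the client end.
public class GzipFetch {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://wwmm.ch.cam.ac.uk/crystaleye/feed/atom/feed.xml");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Accept-Encoding", "gzip,deflate");

        InputStream in = conn.getInputStream();
        // Only gunzip if the server actually compressed the response
        if ("gzip".equalsIgnoreCase(conn.getContentEncoding())) {
            in = new GZIPInputStream(in);
        }
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, "UTF-8"));
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
        }
        reader.close();
    }
}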