Using the Crystaleye Atom feed

November 5, 2007

One of the features of the CrystalEye Atom feeds is that they can be used for harvesting data from the system. This is not a feature of Atom syndication itself, but of a proposed standard extension (RFC 5005). So what does it look like?

RFC 5005 specifies three different types of historical feed; at the moment we’re only interested in “Archived feeds”. An archived feed document must include an element like this: –


Basic harvesting is extremely simple: get hold of the latest feed document from http://wwmm.ch.cam.ac.uk/crystaleye/feed/atom/feed.xml and iterate through the entries. Each entry contains (amongst other things) a unique identifier (a URN UUID) and a link to the CML file: –

...
<id>urn:uuid:bedc0edd-fab1-4e12-9d45-7ab23aaa02d5</id>
<updated>2007-10-15T17:25:53Z</updated>
...
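Pulling the identifier and the link out of a feed document might look like the following Java sketch. It is only an illustration, not the code of the sample harvester: the class name `AtomEntryReader` is made up, and treating the first link in each entry as the CML file is an assumption (if an entry carries several links you would need to filter on `rel` or `type`).

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class AtomEntryReader {
    static final String ATOM_NS = "http://www.w3.org/2005/Atom";

    /** Maps each entry's ID to the href of its first link, in document order. */
    static Map<String, String> readEntries(String feedXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // Atom elements are namespaced
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(feedXml)));

            Map<String, String> entries = new LinkedHashMap<>();
            NodeList nodes = doc.getElementsByTagNameNS(ATOM_NS, "entry");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element entry = (Element) nodes.item(i);
                NodeList ids = entry.getElementsByTagNameNS(ATOM_NS, "id");
                NodeList links = entry.getElementsByTagNameNS(ATOM_NS, "link");
                if (ids.getLength() == 0 || links.getLength() == 0) {
                    continue; // skip entries that lack an ID or a link
                }
                String id = ids.item(0).getTextContent().trim();
                // Assumption: the first link in the entry points at the CML file.
                String href = ((Element) links.item(0)).getAttribute("href");
                entries.put(id, href);
            }
            return entries;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse feed document", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample document; the href is a placeholder, not a real CrystalEye URL.
        String sample = "<feed xmlns='http://www.w3.org/2005/Atom'>"
                + "<entry><id>urn:uuid:bedc0edd-fab1-4e12-9d45-7ab23aaa02d5</id>"
                + "<link href='http://example.org/structure.cml'/></entry></feed>";
        readEntries(sample).forEach((id, href) -> System.out.println(id + " -> " + href));
    }
}
```

Each href in the returned map can then be fetched with an ordinary HTTP GET.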

So getting the data is just a matter of doing a little XPath or DOM descent and using the link href to GET the data. When you’ve got all the entries, you need to follow a link to the previous (next oldest) feed document in the archive, encoded like this: –


(This ‘prev-archive’ rel is the special sauce added by RFC5005). Incremental harvesting is done by the same mechanism, but with a couple of extra bells and whistles to minimize bandwidth and redundant downloads. There are three ways you might do this: –

  • The first way is to keep track of all the entry IDs you’ve seen, and to stop when you see an entry you’ve already seen.
  • The easiest way is to keep track of the time you last harvested, and add an If-Modified-Since header to the HTTP requests when you harvest – when you receive a 304 (Not Modified) in return, you’ve finished the increment.
  • The most thorough way is to keep track of the ETag header returned with each file, and use it in the If-None-Match header in your incremental harvest. Again, this will return 304 (Not Modified) whenever your copy is good.
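The pieces of an incremental harvest can be sketched in Java roughly as follows. This is an illustration under stated assumptions rather than the sample harvester's code: the class and method names are invented, the feed is assumed to list entries newest first (so stopping at the first known ID is safe), and only feed-level links are searched for the `prev-archive` rel.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class IncrementalHarvest {
    static final String ATOM_NS = "http://www.w3.org/2005/Atom";

    /** Href of the feed-level prev-archive link, or null if the document has none. */
    static String prevArchiveHref(String feedXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Element feed = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(feedXml))).getDocumentElement();
            // Only look at direct children of the feed, so links inside entries are ignored.
            for (Node n = feed.getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n instanceof Element
                        && ATOM_NS.equals(n.getNamespaceURI())
                        && "link".equals(n.getLocalName())
                        && "prev-archive".equals(((Element) n).getAttribute("rel"))) {
                    return ((Element) n).getAttribute("href");
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse feed document", e);
        }
    }

    /** First way: collect IDs until one that has already been harvested turns up. */
    static List<String> unseenIds(List<String> idsInFeedOrder, Set<String> seen) {
        List<String> fresh = new ArrayList<>();
        for (String id : idsInFeedOrder) {
            if (seen.contains(id)) {
                break; // assumption: everything after this point is older and already held
            }
            fresh.add(id);
        }
        return fresh;
    }

    /** Second and third ways: prepare a GET that lets the server answer 304 (Not Modified). */
    static HttpURLConnection conditionalGet(String url, long lastHarvestMillis, String etag) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            if (lastHarvestMillis > 0) {
                conn.setIfModifiedSince(lastHarvestMillis); // sent as If-Modified-Since
            }
            if (etag != null) {
                conn.setRequestProperty("If-None-Match", etag);
            }
            return conn; // nothing is sent until the caller reads the response
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A caller would then check whether `conn.getResponseCode()` equals `HttpURLConnection.HTTP_NOT_MODIFIED` and stop the increment if it does; otherwise it reads the document, processes the entries, and moves on to `prevArchiveHref`.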

Implementing a harvester

Atom archiving is easy to code to in any language with decent HTTP and XML support. As an example, I’ve written a Java harvester (binary, source). The source builds with Maven2. The binary can be run using


java -jar crystaleye-harvester.jar [directory to stick data in]

Letting this rip for a full harvest will take a while, and will take up ~10 GB of disk space (although less bandwidth, since the content is compressed in transit).

Being a friendly client

First and foremost, please do not multi-thread your requests.

Please put a little delay in between requests. A few hundred milliseconds should be enough; the sample harvester uses 500 ms, which should be as much as we need.

If you send an HTTP header “Accept-Encoding: gzip,deflate”, CrystalEye will send the content compressed (and you’ll need to gunzip it at the client end). This can save a lot of bandwidth, which helps.
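Put together, a well-behaved fetch could look something like this Java sketch. It is a minimal example under assumptions, not the sample harvester itself: the class name `FriendlyClient` and its methods are invented, and the 500 ms pause simply mirrors the figure quoted above.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.zip.GZIPInputStream;
import java.util.zip.InflaterInputStream;

class FriendlyClient {
    static final long DELAY_MILLIS = 500; // pause after every request

    /** Reads the whole body, undoing gzip or deflate if the server applied it. */
    static byte[] readBody(InputStream raw, String contentEncoding) {
        try (InputStream in = decode(raw, contentEncoding)) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            for (int n; (n = in.read(buf)) != -1; ) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static InputStream decode(InputStream raw, String encoding) throws IOException {
        if ("gzip".equalsIgnoreCase(encoding)) {
            return new GZIPInputStream(raw);
        }
        if ("deflate".equalsIgnoreCase(encoding)) {
            return new InflaterInputStream(raw); // zlib-wrapped deflate
        }
        return raw; // no Content-Encoding: body is already plain
    }

    /** One request, with compression requested, followed by the delay. Call it from a single thread. */
    static byte[] politeGet(String url) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setRequestProperty("Accept-Encoding", "gzip,deflate");
            InputStream raw = conn.getInputStream();
            byte[] body = readBody(raw, conn.getContentEncoding());
            Thread.sleep(DELAY_MILLIS);
            return body;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            throw new IllegalStateException("Interrupted while pausing between requests", e);
        }
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.out.println("usage: java FriendlyClient <url>...");
            return;
        }
        for (String url : args) { // strictly one request at a time
            System.out.println(url + ": " + politeGet(url).length + " bytes");
        }
    }
}
```

Calling `politeGet` from a plain sequential loop keeps the requests single-threaded and spaced out.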


2 Responses to “Using the Crystaleye Atom feed”


  1. […] Since Atom may not be familiar to everyone Jim Downing has written two expositions on his blog. These explain his thinking of why a series of medium-sized chunks is a better way to support the download of CrystalEye than one or two giant files. Note that he is working on making available some Java code to help with the download – this should do the caching and remember where you left off. If you have technical questions I suggest you leave them on Jim’s blog. If you want to help the project in general use my blog. If you want to hurry the process along by mailing Jim, please refrain. He works very well on occasional beers (he is a brewing aficionado). Using the Crystaleye Atom feed – November 5th, 2007 Incremental harvesting is done by [the same mechanism], but with a couple of extra bells and whistles to minimize bandwidth and redundant downloads. There are three ways you might do this: – […]


  2. […] The point of this post is simply to demonstrate that Atom’s extensibility provides a way to combine several functionalities in the same feed, with the subtext that this makes it a promising alternative to CMLRSS. I’ve already written how the Atom feed can be used for data harvesting. This is something of a niche feature for a minority, though. The big news about the CrystalEye Atom feed is that it looks good in normal feed readers. […]

