Atom History vs Large Archive Files

November 5, 2007

While I was working in the real world with Nick on the Atom feeds and harvester for CrystalEye, it seems they became an issue of some contention in the blogosphere, so I’m using this post to lay out why we implemented harvesting this way. The reasons below are in the order they occurred to me, and I may well be wrong about one or all of them: I haven’t run benchmarks, since getting things working is more important than being right.

This was the quickest way of offering a complete harvest

Big files would be a pain for the server. Our version of Apache uses a thread-pool approach, so for the server’s sake I’m more concerned about clients occupying connections for a long time than I am about the bandwidth. The Atom documents can be compressed on the fly to reduce the bandwidth, and after the first rush as people fill their CrystalEye caches, we’ll hopefully be serving 304s most of the time.
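The 304 workflow above can be sketched as a pair of small helpers. This is a minimal sketch, assuming the server emits `ETag` and `Last-Modified` validators; the function names and the shape of `cache_entry` are illustrative, not anything from CrystalEye itself:

```python
def conditional_headers(cache_entry):
    """Build validators for a conditional GET from a cached response.

    `cache_entry` is an illustrative dict, e.g.
    {"etag": '"abc123"', "last_modified": "Mon, 05 Nov 2007 10:00:00 GMT"}.
    """
    headers = {"Accept-Encoding": "gzip"}  # let the server compress on the fly
    if cache_entry.get("etag"):
        headers["If-None-Match"] = cache_entry["etag"]
    if cache_entry.get("last_modified"):
        headers["If-Modified-Since"] = cache_entry["last_modified"]
    return headers


def pick_body(status, cached_body, fresh_body=None):
    """A 304 means our cached copy is still current; a 200 replaces it."""
    if status == 304:
        return cached_body
    if status == 200:
        return fresh_body
    raise IOError("unexpected HTTP status %d" % status)
```

Once every client sends these validators, a steady-state harvest is mostly 304s: tiny responses, short-lived connections.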

Incremental harvest is a requirement for data repositories, and the “web way” is to do it through the uniform interface (HTTP) and connected resources.
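One concrete shape for those connected resources is RFC 5005 (Feed Paging and Archiving), where each archive document links to the next-older one with `rel="prev-archive"`. A minimal sketch of walking that chain, assuming the feeds carry such links (`fetch` is a placeholder for whatever HTTP client you use):

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"


def prev_archive(feed_xml):
    """Return the href of the next-older archive document, or None at the end."""
    root = ET.fromstring(feed_xml)
    for link in root.findall(ATOM + "link"):
        if link.get("rel") == "prev-archive":
            return link.get("href")
    return None


def harvest(url, fetch):
    """Walk the archive chain from the newest document back to the start."""
    while url is not None:
        feed_xml = fetch(url)  # placeholder: any conditional-GET-aware client
        yield feed_xml
        url = prev_archive(feed_xml)
```

A first-time harvester walks the whole chain; after that it only needs to re-fetch the newest document and any archives it hasn’t seen.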

We don’t have the resources to provide DVDs of content for everyone who wants the data. Or, turning that around: we hope more people will want the data than we have the resources to provide for. This isn’t about the cost of a DVD or the cost of postage; it’s about manpower, which costs orders of magnitude more than bits of plastic and stamps.

I’ve particularly valued Andrew Dalke’s input on this subject (and I’d love to kick off a discussion on the idea of versioning in CrystalEye, but I don’t have time right now):

However, I would suggest that the experience with GenBank and other bioinformatics data sets, as well as PubChem, has been that some sort of bulk download is useful. As a consumer of such data I prefer fetching the bulk data for my own use. It makes more efficient bandwidth use (vs. larger numbers of GET requests, even with HTTP 1.1 pipelining), it compresses better, I’m more certain about internal integrity, and I can more quickly get up and working because I can just point an ftp or similar client at it. When I see a data provider which requires scraping or record-by-record retrieval I feel they don’t care as much about letting others play in their garden.

(Andrew Dalke)

… and earlier …

… using a system like Amazon’s S3 makes it easy to distribute the data, and cost about US $20 for the bandwidth costs of a 100GB download. (You would need to use multiple files because Amazon has a 5GB cap on file size.) Using S3 would not affect your systems at all, except for the one-shot upload time and the time it would take to put such a system into place.

(Andrew Dalke)

Completely fair points. I’ll certainly look at implementing a system to offer access through S3, although everyone might have to be even more patient than they have been for these Atom feeds. We do care about making this data available – compare the slight technical difficulty of implementing an Atom harvester with the time and effort it’s taken Nick to implement and maintain spiders that get this data from the publishers in order to make it more widely available!


