CRIG Podcasts

November 30, 2007

Near the start of November, I was involved in a series of chats organized by the JISC CRIG support project, aiming to serve as an introduction to various aspects of repository interoperability and to look at possible areas for standardisation, and areas that might benefit from further research. The chats were in the form of conference calls, which were recorded and made into podcasts. They’re now available.

In the GET and PUT chat, Richard and I resurrect a long running discussion we have IRL about granularity and various aspects of resource description and, amongst other things, discuss the potential impact of OAI-ORE and SWORD. The search chat, led by Martin Morrey of Intrallect, was very informative; it has a bit of background on Z39.50 and the birth of SRW/U, which happened before I was involved with repositories. The last chat I was involved in was the Identity chat, the main part of which was postponed, but as it stands it is a helpful introduction to the FAR project. The full chat went ahead yesterday, and was a good discussion covering federated access management, identity management and so on. The audio from that chat will be available in due course.

Java Resource Listing

November 30, 2007

I’ve been working on a SWORD client for SPECTRa for the last day or so, and got a little sidetracked into mavenizing the SWORD Java code, and further sidetracked into refactoring some of it as I went. Part of the SWORD code is a jar including a CLI and a Swing GUI. For maximum convenience the code and dependencies are assembled into a single jar (run using “java -jar …”).

The original author of the code, Neil Taylor (Aberystwyth), has been careful throughout the code to access all resources through the ClassLoader.getResource[AsStream] methods, through the InputStream and URL abstractions. So far so froody. There’s a wrinkle, though – the help system launches the user’s browser with the location of the help index file as an argument, and this is a limitation to the “everything in the jar file” approach – the code needs to be executed from the correct pwd (or passed a parameter) for the help to display correctly.

Most (all?) web browsers are unable to understand the “jar:file:” protocol to get hold of the help pages directly from the jar. Well, I thought, that’s not a problem, I’ll copy the resources out of the jar into a tmp directory and point the browser there. Well, this would work fine, but I hit a snag – there’s no way to list, search for or glob resources through the ClassLoader. I’d have to have an explicit list of all the help resources, which would suck. Sam Adams suggested a solution he used for JNI-InChI: pull the jar file location out of the “jar:file:” URL, then use the java.util.jar.JarFile class to find the relevant entries.

It’s verbose and hacky (in a bad way), but it does allow you to have filesystem-like handling of resources and still distribute as a single executable jar, which is a good thing. Here’s the code, in all its filthy glory: –

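// Needs java.io.*, java.net.URL, java.util.Enumeration, java.util.jar.JarEntry,
// java.util.jar.JarFile, and FileUtils / IOUtils from Apache Commons IO
ClassLoader cl = getClass().getClassLoader();
URL help = cl.getResource("help");
if ("file".equals(help.getProtocol())) {
    // Running from an unpacked classpath: copy the directory directly
    File from = new File(help.toURI());
    FileUtils.copyDirectory(from, helpDir);
} else if ("jar".equals(help.getProtocol())) {
    // Strip between 'jar:file:' and the '!' that ends the jar location
    String url = help.toString();
    String jarLoc = url.substring(9, url.indexOf('!'));
    JarFile jarFile = new JarFile(jarLoc);
    try {
        for (Enumeration<JarEntry> entries = jarFile.entries();
                entries.hasMoreElements();) {
            JarEntry je = entries.nextElement();
            if (je.getName().startsWith("help/")) {
                // Trim the 'help/' off and fix up the file separators
                // (replace, not replaceAll: a "\" replacement breaks on Windows)
                String filename = je.getName().substring(5)
                        .replace('/', File.separatorChar);
                File destination = new File(helpDir, filename);
                File directory = je.isDirectory() ? destination
                        : destination.getParentFile();
                log.debug("Creating " + directory
                        + " and copying resource to " + destination);
                if (!(directory.exists() || directory.mkdirs())) {
                    throw new IOException(
                            "Problem creating temp help directory, couldn't "
                            + "create: " + directory);
                }
                if (!je.isDirectory()) {
                    // The listing is cut off here; this copy step is a
                    // reconstruction (an assumption), not the original code
                    InputStream in = jarFile.getInputStream(je);
                    OutputStream out = new FileOutputStream(destination);
                    try {
                        IOUtils.copy(in, out);
                    } finally {
                        IOUtils.closeQuietly(in);
                        IOUtils.closeQuietly(out);
                    }
                }
            }
        }
    } finally {
        jarFile.close();
    }
}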
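
The two steps Sam suggested (pull the jar location out of the “jar:file:” URL, then list matching entries with JarFile) can also be tried in isolation. The following is only a minimal sketch: the class name JarResources and both method names are made up for illustration, and it assumes the jar path contains no URL-encoded characters such as %20.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

// Hypothetical helper, not part of the SWORD code
public class JarResources {

    /** Extracts the jar's filesystem path from a "jar:file:...!/entry" URL string. */
    static String jarPath(String jarUrl) {
        String prefix = "jar:file:";
        int bang = jarUrl.indexOf('!');
        if (!jarUrl.startsWith(prefix) || bang < 0) {
            throw new IllegalArgumentException("Not a jar:file: URL: " + jarUrl);
        }
        // Assumption: no URL-encoded characters in the path
        return jarUrl.substring(prefix.length(), bang);
    }

    /** Lists the names of all entries in the jar that start with the given prefix. */
    static List<String> list(String jarPath, String prefix) throws IOException {
        List<String> names = new ArrayList<String>();
        JarFile jar = new JarFile(jarPath);
        try {
            for (Enumeration<JarEntry> e = jar.entries(); e.hasMoreElements();) {
                String name = e.nextElement().getName();
                if (name.startsWith(prefix)) {
                    names.add(name);
                }
            }
        } finally {
            jar.close();
        }
        return names;
    }
}
```

Something like JarResources.list(JarResources.jarPath(help.toString()), "help/") would then give the explicit list of help resources that the ClassLoader can’t provide.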

I’ve had a little more success in getting the new location feature of Google Maps Mobile working.

From home I had a location to within 1700m (which is pretty much the whole village). Oh well, perhaps the whole village is supplied by one powerful cell (unlikely, I suspect). No location on the train travelling in. It found my location again when I reached Cambridge station. Again to 1700m. Hmmm. There must be plenty of cells around there. I wondered whether my phone was just sticking to a distant cell, so turned it off and on hoping it would pick up the nearest cell. No location.

Mobile cells don’t transmit location information to the device (a telecoms engineer once explained why to me; I evidently wasn’t sufficiently convinced to remember the explanation!). To get a location you need two things: the cell id and a database of cell ids against locations. I’m only getting occasional locations, which suggests that the database is sparse for Cambridge.

I wonder how Google’s database of cell locations is built up? From the help system:

“Google takes geo-contextual information [from anonymous GPS-readings, etc] and associates this information with the cell at that location to develop a database of cell locations.”

“Anonymous GPS-readings etc”? Location expert Andrew Grill has published an in-depth analysis of the location feature, and did a small experiment to find that Google’s database is built from GMM users with GPS enabled devices. So for me to get decent location information I need someone with a GPS enabled Orange phone running around Cambridge running GMM. Presumably the location will be some kind of average of the observed location-cell id points, so I’ll need a horde of Orange-GPS-GMM folks. Volunteers?
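
To make the guess above concrete, here is a rough sketch of a cell id database built by averaging observed GPS fixes per cell. It is purely illustrative: the class CellLocationDb and its methods are invented for this post, plain averaging of latitude and longitude is my assumption (reasonable only over small areas), and nothing here is known to be how Google actually does it.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a cell id -> location database
public class CellLocationDb {

    // Running totals per cell id: {sum of latitudes, sum of longitudes, count}
    private final Map<String, double[]> sums = new HashMap<>();

    /** Records one GPS fix reported by a phone while connected to the given cell. */
    public void observe(String cellId, double lat, double lon) {
        double[] s = sums.computeIfAbsent(cellId, k -> new double[3]);
        s[0] += lat;
        s[1] += lon;
        s[2] += 1;
    }

    /** Returns {lat, lon} averaged over all fixes for the cell, or null if unknown. */
    public double[] locate(String cellId) {
        double[] s = sums.get(cellId);
        if (s == null) {
            return null; // the "location temporarily unavailable" case
        }
        return new double[] { s[0] / s[2], s[1] / s[2] };
    }
}
```

With a sparse database most lookups return null, which matches what I’m seeing; every extra fix for a cell pulls its estimate towards where GPS users actually go.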

I’m completely unsurprised that the networks didn’t just give Google cell location information, and it seems this kind of social approach to building location services has been tried at least once before, using custom software on the phone.

Couldn’t Google accept user input within GMM to build the database faster? I know where I am (most of the time), and I’d be happy to center GMM on my current location and click “Here I am” to help build the database. I bet I’m not alone.

Location Deflation

November 29, 2007

Exciting! Exciting! Location in gmaps mobile based on cell id! Downloading… (why doesn’t Opera let web servers know what the device is? Anyway…) … OK, and drumroll…

“Your current location is temporarily unavailable”


If my network have hidden their cell ids, I’m changing network.

SPECTRa released

November 28, 2007

Now that a number of niggling bugs have been ironed out, we’ve released a stable version of the SPECTRa tools.

There are prebuilt binaries for spectra-filetool (command line tool helpful for performing batch validation, metadata extraction and conversion of JCAMP-DX, MDL Mol and CIF files), and spectra-sub (flexible web application for depositing chemistry data in repositories). The source code is available from the spectra-chem package, or from Subversion. All of these are available from the spectra-chem SourceForge site.

Mavenites can obtain the libraries (and source code) from the SPECTRa maven repo at The groupId is – browse around for artifact ids and versions.


November 19, 2007

Thinking about interoperability last week I realised that in developing anti-RESTful works-despite-the-web applications, I quite possibly did human progress more harm than good. I wasn’t the worst, by a long chalk, but it made me feel baneful.

Toby provided a tonic: How do you know when you’re solving the wrong problem? When your solution involves a 133 page standard with a section entitled “Human Task Behavior and State Transitions”, just to allow a system to give tasks to people.

Round up 2007-11-16

November 16, 2007

More notables banging the REST drum.

A post by Jon Udell on tiny URLs for web citations, with a good comment from Peter Murray. A persistent redirecting service that automatically caches and preserves content? Throw in some access management and that sounds like a good part of an institutional repository.

The JISC funded eCrystals project began a fortnight ago, and “will establish a solid foundation of crystallography data repositories across an international group of partner sites”. It’s being led by Simon Coles at the UK National Crystallographic Service, and we’re core partners along with UKOLN and the DCC. The project wiki is now available, so stick it in your aggregator to keep up to date.

eCrystals is an exciting opportunity for us to work to see the outcomes of SPECTRa and CrystalEye put into wider use – I’m looking forward to it!

I had planned to co-author a number of posts on CrystalEye with Nick Day, starting with the basic functionality in the web interface and moving on to the features in the new Atom feed. As things turned out Nick is rather busy with other things, the data archiving stuff caught everyone’s attention and my well laid plans ganged (gung?), as aft they do, agly (as Burns might have put it). Consequently I’m going to shove out CrystalEye posts as and when.

The point of this post is simply to demonstrate that Atom’s extensibility provides a way to combine several functionalities in the same feed, with the subtext that this makes it a promising alternative to CMLRSS. I’ve already written how the Atom feed can be used for data harvesting. This is something of a niche feature for a minority, though. The big news about the CrystalEye Atom feed is that it looks good in normal feed readers.

As a demonstration, here’s a CrystalEye CMLRSS feed in my aggregator: –

Text. Nice. Of course, I need a chemistry aggregator (like the one in Bioclipse) to make sense of a chemistry feed, right? Nope. Atom allows HTML content, so as well as including CML enclosures for chemistry aware aggregators, you can include diagrams: –

To quote PM-R: “Look – chemistry!”
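
For the curious, the shape of such an entry can be sketched in a few lines of Java using the JDK’s own XML classes. This is an illustration under assumptions rather than the CrystalEye code: the class name AtomEntrySketch, its parameters and the chemical/x-cml media type are my choices for the example. What it shows is the combination described above: escaped HTML in the content element for ordinary feed readers, plus an enclosure link for chemistry aware aggregators.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical sketch, not the CrystalEye feed generator
public class AtomEntrySketch {

    static final String ATOM = "http://www.w3.org/2005/Atom";

    /** Builds one Atom entry with HTML content and a CML enclosure link. */
    static String entry(String id, String title, String updated,
            String html, String cmlUrl) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        Document doc = dbf.newDocumentBuilder().newDocument();
        Element entry = doc.createElementNS(ATOM, "entry");
        doc.appendChild(entry);
        child(doc, entry, "id", id);
        child(doc, entry, "title", title);
        child(doc, entry, "updated", updated);
        // type="html": the markup is carried as escaped text for feed readers
        Element content = child(doc, entry, "content", html);
        content.setAttribute("type", "html");
        // rel="enclosure": the machine-readable data for chemistry aggregators
        Element link = child(doc, entry, "link", null);
        link.setAttribute("rel", "enclosure");
        link.setAttribute("type", "chemical/x-cml"); // assumed media type
        link.setAttribute("href", cmlUrl);

        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    private static Element child(Document doc, Element parent,
            String name, String text) {
        Element e = doc.createElementNS(ATOM, name);
        if (text != null) {
            e.setTextContent(text);
        }
        parent.appendChild(e);
        return e;
    }
}
```

A normal feed reader renders the HTML (diagram and all) and ignores the enclosure; a chemistry aware one can follow the enclosure link to the CML.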

I’ve been in a reflective mood about CrystalEye over the last few days. In repository-land where I spend part of my time, OAI-PMH is regarded as a really simple way of getting data from repositories, and approaches like Atom are often regarded as insufficiently featured. So I’ll admit I was a bit surprised about the negative reaction provoked by the idea of CrystalEye only providing incremental data feeds.

The “give me a big bundle of your raw data” request was one I’d heard before, from Rufus Pollock at OKFN, when I was working on the DSpace@Cambridge project, a topic he returned to yesterday, arguing that data projects should put making raw data available as a higher priority than developing “Shiny Front Ends” (SFE).

I agree on the whole. In a previous life working on public sector information systems I often had extremely frustrating conversations with data providers who didn’t see anything wrong in placing access restrictions on data they claimed was publicly available (usually the restriction was that any other gov / NGO could see the data but the public they served couldn’t).

When it comes to the issue with CrystalEye we’re not talking about access restriction, we’re talking about the form in which the data is made available, and the effort needed to obtain it. This is a familiar motif: –

  • The government has data that’s available if you ask in person, but that’s more effort than we’d like to expend, we’d like it to be downloadable
  • The publishers make (some) publications available as PDF, but analyzing the science requires manual effort, we’d like them to publish the science in a form that’s easier to process and analyze
  • The publishers make (some) data available from their websites, but it’s not easy to crawl the websites to get hold of it – it would be great if they gave us feeds of their latest data
  • CrystalEye makes CML data available, but potential users would prefer us to bundle it up onto DVDs and mail it to them.

Hold on, bit of a role reversal at the end there! Boot’s on the other foot. We have a reasonable reply; we’re a publicly funded research group who happen to believe in Open Data, not a publicly funded data provider. We have to prioritise our resources accordingly, but I still think the principle of providing open access to the raw data applies.

You’ll have to excuse a non-chemist stretching a metaphor: There’s an activation energy between licensing data as open, and making it easy to access and use. CrystalEye has made me wonder how much of this energy has to come from the provider, and how much from the consumer.