CrystalEye is a repository of crystallographic data. It's built by software written by Nick Day that uses parts of Jumbo and the CDK. It isn't feasible for Nick to curate all this data (>100,000 structures) manually, and software bugs are a fact of life, so errors creep in.

Egon Willighagen and Antony Williams (ChemSpiderMan) have been looking at the CrystalEye data, and have used their blogs (as well as commenting on PM-R’s) to feed issues back. This is a great example of community data checking. Antony suggested that we implement a “Post a comment” feature on each page to make feedback easier. This is a great idea, so we had a quick think about it and propose a web2.0 alternative mechanism: Connotea.

To report a problem in CrystalEye, simply bookmark an example of the problem with the tag “crystaleyeproblem”, using the Description field to describe the problem. All the problems will appear on the tag feed.

When we fix the problem we’ll add the tag “crystaleyefixed” to the same bookmark. If you subscribe to this feed, you’ll know to remove the crystaleyeproblem tag.
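For anyone wanting to watch for reports programmatically, the tag-feed workflow above can be consumed with a few lines of Python. The feed URL follows Connotea's usual /rss/tag/&lt;tag&gt; pattern and is an assumption, not a documented CrystalEye interface; the sample feed below stands in for a live fetch:

```python
# Sketch: consuming the Connotea tag feed for reported CrystalEye problems.
# The feed URL is an assumed Connotea convention, not a documented endpoint.
import xml.etree.ElementTree as ET

PROBLEM_FEED = "http://www.connotea.org/rss/tag/crystaleyeproblem"  # assumed URL

def problem_reports(rss_xml):
    """Yield (title, link, description) for each item in an RSS feed."""
    root = ET.fromstring(rss_xml)
    # Iterate all elements so both plain and namespaced <item> tags match.
    for item in root.iter():
        if item.tag.endswith("item"):
            fields = {}
            for child in item:
                fields[child.tag.split("}")[-1]] = (child.text or "").strip()
            yield (fields.get("title", ""), fields.get("link", ""),
                   fields.get("description", ""))

# A minimal sample feed, standing in for fetching PROBLEM_FEED with urllib:
sample = """<rss version="2.0"><channel>
  <item>
    <title>Wrong bond order in structure XYZ</title>
    <link>http://example.org/crystaleye/summary/xyz.html</link>
    <description>The C1-C2 bond should be aromatic.</description>
  </item>
</channel></rss>"""

for title, link, desc in problem_reports(sample):
    print(title, "->", link)
```

The same function would work unchanged on the "crystaleyefixed" feed, so a reporter could diff the two tags to find their still-open reports.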

In the fullness of time, we're planning to use Connotea tags to annotate structures where full processing hasn't been possible (bond orders that couldn't be calculated, charges, etc.).

Jonathan Gray has written some notes from the barcamp I attended (part of) on Saturday, about creating a web channel for public sector information (PSI) requests.

I had a few areas of interest that got touched on in discussion, but I’m not entirely sure where it got us, since I had to leave early to go ice-skating outside the Natural History Museum 🙂

Where’s my license?

One of the pleasant surprises for me (and several others at the barcamp) was that most PSI is already available through an open license, the PSI license. To obtain the license, one simply visits OPSI's website and fills in an online form. The license is (AFAIK) eternal and covers all PSI. Wonderful, but not hitherto useful, because a license you don't know about is as much use as no license. To the Open Source techies in the room the answer was obvious: explicit information about the copyright and licensing must be included with the data whenever it is distributed.

There are strong analogies with Peter M-R's campaign to promote licensing and copyright clarity in the publication of scientific data, but in this case it might actually happen, since OPSI have a vested interest in promoting awareness of their open licenses.

Non-commercial licenses considered harmful

What license you can obtain from OPSI for PSI depends on two things: 1) whether the information was collected as part of 'core' government activities, and 2) whether you are a commercial entity. The data is only freely reusable if the answers are 'yes' and 'no' respectively. On the first question, Michael Cross of the Free Our Data campaign made a good point: should the government be in the business of creating value-added data products at all?

The second question is the one that got my attention. The PSI license (the open one) is an attribution non-commercial type license that crucially allows re-use. The problems with applying a non-commercial constraint were forcefully made by several of the people in the room who have spent their spare time setting up awesome sites using government data, but would need to negotiate a license if they wanted to become commercial in order to offset their costs. Another good example here is Non-Governmental Organizations who are often also commercial entities, but are an essential part of the PSI infrastructure. OPSI gave the impression that they are approachable on these issues, but having a clear license would promote reuse far better than approachability.

I've been persuaded by Rufus on this issue – a sharealike license without restrictions on who can reuse the data would ensure more freedoms than an attribution license with a non-commercial restriction.

Finally, if this stuff floats your boat, keep a look out for the new Freedom Of Information request site – we got to see a beta and it looks like it's going to be a lot of fun!

Update More coverage on the barcamp from Michael Cross.

I’ve been in a reflective mood about CrystalEye over the last few days. In repository-land where I spend part of my time, OAI-PMH is regarded as a really simple way of getting data from repositories, and approaches like Atom are often regarded as insufficiently featured. So I’ll admit I was a bit surprised about the negative reaction provoked by the idea of CrystalEye only providing incremental data feeds.
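To make the comparison concrete, the incremental-feed approach amounts to very little client code. This is a sketch, not CrystalEye's actual client: the feed layout follows the Atom spec (RFC 4287), and the checkpoint logic – remembering the `updated` stamp of the last entry processed – is my own illustration:

```python
# Sketch: incremental harvesting from an Atom feed, the lightweight
# alternative to OAI-PMH discussed above. Keep the <updated> timestamp of
# the last entry you processed and skip anything older on the next poll.
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

def new_entries(atom_xml, last_seen):
    """Return (id, updated) pairs for entries newer than last_seen."""
    feed = ET.fromstring(atom_xml)
    fresh = []
    for entry in feed.findall(ATOM + "entry"):
        updated = entry.findtext(ATOM + "updated")
        # RFC 3339 timestamps in the same zone compare correctly as strings.
        if updated > last_seen:
            fresh.append((entry.findtext(ATOM + "id"), updated))
    return fresh

# A two-entry sample feed, standing in for a live fetch:
sample = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><id>urn:structure:1</id><updated>2007-10-01T09:00:00Z</updated></entry>
  <entry><id>urn:structure:2</id><updated>2007-10-03T09:00:00Z</updated></entry>
</feed>"""

print(new_entries(sample, "2007-10-02T00:00:00Z"))
```

A consumer polling like this gets everything published since their last visit, without the repository needing to implement OAI-PMH's resumption tokens and datestamp semantics.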

The “give me a big bundle of your raw data” request was one I’d heard before, from Rufus Pollock at OKFN, when I was working on the DSpace@Cambridge project, a topic he returned to yesterday, arguing that data projects should put making raw data available as a higher priority than developing “Shiny Front Ends” (SFE).

I agree on the whole. In a previous life working on public sector information systems I often had extremely frustrating conversations with data providers who didn’t see anything wrong in placing access restrictions on data they claimed was publicly available (usually the restriction was that any other gov / NGO could see the data but the public they served couldn’t).

When it comes to CrystalEye we're not talking about access restriction; we're talking about the form in which the data is made available, and the effort needed to obtain it. This is a familiar motif:

  • The government has data that's available if you ask in person, but that's more effort than we'd like to expend; we'd like it to be downloadable
  • The publishers make (some) publications available as PDF, but analyzing the science requires manual effort; we'd like them to publish the science in a form that's easier to process and analyze
  • The publishers make (some) data available from their websites, but it's not easy to crawl the websites to get hold of it – it would be great if they gave us feeds of their latest data
  • CrystalEye makes CML data available, but potential users would prefer us to bundle it up onto DVDs and mail it to them.

Hold on, bit of a role reversal at the end there! The boot's on the other foot. We have a reasonable reply: we're a publicly funded research group who happen to believe in Open Data, not a publicly funded data provider. We have to prioritise our resources accordingly, but I still think the principle of providing open access to the raw data applies.

You’ll have to excuse a non-chemist stretching a metaphor: There’s an activation energy between licensing data as open, and making it easy to access and use. CrystalEye has made me wonder how much of this energy has to come from the provider, and how much from the consumer.

Agents & Eyeballs

October 2, 2007

Peter has mentioned that we've been writing a bid to the JISC Capital Call. Well, it's in, but no thanks at all to OpenOffice, NeoOffice or Word. I've managed to avoid using word processors for most of my working life, and writing and collating this bid has been a pointed reminder why. Word 2004 for Mac wouldn't read Word 2003 files at all, and only read bits of Word XP, Word 95 etc. files. I did most of the work in OpenOffice (on Linux; NeoOffice on Mac), which did its utmost to make Word look good by crashing regularly.

I wonder if any CSS implementations are up to doing paragraph numbering and pagination on HTML? Otherwise I'm going to have to re-learn LaTeX next time!

Thanks are due, though to those who commented on Peter’s blog, or wrote posts of their own in response. Although the JISC bids are largely marked on the quality of the bid itself, no-one who looks can doubt the community engagement and vitality, which were important components in the call for funding. So thanks to you all!

Hopefully I’ll get to write more about the project particulars in due course. We obviously don’t want to get scooped, but on the other hand this is interesting work that I’ve wanted to look at for a while, so we’ll look for other funding if we’re not successful with JISC.

Laura is chief techie for AlertMe – a startup taking another bash at home automation. The idea looks cool, and although a little outside my immediate bailiwick there are a couple of overlaps:

  • can the platform be opened up to partner service providers?
  • how can you manage the security and data protection issues around doing so?

Matt repeated the premise of microformats – that content authors won't do "big" SW (by which he means RDF, SPARQL and their ilk) – extended this to scientists, and showed us the simple examples used in the Ensembl gene browser. Matt emphasised the benefits of de facto standardisation (rather than the W3C-style approach).

There was a very positive discussion about GRDDL afterwards, with quite a bit of emphasis on how GRDDL allows you to disconnect the microformat markup from the semantics of the data. I'm a bit worried by this – it would mean that semantic web specialists, rather than the domain specialists, ended up doing the job of standardising the data model. It would be better to keep on standardising in the microformat domain and just use GRDDL as a bridge to the RDF world. That way the data is still standard and still useful without having to cross over to RDF.
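The "GRDDL as a bridge" idea above can be sketched concretely. Real GRDDL names an XSLT transform in the document head and applies it to produce RDF; this Python stand-in, the sample markup, and the subject identifier are all illustrative assumptions (the vCard vocabulary URI is W3C's real one):

```python
# Sketch of the GRDDL idea: lift microformat-style class attributes out of
# markup and into RDF triples. In real GRDDL the transform is an XSLT
# stylesheet linked from the document; this function is an illustrative
# stand-in, and "_:card" is an arbitrary blank-node label.
import xml.etree.ElementTree as ET

VCARD = "http://www.w3.org/2006/vcard/ns#"

def lift(doc, subject="_:card"):
    """Map hCard-ish class attributes to (subject, predicate, object) triples."""
    triples = []
    for el in ET.fromstring(doc).iter():
        cls = el.get("class")
        if cls in ("fn", "org"):
            triples.append((subject, VCARD + cls, el.text))
    return triples

xhtml = """<div>
  <span class="fn">Example Person</span>
  <span class="org">Unilever Centre</span>
</div>"""

for s, p, o in lift(xhtml):
    print(s, "<%s>" % p, '"%s"' % o)
```

The point of the post stands out here: the interesting decisions (which class names to recognise, which vocabulary to map them to) belong to the microformat's domain community; the bridge itself is mechanical.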

SPECTRa on Sourceforge

May 24, 2007

I’ve just finished moving the SPECTRa code to the spectra-chem site. The technical web sites will be moving over in due course, and I’ll be switching issue tracking to the sourceforge trackers too.

Wow, I hope the US gov goes ahead with this plan to put all data from public science into open repositories. A short article can't, of course, capture either the enormity of the difficulties involved or the incredible benefits of success.

SPECTRaT is go!

March 16, 2007

The application for funding for SPECTRa-T (Theses) from JISC has been successful. SPECTRa-T will kick off at the start of April; it will look at extracting data from chemistry theses, depositing both theses and data in digital repositories, and will collaborate with researchers on the SciBorg project for natural language processing.

Congratulations to all the SPECTRa team!

This was almost inevitable, with the benefit of hindsight. It seems that the phrase “Open Data” is now being used in the context of transparency / translucency of attention data.

Do we stick to our guns or find a more specific phrase to use?