Open Repositories 2007 – Friday: Chemistry special!

January 29, 2007

For me, Friday turned out to be the highlight of the week, and I wish I could have live-blogged it. Unfortunately, I positioned myself too far forward in the hall, where the wireless was weak.

Of the presentations that morning, two were especially relevant to the SPECTRa project, the (hopefully!) upcoming SPECTRa-Theses project and the work of the Murray-Rust group as a whole: Lee Giles talking about ChemXSeer, and the closing keynote from Tony Hey.

ChemXSeer is taking on some really important problems in ChemoInformatics, most particularly the paucity of the data commons in chemistry (their specific area is environmental chemistry, but most of the issues and tools Lee presented looked as if they would transfer to other chemistry specialisms). ChemXSeer has tools for identifying chemical entities (compound and reaction names and so on) in journal papers, ontologies for chemistry (PMR group: is this sounding at all familiar?!), and even tools for extracting useful data from tables and figures in papers. Not simple problems to crack; it’s awesome that they’re taking them on.

The PMR group’s approach to getting chemical data is two-headed: extracting useful chemistry data from sources where it is badly encoded (e.g. in English), and improving the ways data is published in the first place (with the work on the uses of InChI and CML).

I asked Lee how they approached data quality (perhaps hoping they were setting up protocols for CML publishing) – he replied that, pragmatically, they found it best to extract data from the papers and then offer it back to the authors for correction and annotation, rather than set high requirements for deposition. The evening before, I had been at dinner with Peter Sefton (amongst others), who shared a tip on improving quality in theses authored in MS Word: his system periodically shows the author the product of converting their document into PDF / HTML. In his experience authors quickly learn not to override the structural markup with hacky font changes! I think this kind of feedback system would work well with data too, allowing authors to work with familiar creation tools whilst encouraging them to improve the usefulness of their output.

Tony Hey’s presentation was less chemistry specific, but great. He painted an attractive vision of an Open Data future, pointing out the opportunities and challenges along the way.

It was a little strange having the virtues of openness sung in what was, in a way, a Microsoft keynote. HP are the only big tech company I’ve noticed with a visible involvement in IRs so far (obviously a skewed view, being a DSpacer by trade). Sun on the hardware side, I suppose. I still don’t really understand what MS intend to do in the IR area; where between “do some good and maybe sell some licenses on the way” and “nice sector, we’ll take it” they want to go. Tony, being a fairly recent MS acquisition, pointed out that MS wasn’t fundamentally anti-OS, just anti-copyleft. Personally I’m fine with that – I go for “‘Free’ as in ‘free’” as well, but I wonder how the ePrints guys were feeling at that point.

Finally, to close things, the ever-charismatic Les Carr announced next year’s Open Repositories conference in Southampton. See you there!


