I’m happy to report that the code for the “Chemistry Add-In for Word” (from the Chem4Word project) has been released under an Open Source license.

At the start of Chem4Word the agreement was that the non-Word-specific bits of the project would be released as Open Source, and that the rest of the add-in would be made freely available, but not as Open Source. Even then, though, Microsoft (and especially Microsoft Research) had evidently started a remarkable reorientation with respect to Open Source licenses, and I’m proud that Chem4Word is one of the first fruits of that.

Congratulations to everyone involved in the programming of this great software (especially to Joe), and thanks to Alex Wade, Oscar Naim and Lee Dirks at Microsoft Research for pushing the opening of C4W forward.

I’ve been at Hinxton Hall yesterday and today, for the CDK workshop 2009. I was very impressed: a very engaged and engaging bunch of attendees, and the facilities and setting were absolutely top notch. They even booked perfect April weather where the spring morning mist dissolves to bathe the idyllic cobble-and-thatch villages in clean, mellow sunlight. All very poetic and inspiring. Reminded me why I like living round here.

I digress.

The workshop ended with an unconference element, in which my proposed theme of chatting about Clojure proved popular enough to run. I gave an unstructured, as-it-occurred-to-me introduction to various aspects of Clojure, then pointed folks at the install docs and answered questions as best I could. My personal agenda was to work out how to use CDK to read in CML molecules, generate 2D co-ordinates and then render them to a PNG file. It proved to be … involved … and I wouldn’t have prevailed without the assistance of Gilleain Torrance (so thank you again, Gilleain!). It’s a little rough on the Clojure side, and understandably heavy on Java interop features. As far as I understand it, it will only work on the “jchempaint-primary” branch in CDK; let’s hope that gets merged with trunk in time for the 1.2.2 release. Here’s the code: –

;; render2png

(import '(org.openscience.cdk ChemFile Molecule)
        '(org.openscience.cdk.layout StructureDiagramGenerator)
        '(org.openscience.cdk.io CMLReader)
        '(org.openscience.cdk.renderer Renderer)
        '(org.openscience.cdk.renderer.font AWTFontManager)
        '(org.openscience.cdk.renderer.generators BasicBondGenerator BasicAtomGenerator)
        '(org.openscience.cdk.renderer.visitor AWTDrawVisitor)
        '(org.openscience.cdk.tools.manipulator ChemFileManipulator)
        '(javax.imageio ImageIO)
        '(java.awt.image BufferedImage)
        '(java.awt Color Rectangle)
        '(java.io File FileInputStream)
        '(java.util ArrayList))

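(defn layout
  "Generates 2D co-ordinates for a mol and returns the laid-out molecule"
  [mol]
  ;; let keeps sdg local; def inside a defn would create a global var
  (let [sdg (StructureDiagramGenerator. mol)]
    (.generateCoordinates sdg)
    (.getMolecule sdg)))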

(defn render
  "Renders a PNG of a mol"
  [mol file]
  (let [laid-out   (layout mol)
        bimage     (BufferedImage. 300 300 BufferedImage/TYPE_BYTE_INDEXED)
        g          (.createGraphics bimage)
        generators (ArrayList.)]
    ;; paint a white background
    (doto g
      (.setBackground Color/WHITE)
      (.setColor Color/WHITE)
      (.fillRect 0 0 300 300))
    ;; bonds first, then atoms on top
    (doto generators
      (.add (BasicBondGenerator.))
      (.add (BasicAtomGenerator.)))
    (let [r (Renderer. generators (AWTFontManager.))]
      (.setScale r laid-out)
      (.paintMolecule r laid-out (AWTDrawVisitor. g) (Rectangle. 300 300) true)
      (ImageIO/write bimage "png" file))))

(defn readmol
  "Reads a CML file and returns a molecule"
  [filename]
  (let [reader (CMLReader. (FileInputStream. filename))
        cf     (.read reader (ChemFile.))]
    (Molecule. (first (ChemFileManipulator/getAllAtomContainers cf)))))

;invoke using e.g.
(render (readmol "my.cml") (File. "my.png"))

SPECTRa released

November 28, 2007

Now that a number of niggling bugs have been ironed out, we’ve released a stable version of the SPECTRa tools.

There are prebuilt binaries for spectra-filetool (command line tool helpful for performing batch validation, metadata extraction and conversion of JCAMP-DX, MDL Mol and CIF files), and spectra-sub (flexible web application for depositing chemistry data in repositories). The source code is available from the spectra-chem package, or from Subversion. All of these are available from the spectra-chem SourceForge site.

Mavenites can obtain the libraries (and source code) from the SPECTRa maven repo at http://spectra-chem.sourceforge.net/maven2/. The groupId is uk.ac.cam.spectra – browse around for artifact ids and versions.

The JISC-funded eCrystals project began a fortnight ago, and “will establish a solid foundation of crystallography data repositories across an international group of partner sites”. It’s being led by Simon Coles at the UK National Crystallographic Service, and we’re core partners along with UKOLN and the DCC. The project wiki is now available, so stick it in your aggregator to keep up to date.

eCrystals is an exciting opportunity for us to work to see the outcomes of SPECTRa and CrystalEye put into wider use – I’m looking forward to it!

I had planned to co-author a number of posts on CrystalEye with Nick Day, starting with the basic functionality in the web interface and moving on to the features in the new Atom feed. As things turned out, Nick is rather busy with other things, the data archiving stuff caught everyone’s attention and my well-laid plans ganged (gung?), as aft they do, agly (as Burns might have put it). Consequently I’m going to shove out CrystalEye posts as and when.

The point of this post is simply to demonstrate that Atom’s extensibility provides a way to combine several functionalities in the same feed, with the subtext that this makes it a promising alternative to CMLRSS. I’ve already written about how the Atom feed can be used for data harvesting. This is something of a niche feature for a minority, though. The big news about the CrystalEye Atom feed is that it looks good in normal feed readers.

As a demonstration, here’s a CrystalEye CMLRSS feed in my aggregator: –

Text. Nice. Of course, I need a chemistry aggregator (like the one in Bioclipse) to make sense of a chemistry feed, right? Nope. Atom allows HTML content, so as well as including CML enclosures for chemistry aware aggregators, you can include diagrams: –

To quote PM-R: “Look – chemistry!”

One of the features of the CrystalEye Atom feeds is that they can be used for harvesting data from the system. This is not a feature of Atom syndication itself, but of a proposed standard extension (RFC5005). So what does it look like?

RFC5005 specifies three different types of historical feed; we’re only interested at the moment in “Archived feeds”. An archived feed document must include an element like this: –

Basic harvesting is achieved extremely simply: get hold of the latest feed document from http://wwmm.ch.cam.ac.uk/crystaleye/feed/atom/feed.xml, and iterate through the entries. Each entry contains (amongst other things) a unique identifier (a URN UUID), and a link to the CML file: –


So getting the data is just a matter of doing a little XPath or DOM descent and using the link href to GET the data. When you’ve got all the entries, you need to follow a link to the previous (next oldest) feed document in the archive, encoded like this: –

(This ‘prev-archive’ rel is the special sauce added by RFC5005). Incremental harvesting is done by the same mechanism, but with a couple of extra bells and whistles to minimize bandwidth and redundant downloads. There are three ways you might do this: –

  • The first way is to keep track of all the entry IDs you’ve seen, and to stop when you see an entry you’ve already seen.
  • The easiest way is to keep track of the time you last harvested, and add an If-Modified-Since header to the HTTP requests when you harvest – when you receive a 304 (Not Modified) in return, you’ve finished the increment.
  • The most thorough way is to keep track of the ETag header returned with each file, and use it in the If-None-Match header in your incremental harvest. Again, this will return 304 (Not Modified) whenever your copy is good.
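
As a rough illustration of the second and third options, here is a minimal Python sketch that builds the conditional request headers. The function names and the example ETag value are made up for illustration; only the header names and the 304 status come from HTTP itself.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def conditional_headers(last_harvest=None, etag=None):
    """Build request headers for an incremental harvest.

    last_harvest: timezone-aware datetime of the previous harvest, or None.
    etag: the ETag value previously returned for this URL, or None.
    """
    headers = {}
    if last_harvest is not None:
        # HTTP dates are always expressed in GMT
        headers["If-Modified-Since"] = format_datetime(
            last_harvest.astimezone(timezone.utc), usegmt=True
        )
    if etag is not None:
        headers["If-None-Match"] = etag
    return headers


def is_unchanged(status_code):
    """A 304 (Not Modified) response means the local copy is still good."""
    return status_code == 304


headers = conditional_headers(
    last_harvest=datetime(2007, 11, 28, 12, 0, tzinfo=timezone.utc),
    etag='"abc123"',  # hypothetical ETag value
)
print(headers)
```

Sending both headers is harmless; a harvester that stores ETags per file can simply drop the timestamp.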

Implementing a harvester

Atom archiving is easy to code to in any language with decent HTTP and XML support. As an example, I’ve written a Java harvester (binary, source). The source builds with Maven2. The binary can be run using

java -jar crystaleye-harvester.jar [directory to stick data in]

Letting this rip for a full harvest will take a while, and will take up ~10G of space (although less bandwidth since the content is compressed).
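
For comparison, the same walk can be sketched in a few lines of Python. This is not the Java harvester's code, just a minimal sketch of the idea: read the entries, then follow the ‘prev-archive’ link until there isn't one, stopping early at any entry ID already seen. The example.org URLs, the UUIDs and the rel="enclosure" on the sample entry links are placeholders, and the fetch function is passed in so the sketch runs without touching the network.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"


def parse_feed(xml_text):
    """Return (entries, prev_archive_url) for one Atom feed document.

    entries is a list of (entry_id, [link hrefs]) tuples.
    """
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root.findall(ATOM + "entry"):
        entry_id = entry.findtext(ATOM + "id")
        hrefs = [link.get("href") for link in entry.findall(ATOM + "link")]
        entries.append((entry_id, hrefs))
    prev_url = None
    for link in root.findall(ATOM + "link"):
        if link.get("rel") == "prev-archive":
            prev_url = link.get("href")
    return entries, prev_url


def harvest(start_url, fetch, seen_ids=frozenset()):
    """Walk the archive from newest to oldest, yielding unseen entries.

    fetch is any callable mapping a URL to the feed XML as text. Stopping at
    the first already-seen ID gives the simplest incremental harvest.
    """
    url = start_url
    while url:
        entries, url = parse_feed(fetch(url))
        for entry_id, hrefs in entries:
            if entry_id in seen_ids:
                return
            yield entry_id, hrefs


# Two tiny in-memory feed documents standing in for the real server.
ID_OLD = "urn:uuid:00000000-0000-0000-0000-000000000001"
ID_NEW = "urn:uuid:00000000-0000-0000-0000-000000000002"
FEEDS = {
    "http://example.org/feed.xml": f"""<feed xmlns="http://www.w3.org/2005/Atom">
  <link rel="prev-archive" href="http://example.org/feed-1.xml"/>
  <entry>
    <id>{ID_NEW}</id>
    <link rel="enclosure" href="http://example.org/b.cml"/>
  </entry>
</feed>""",
    "http://example.org/feed-1.xml": f"""<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>{ID_OLD}</id>
    <link rel="enclosure" href="http://example.org/a.cml"/>
  </entry>
</feed>""",
}

for entry_id, hrefs in harvest("http://example.org/feed.xml", FEEDS.__getitem__):
    print(entry_id, hrefs)
```

A real harvester would swap FEEDS.__getitem__ for an HTTP fetch and GET each href as it goes.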

Being a friendly client

First and foremost, please do not multi-thread your requests.

Please put a little delay in between requests. A few hundred milliseconds should be enough; the sample harvester uses 500ms, which should be as much as we need.

If you send an HTTP header “Accept-Encoding: gzip,deflate”, CrystalEye will send the content compressed (and you’ll need to gunzip it at the client end). This can save a lot of bandwidth, which helps.
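
Put together, a friendly client might look something like the following Python sketch: one request at a time, a pause after each, and compressed transfer. The function names are hypothetical, and the deflate branch assumes the server sends the zlib-wrapped form.

```python
import gzip
import time
import urllib.request
import zlib

DELAY_SECONDS = 0.5  # the pause used by the sample harvester


def decode_body(raw, content_encoding):
    """Undo the content coding the server applied, if any."""
    encoding = (content_encoding or "").lower()
    if encoding == "gzip":
        return gzip.decompress(raw)
    if encoding == "deflate":
        # Assumption: zlib-wrapped deflate rather than a raw deflate stream
        return zlib.decompress(raw)
    return raw


def polite_get(url, delay=DELAY_SECONDS):
    """Fetch one URL sequentially, then pause before returning."""
    request = urllib.request.Request(
        url, headers={"Accept-Encoding": "gzip,deflate"}
    )
    with urllib.request.urlopen(request) as response:
        body = decode_body(
            response.read(), response.headers.get("Content-Encoding")
        )
    time.sleep(delay)  # be kind to the server before the next request
    return body
```

Calling polite_get in a plain loop, rather than from a thread pool, keeps the requests strictly sequential.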

Agents & Eyeballs

October 2, 2007

Peter has mentioned that we’ve been writing a bid to the JISC Capital Call. Well, it’s in, but no thanks at all to OpenOffice, NeoOffice or Word. I manage to avoid using word processors for most of my working life, and writing and collating this bid has been a pointed reminder why. Word 2004 for Mac wouldn’t read Word 2003 files at all and only read bits of Word XP, Word 95 etc etc etc files. I did most of the work in OpenOffice (on Linux, NeoOffice on Mac), which did its utmost to make Word look good by crashing regularly.

I wonder if any CSS implementations are up to doing paragraph numbering and pagination on HTML? Otherwise I’m going to have to re-learn latex next time!

Thanks are due, though, to those who commented on Peter’s blog, or wrote posts of their own in response. Although the JISC bids are largely marked on the quality of the bid itself, no-one who looks can doubt the community engagement and vitality, which were important components in the call for funding. So thanks to you all!

Hopefully I’ll get to write more about the project particulars in due course. We obviously don’t want to get scooped, but on the other hand this is interesting work that I’ve wanted to look at for a while, so we’ll look for other funding if we’re not successful with JISC.

Jumbo 5.4 released

September 25, 2007

The final release of Jumbo 5.4 is available from the sourceforge downloads page, or from the WWMM maven repository at http://wwmm.ch.cam.ac.uk/maven2/ with g:cml a:jumbo v:5.4

The code formerly known as CIFDOM is now in its new home at https://cml.svn.sourceforge.net/svnroot/cml/cifxml/trunk . The package for the library is now org.xmlcml.cif (rather than uk.co.demon.ursus.cif).

Maven users can obtain the new package from the WWMM repo at http://wwmm.ch.cam.ac.uk/maven2/ groupId:cml artifactId:cifxml version:1.2-SNAPSHOT.

The latest CIFDOM code is still available from sourceforge or the WWMM repo g:cml a:cifdom v:1.1

Why InChIKey?

September 12, 2007

Egon Willighagen has posted on the release of the latest InChI software. Egon (and others) are concerned about the implementation, especially that InChIKeys aren’t guaranteed unique. At a more basic level, I’m wondering whether people agree with the stated needs for InChIKey.

  • Facilitate web searching. Even though Google are coping with InChI very well, having a representation of InChI that didn’t break standard tokenization routines, and that could be attractively included in prose, would be handy.
  • Allow development of a web-based lookup service. Not really sure what’s meant here. As Egon pointed out in the comments to his post, he already has one of these, and it didn’t require InChIKey!
  • Permit an InChI representation to be stored in fixed-length fields, and make chemical structure database indexing easier. Because RDBMSs have such a hard time indexing VARCHAR? Really?
  • Allow verification of InChI strings after network transmission. This is not a problem that needs solving again – using MD5SUMs would do the same job.

I make that one out of four, and would argue that the only problems with InChI are the length of the identifiers and the issues caused by the characters used. These could be solved by having a centralized service that assigned short HTTP URLs for InChIs, ensuring a one-to-one relationship between InChIs and shorthand URLs.
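
On the verification point, here is a minimal Python sketch of what checking an InChI with an MD5 sum might look like. The function names are made up for illustration, and this says nothing about how InChIKey itself is computed.

```python
import hashlib


def inchi_checksum(inchi):
    """Return the hex MD5 digest of an InChI string."""
    return hashlib.md5(inchi.encode("utf-8")).hexdigest()


def verify_inchi(inchi, expected_checksum):
    """True if the received InChI matches the checksum sent alongside it."""
    return inchi_checksum(inchi) == expected_checksum


sent = "InChI=1/CH4/h1H4"  # methane
checksum = inchi_checksum(sent)

print(verify_inchi(sent, checksum))                # True
print(verify_inchi("InChI=1/CH4/h1H3", checksum))  # False: string was altered
```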