An extremely interesting post by Jim King (PDF architect and Senior Principal Scientist at Adobe) explains why getting text back out of PDF is a pain. In his words:

To extract text from PDF documents is a rather difficult and a highly technical task…

We’ve had plenty of problems with PDF in our data mining efforts. Leaving aside the basic problem that most of the data we care about can’t be encoded in PDF at all, we’ve had plenty of problems extracting text too. Jim explains some of the reasons why and concludes:

The PDF design is very tailored to the creator being able to quite directly and without ambiguity, specify the exact output desired. That is a strong virtue for PDF and the price of more difficult text extraction is a price worth paying for that design.

PDF is not designed primarily for text. In PDF, typography trumps text.

ODF fillip

May 22, 2008

I’ve resisted getting drawn into the OOXML scrap over on PM-R’s blog; partly because I’ve had plenty to do, and mostly because I don’t think another partly informed opinion would add much to the debate.

Our approach to text mining is necessarily pragmatic, which changes your outlook significantly (for detailed reasons why, read Peter Sefton’s blog). OOXML may be a flawed spec born of a standardisation process that left its participants disenchanted and angry. It may be that OOXML can only ever be implemented meaningfully by Word. The fact remains that most chemists, most people, use Word to create documents.

Which is why the news that Office 2007 SP2 introduces native support for ODF is brightening my day.

ODF has a potential upside in expanding interoperability, but as always, business continuity requirements will have a significant effect on our approach to these file format changes.

Gray Knowlton

When ODF became a standard I hoped Microsoft would see the business advantage of open data specs and interoperability, and start playing along. Looks like we’re getting there.

My congratulations to Peter, Peter, Colin, Richard and all involved in making Project Prospect such a success.

I’ve (very belatedly) deployed a binary of Peter Corbett’s OSCAR3 release alpha 2 to the WWMM maven2 repository. Use groupId:wwmm, artifactId:oscar, version:3a2.
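
As a rough sketch, assuming a standard Maven 2 pom.xml, those coordinates would translate into a dependency entry along these lines. The WWMM repository itself still has to be declared in your POM or settings; that part is left out here.

```xml
<!-- Sketch only: goes inside the <dependencies> element of your pom.xml. -->
<!-- Coordinates are the ones given above; the repository declaration is assumed to exist elsewhere. -->
<dependency>
  <groupId>wwmm</groupId>
  <artifactId>oscar</artifactId>
  <version>3a2</version>
</dependency>
```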

Caveat: The OSCAR jar bundles all its dependencies, so it might not play nicely if your project also uses any of them directly, including lucene, cdk and weka. I’m hoping to persuade Peter to let me mavenize OSCAR in the near future, which will sort this problem out.